P1-07: Tailed U-Net: Multi-Scale Music Representation Learning
Vélez Vásquez, Marcel A*, Burgoyne, John Ashley
Subjects (starting with primary): MIR fundamentals and methodology -> music signal processing ; Musical features and properties -> representations of music ; Domain knowledge -> representations of music ; Musical features and properties -> musical style and genre ; Domain knowledge -> machine learning/artificial intelligence for music ; Musical features and properties -> musical affect, emotion and mood
Presented In-person, in Bengaluru: 4-minute short-format presentation
Self-supervised learning has steadily gained traction in recent years. In music information retrieval (MIR), one promising recent application of self-supervised learning is the CLMR framework (contrastive learning of musical representations). CLMR has shown good performance, achieving results on par with state-of-the-art end-to-end classification models, but it is strictly an encoding framework. Like any encoder, it cannot explicitly combine multi-timescale information, whereas a characteristic feature of human audio perception is that we tend to perceive all frequencies simultaneously. To this end, we propose a generalization of CLMR that learns to extract and explicitly combine representations across different frequency resolutions, which we call the tailed U-Net (TUNe). TUNe architectures combine multi-timescale information during a decoding phase, similar to the U-Net architectures used in computer vision and source separation, but add a tail that reduces sample-level information to a smaller, pre-defined number of representation dimensions. The size of the decoding phase is a hyperparameter, and with a zero-layer decoding phase, TUNe reduces to CLMR. The best TUNe architectures, however, require less training time to match CLMR performance, have superior transfer learning performance, and are competitive with state-of-the-art models even at dramatically reduced dimensionalities.
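
The data flow described in the abstract (encode to several time scales, optionally decode back up while merging scales, then apply a tail that yields a fixed number of dimensions) can be illustrated with a toy sketch. This is not the authors' implementation: every function name and parameter here is hypothetical, and the fixed averaging and repetition operations merely stand in for the learned convolutional layers a real model would use. It only shows how the size of the decoding phase changes what the tail receives.

```python
# Toy structural sketch of a "tailed U-Net" on a 1-D signal.
# ASSUMPTION: all operations are fixed stand-ins for learned layers;
# names and defaults are illustrative, not taken from the paper.

def downsample(x):
    """Halve the time resolution by averaging adjacent pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def upsample(x):
    """Double the time resolution by repeating each value."""
    return [v for v in x for _ in range(2)]

def encode(signal, depth):
    """Return the signal at every scale, finest first."""
    scales = [signal]
    for _ in range(depth):
        scales.append(downsample(scales[-1]))
    return scales

def decode(scales, decoder_layers):
    """Start at the coarsest scale and merge in finer scales via skips."""
    x = scales[-1]
    for level in range(1, decoder_layers + 1):
        skip = scales[-1 - level]          # encoder output one scale finer
        x = [(a + b) / 2 for a, b in zip(upsample(x), skip)]
    return x

def tail(x, n_dims):
    """Pool a sequence of any length down to n_dims values."""
    chunk = len(x) // n_dims
    return [sum(x[i * chunk:(i + 1) * chunk]) / chunk for i in range(n_dims)]

def tune_forward(signal, depth=3, decoder_layers=2, n_dims=4):
    """Encoder -> optional decoder -> tail."""
    assert 0 <= decoder_layers <= depth
    assert len(signal) % (2 ** depth) == 0
    scales = encode(signal, depth)
    decoded = decode(scales, decoder_layers)
    assert len(decoded) % n_dims == 0
    return tail(decoded, n_dims)
```

In this sketch, calling `tune_forward(signal, decoder_layers=0)` skips the decoder entirely, so the tail sees only the coarsest encoder output. That mirrors the abstract's remark that a zero-layer decoding phase leaves an encoder-only model. Larger values of `decoder_layers` let the fixed-size output depend on progressively finer time scales.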