LV-49: MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning

Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Chenghua Lin, Xingran Chen, Anton Ragni, Hanzhi Yin, Zhijie Hu, Haoyu He, Emmanouil Benetos, Norbert Gyenge, Ruibo Liu, Jie Fu

Abstract: The deep learning community has witnessed exponentially growing interest in self-supervised learning (SSL). However, how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner remains unexplored. In this work, we design MAP-Music2Vec, a framework that explores different SSL algorithmic components and tricks for music audio. Our model achieves results comparable to those of the state-of-the-art (SOTA) music SSL model Jukebox while being significantly more lightweight, with less than 2% of the latter's parameters. The model will be released on Hugging Face.