P3-01: Raga Classification From Vocal Performances Using Multimodal Analysis

Clayton, Martin; Rao, Preeti*; Shikarpur, Nithya Nadig; Roychowdhury, Sujoy; Li, Jin

Subjects (starting with primary): Evaluation, datasets, and reproducibility -> novel datasets and use cases ; Domain knowledge -> computational ethnomusicology ; MIR tasks -> automatic classification

Presented In-person, in Bengaluru: 10-minute long-format presentation


Work on musical gesture and embodied cognition suggests a rich complementarity between audio and movement information in musical performance. Unlike motion capture, pose estimation algorithms now make it possible to collect rich movement information from unconstrained performances of indefinite length. Vocal performances of Indian art music offer the opportunity to carry out multimodal analysis using this information, combining musicians' body movements (i.e., pose and gesture data) with audio features. In this work we investigate raga identification from 12 s excerpts drawn from a dataset of 3 singers and 9 ragas, using a combination of audio and visual representations that are each semantically salient on their own. While gesture-based classification is relatively weak by itself, we show that combining latent representations from the pre-trained unimodal networks can surpass the already high performance obtained with audio features alone.
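The idea of combining latent representations from two unimodal networks can be illustrated with a minimal sketch. The abstract does not specify the fusion architecture, so everything here is an assumption made for illustration: the latent dimensions (16 for audio, 8 for gesture), the synthetic data, the function names, and the nearest-centroid classifier that stands in for whatever fusion head the authors trained.

```python
import numpy as np

def fuse_latents(audio_latent, gesture_latent):
    """Concatenate per-excerpt latent vectors from two unimodal networks."""
    audio_latent = np.asarray(audio_latent, dtype=float)
    gesture_latent = np.asarray(gesture_latent, dtype=float)
    if audio_latent.shape[0] != gesture_latent.shape[0]:
        raise ValueError("both modalities must describe the same excerpts")
    return np.concatenate([audio_latent, gesture_latent], axis=1)

def fit_centroids(features, labels, n_classes):
    """Mean fused vector per class (a stand-in for a trained fusion head)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])

def predict(features, centroids):
    """Assign each excerpt to the class with the nearest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Synthetic stand-ins for latents extracted from pre-trained unimodal networks.
rng = np.random.default_rng(0)
n_ragas, per_class = 9, 20                     # 9 ragas as in the abstract; 20 is hypothetical
labels = np.repeat(np.arange(n_ragas), per_class)
audio = rng.normal(size=(labels.size, 16)) + labels[:, None]   # hypothetical audio latents
gesture = rng.normal(size=(labels.size, 8))                    # hypothetical gesture latents

fused = fuse_latents(audio, gesture)
centroids = fit_centroids(fused, labels, n_ragas)
pred = predict(fused, centroids)
print("fused shape:", fused.shape, "training accuracy:", (pred == labels).mean())
```

Concatenation keeps each modality's representation intact and leaves it to the downstream classifier to weight them, which is one simple way a weaker modality can still contribute complementary information.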
