Learning definitive embeddings for symbolic music remains a fundamental challenge in deep-learning-based computational musicology. Analogous to natural language, a piece of music can be modeled as a sequence of tokens, which has motivated most existing solutions to adapt word embedding models from natural language processing (e.g., skip-gram and CBOW) to build music embeddings. However, music differs from natural language in two key aspects: (1) a musical token is multi-faceted, comprising pitch, rhythm, and dynamics information simultaneously; and (2) musical context is two-dimensional, since each musical token depends on surrounding tokens in both the melodic (horizontal) and harmonic (vertical) dimensions.
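To make these two properties concrete, the following minimal Python sketch (our own illustration, not code from the paper) represents a note event as a multi-faceted token and splits its neighborhood into melodic and harmonic contexts; the NoteEvent fields, the beat-window size, and the same-onset definition of harmonic context are simplifying assumptions.

from dataclasses import dataclass

# Illustrative token structure: one note event carries all three facets.
@dataclass(frozen=True)
class NoteEvent:
    pitch: int       # pitch facet, e.g. a MIDI note number
    duration: float  # rhythm facet, in beats (simplified)
    velocity: int    # dynamics facet, e.g. a MIDI velocity
    onset: float     # onset time in beats
    voice: int       # which part/track the note belongs to

def contexts(target, piece, window=2.0):
    """Split the neighbours of `target` into a melodic (horizontal)
    and a harmonic (vertical) context, per the description above."""
    melodic = [n for n in piece
               if n is not target
               and n.voice == target.voice
               and abs(n.onset - target.onset) <= window]
    harmonic = [n for n in piece
                if n is not target
                and n.voice != target.voice
                and n.onset == target.onset]  # sounding simultaneously
    return melodic, harmonic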
In this work, we provide a comprehensive solution by proposing a novel framework named PiRhDy that seamlessly integrates pitch, rhythm, and dynamics information. Specifically, PiRhDy adopts a hierarchical strategy with two steps: (1) token (note event) modeling, which represents pitch, rhythm, and dynamics separately and then integrates them into a single token embedding; and (2) context modeling, which uses melodic and harmonic knowledge to train the token embeddings. To examine our method, we conduct a thorough ablation study of PiRhDy's components and strategies. We further validate our embeddings on three downstream tasks: melody completion, accompaniment suggestion, and genre classification. We demonstrate that PiRhDy embeddings significantly outperform baseline methods.
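As one plausible reading of this two-step hierarchy, here is a minimal PyTorch sketch; it is our own illustration under stated assumptions, not the paper's architecture: the facet vocabulary sizes, the embedding dimension, the concatenate-and-project fusion, and the dot-product context score are all assumed for illustration.

import torch
import torch.nn as nn

class TokenModel(nn.Module):
    """Step 1: embed pitch, rhythm and dynamics separately, then fuse
    them into one token embedding (fusion choice is illustrative)."""
    def __init__(self, n_pitch=128, n_rhythm=64, n_dynamics=32, dim=96):
        super().__init__()
        self.pitch = nn.Embedding(n_pitch, dim)
        self.rhythm = nn.Embedding(n_rhythm, dim)
        self.dynamics = nn.Embedding(n_dynamics, dim)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, pitch_id, rhythm_id, dynamics_id):
        facets = torch.cat([self.pitch(pitch_id),
                            self.rhythm(rhythm_id),
                            self.dynamics(dynamics_id)], dim=-1)
        return self.fuse(facets)

class ContextModel(nn.Module):
    """Step 2: train token embeddings by scoring a target token against
    tokens from its melodic and harmonic contexts, skip-gram style."""
    def __init__(self, token_model):
        super().__init__()
        self.tokens = token_model

    def forward(self, target_ids, context_ids):
        # Each argument is a (pitch_id, rhythm_id, dynamics_id) tuple of tensors.
        t = self.tokens(*target_ids)
        c = self.tokens(*context_ids)
        return (t * c).sum(dim=-1)  # compatibility score for each pair

In a skip-gram-style setup, these scores would be pushed up for observed (target, context) pairs and down for negatively sampled ones, e.g. with torch.nn.functional.binary_cross_entropy_with_logits.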
License type:
Funding Info:
This work was supported in part by the Ministry of Education Humanities and Social Science project under Grant 16YJC790123, and by the National Research Foundation, Singapore, under its International Research Centres in Singapore Funding Initiative.
Description:
Open Access article. This conference paper was selected as Best Paper by the ACM Multimedia 2020 committee.