PhD defense of Manvi Agarwal: Designing positional encoding with musical priors for generative applications
Télécom Paris, 19 place Marguerite Perey, F-91120 Palaiseau, amphi 2, and by videoconference
Jury
- Gaël Richard, Professor, Télécom Paris, France (Thesis supervisor)
- Changhong Wang, Postdoctoral Researcher, Télécom Paris, France (Thesis co-supervisor)
- Florence d’Alché-Buc, Professor, Télécom Paris, France (President of the jury)
- Louis Bigo, Professor, Bordeaux INP / Enseirb-Matmeca, France (Reviewer)
- Emmanouil Benetos, Reader, Queen Mary University of London, UK (Reviewer)
- Umut Şimşekli, Researcher, INRIA and École Normale Supérieure, France (Examiner)
Abstract
Attention-based Transformer generative models have reached remarkable maturity in recent years, producing high-quality, realistic samples. As a result, they have become ubiquitous in our lives through a broad range of commercial applications. For effective performance, they fundamentally require two resources: substantial compute budgets and large-scale datasets.
The first requirement, rooted partly in the fact that the computational complexity of attention scales quadratically with sequence length, raises broader concerns about, for instance, skewed socio-economic resource allocation and detrimental environmental impact. Computational challenges notwithstanding, the domain of symbolic music has rapidly adopted such generative models. However, symbolic music is a low-resource domain constrained by limited data, which stands in the way of the second requirement. Together, these factors pose an interesting scientific challenge: how can we maintain superior performance on generative symbolic music tasks with limited compute and data?
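To make the quadratic cost concrete, here is a minimal NumPy sketch of vanilla scaled dot-product attention (illustrative only, not code from the thesis): the score matrix has shape (n, n), so both memory and compute grow quadratically with the sequence length n.

```python
import numpy as np

def attention(Q, K, V):
    """Vanilla scaled dot-product attention.

    The score matrix S is (n, n): this is the quadratic
    bottleneck that linear-complexity PE methods avoid.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)               # (n, n) score matrix
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (128, 16)
```

Doubling n quadruples the size of S, which is why long sequences are expensive for standard Transformers.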
In this thesis, we address this question through empirical and theoretical approaches focused on a specific component of the Transformer architecture: positional encoding (PE). This component was introduced to break the permutation-invariant nature of the Transformer’s computation and typically represents the passage of time associated with a sequence. We explore how the design of positional encoding can be improved to meet the twin challenges of low data and low compute through the introduction of musically-relevant prior knowledge.
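The permutation-invariance that positional encoding is meant to break can be demonstrated in a few lines. The following NumPy sketch (an illustration, not code from the thesis) shows that self-attention without any PE is permutation-equivariant: shuffling the input tokens merely shuffles the outputs the same way, so the model sees no notion of order.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Self-attention with no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
out = attention(X, Wq, Wk, Wv)
out_perm = attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs just permutes the outputs identically:
# without a PE, token order carries no information.
print(np.allclose(out[perm], out_perm))  # True
```

Injecting position (or, as in this thesis, musical structure) into the computation is what breaks this symmetry.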
At an empirical level, we first attack the low-data challenge by showing how small Transformers trained on small datasets can maintain competitive performance on symbolic music generation tasks: by infusing positional encoding with musical structure information, Structure-Informed Positional Encoding exposes the model to appropriate inductive biases. We then attack the low-compute challenge by venturing beyond the quadratic-complexity regime, introducing Fast, Structure-Informed Positional Encoding (F-StrIPE), which melds musical structure information with an existing linear-complexity relative positional encoding method, Stochastic Positional Encoding (SPE).
At a theoretical level, we base our contributions on recent work interpreting vanilla attention without positional encoding through the lens of kernels. We show not only that SPE is deeply connected to well-known kernel approximation methods, but also that linearized relative PEs such as SPE and F-StrIPE can be understood through a canonical form of PE-enriched attention. This form neatly separates attention into two component kernels operating on different spaces: one encodes input-level similarity and the other encodes prior-level similarity. We further show that other widely used efficient PE methods from the literature, such as Rotary Positional Encoding (RoPE), also fit within our framework. This enables us to comparatively study F-StrIPE and RoPE and to reveal shared computational principles among efficient PEs that seem rather different on the surface. Our unified and comparative view also allows us to propose a novel efficient positional encoding method and analyze its functional form, showing that it is richer than F-StrIPE and RoPE. Finally, we use the input-prior factorization afforded by our canonical form to provide insights into why well-designed priors help empirical performance.
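For readers unfamiliar with RoPE, the NumPy sketch below (a standard textbook formulation, not code from the thesis) illustrates the property that makes it a *relative* PE: queries and keys are rotated by position-dependent angles, so their dot product depends on positions only through the offset between them.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary positional encoding to a vector at position `pos`.

    Pairs of dimensions are rotated by angles proportional to `pos`,
    so <rope(q, m), rope(k, n)> depends only on the offset m - n.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # per-pair frequencies
    theta = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(theta) - x2 * np.sin(theta),
         x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Scores at positions (5, 2) and (10, 7) coincide: only the offset
# of 3 between query and key positions matters.
s1 = rope(q, 5) @ rope(k, 2)
s2 = rope(q, 10) @ rope(k, 7)
print(np.allclose(s1, s2))  # True
```

Because the rotation is applied independently to queries and keys, RoPE keeps attention compatible with linear-complexity factorizations, which is what places it alongside SPE and F-StrIPE in the canonical form described above.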
We close this thesis by reaching beyond the domain of symbolic music and showing that our ideas on structure-informed positional encoding also apply to tokenized audio. In particular, we introduce a new resource, the Audio909 dataset, through which we show that musically informed priors, such as information on musical structure and symbolic representations, benefit performance on music generation tasks that use tokenized audio.