
PhD defense of Aurian Quelennec: Deep learning of self-supervised audio representation for general audio and music tagging

Friday 13 February 2026, at 14:00 (Paris time), at Télécom Paris

Télécom Paris, 19 place Marguerite Perey, F-91120 Palaiseau, amphi Rose Dieng-Kuntz, and by videoconference

Jury

  • Director: Slim Essid, Research Scientist, NVIDIA, Paris
  • Co-Director: Geoffroy Peeters, Professor, Télécom Paris, Palaiseau
  • Reviewer: Emmanouil Benetos, Director of Research, Queen Mary University of London
  • Reviewer: Romain Serizel, Université de Lorraine, Nancy
  • President: Elsa Angelini, Professor, Télécom Paris, Palaiseau
  • Examiner: Rachel Bittner, Research Scientist, Spotify, Paris
  • Examiner: Gabriel Meseguer-Brocal, Deezer, Paris

Abstract

The objective of this dissertation is to develop new deep learning methods for building general-purpose audio foundation models capable of learning meaningful representations without annotated data. We first conduct a systematic analysis of both self-supervised and supervised audio foundation models within a shared evaluation protocol. In particular, we study their inference behavior and measure how robust they are to variations in input segment duration.


This analysis provides practical guidance for choosing pre-trained audio embeddings and highlights the strong robustness of Audio Spectrogram Transformer models when processing shorter audio segments. Building on these observations, we propose MATPAC, a self-supervised method based on Masked Latent Prediction combined with an unsupervised classification objective. This design encourages the model to learn both predictive and semantic structure, leading to representations that outperform previous self-supervised approaches and rival supervised models.

We then introduce an improved version, MATPAC++, which addresses the ambiguity of predicting masked content by generating multiple hypotheses and selecting among them with Multiple Choice Learning. This variant achieves state-of-the-art performance on various audio tagging tasks.

Finally, we explore how these models can be reused in different contexts. We demonstrate that their learned representations can be efficiently compressed into small models without labels, and we show that they can serve as conditioning modules in multimodal systems. In particular, we introduce TinyMU, a compact music language model that leverages MATPAC++ to achieve high-level music reasoning while remaining lightweight and deployable.
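To give a concrete sense of the Multiple Choice Learning mechanism mentioned for MATPAC++, the sketch below shows a winner-takes-all loss in which only the hypothesis closest to the target latent receives the gradient. It is a simplified illustration under assumed tensor shapes, not code from the thesis: the function name `winner_takes_all_loss`, the plain L2 error, and the toy sizes are placeholders, and the actual MATPAC++ objective additionally combines masked latent prediction with an unsupervised classification term.

```python
import torch

def winner_takes_all_loss(hypotheses: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Winner-takes-all loss for Multiple Choice Learning (illustrative sketch).

    hypotheses: (batch, n_hypotheses, dim) latent predictions for a masked region
    target:     (batch, dim) target latent from the target encoder
    """
    # L2 error of every hypothesis against the target latent
    errors = ((hypotheses - target.unsqueeze(1)) ** 2).mean(dim=-1)  # (batch, n_hypotheses)
    # Only the best hypothesis per example contributes to the loss,
    # so different hypotheses can specialise on different plausible completions.
    return errors.min(dim=1).values.mean()

# Toy usage with assumed sizes: 8 masked positions, 4 hypotheses, 256-d latents
hypotheses = torch.randn(8, 4, 256, requires_grad=True)
target = torch.randn(8, 256)
loss = winner_takes_all_loss(hypotheses, target)
loss.backward()
```

The selection step is what resolves the inherent ambiguity of masked prediction: instead of averaging over all plausible completions, each training example only reinforces the hypothesis that best matches the target.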