Agenda

PhD defense Elio Gruttadauria: Streaming Speaker Diarization in the Wild

Monday 9 March, 2026, at 09:00 (Paris time) at Télécom Paris

Télécom Paris, 19 place Marguerite Perey, F-91120 Palaiseau [getting there], in amphi 2 and via videoconference

Jury

  • Director: Slim Essid, Research Scientist, Nvidia
  • Co-Director: Mathieu Fontaine, Associate Professor, Télécom Paris
  • Reviewer: Claude Barras, Senior Researcher, Vocapia Research
  • Reviewer: Marie Tahon, Professor, Université du Mans
  • President: Nicholas Evans, Professor, Eurécom
  • Examiner: Hervé Bredin, Chief Science Officer, PyannoteAI
  • Invited: Marc Delcroix, Distinguished Researcher, NTT Communication Science Labs

Abstract

This thesis investigates online (streaming) speaker diarization, an area that remains relatively underexplored compared to its offline counterpart. While offline diarization has benefited from increasingly sophisticated neural architectures and powerful contextual modeling, these advances do not readily transfer to streaming scenarios.

The online setting introduces additional challenges due to its strictly causal nature: counting speakers, handling overlapped speech, and maintaining robustness across diverse acoustic conditions all become significantly harder under causal inference.
In such settings, the system has access only to past information, which makes every decision irreversible and heightens the effect of early errors.
Nonetheless, streaming diarization is essential for latency-sensitive applications such as hearing aids, smart audio devices, and real-time transcription services, motivating the need for improved modeling techniques that balance accuracy, stability, and computational efficiency.

Traditional online diarization approaches typically rely on incremental clustering methods, which are difficult to optimize in an end-to-end fashion and often incur performance trade-offs.
Such methods frequently depend on heuristic decisions, including how to represent speakers, when to create new clusters, and how to handle ambiguous boundary cases.
As a result, they may drift over time or respond poorly to rapid speaker changes.
Although end-to-end neural systems have been proposed for online speaker diarization, they generally depend on large context windows and buffer sizes to maintain performance, limiting their practicality in embedded or low-latency environments where memory and compute budgets are constrained.
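To make the heuristic dependence concrete, here is a minimal sketch of threshold-based incremental clustering of the kind the abstract alludes to. The similarity threshold, the cosine metric, and the running-centroid update are all illustrative assumptions, not the thesis's method; the point is that each decision is hand-tuned and irreversible.

```python
import numpy as np

def online_cluster(embeddings, threshold=0.6):
    """Toy incremental clustering: assign each speaker embedding to the
    most similar centroid (cosine similarity), or open a new cluster when
    similarity falls below a hand-tuned threshold. Illustrative only."""
    centroids, counts, labels = [], [], []
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        if centroids:
            sims = [float(c @ e) for c in centroids]
            best = int(np.argmax(sims))
        if centroids and sims[best] >= threshold:
            # Irreversible decision: fold the frame into the running centroid.
            counts[best] += 1
            centroids[best] += (e - centroids[best]) / counts[best]
            centroids[best] /= np.linalg.norm(centroids[best])
            labels.append(best)
        else:
            # Another heuristic decision: spawn a new speaker cluster.
            centroids.append(e.copy())
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

Because none of these choices are differentiable, the threshold and update rule cannot be optimized end-to-end, and a single early misassignment propagates to all later frames.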

To address these limitations, this thesis explores several strategies to enhance sliding-window-based models.
First, we investigate the integration of speech separation into the front end of the diarization pipeline, enabling more reliable handling of overlapped speech, a persistent source of errors in both offline and online systems.
We also introduce novel data augmentation schemes tailored to the streaming paradigm, designed to expose models to a broader range of temporal and acoustic variability during training.
Complementing these augmentations, we examine the role of self-supervised learning (SSL) features, which have recently shown remarkable generalization capabilities across domains and tasks; their incorporation offers a promising route toward models that remain stable even under unseen acoustic conditions.
Together, these components aim to build a more resilient and adaptable online diarization framework.

The central contribution of this work is the development of an online neural clustering system, a lightweight diarization framework that tightly integrates end-to-end modeling with an online and fully differentiable clustering mechanism.
Unlike conventional online clustering methods, the proposed system does not rely on external heuristics or tuned thresholds.
Instead, clustering behavior is learned jointly with the rest of the model, enabling it to adaptively structure speaker representations as the audio stream unfolds.
A key strength of the proposed approach is its balanced trade-off between accuracy and computational efficiency.
Notably, its performance degrades only marginally even when no overlap is shared between consecutive chunks, demonstrating robustness to abrupt speaker transitions.
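As an illustration of what a threshold-free, differentiable clustering step can look like, the sketch below softly assigns each frame embedding to a fixed bank of speaker slots via a softmax over similarities and updates the slots with a soft, assignment-weighted rule. The slot count, update rate, and temperature are assumptions made for the example; this is a generic differentiable-clustering pattern, not the system proposed in the thesis.

```python
import numpy as np

def softmax(x, temp=0.1):
    z = x / temp
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_online_clustering(chunks, slots, rate=0.5):
    """Minimal sketch of threshold-free online clustering: frames arrive
    chunk by chunk, each is softly assigned to speaker slots, and slots
    drift toward the frames assigned to them. Every operation is smooth,
    so in an autodiff framework the whole loop could be trained jointly
    with the embedding network. Illustrative assumptions throughout."""
    slots = np.asarray(slots, dtype=float)
    assignments = []
    for chunk in chunks:                      # causal, chunk-by-chunk stream
        for e in np.asarray(chunk, dtype=float):
            e = e / np.linalg.norm(e)
            s = slots / np.linalg.norm(slots, axis=1, keepdims=True)
            a = softmax(s @ e)                # soft, threshold-free assignment
            assignments.append(a)
            # Soft update: each slot moves toward e in proportion to a.
            slots += rate * a[:, None] * (e[None, :] - slots)
    return np.array(assignments), slots
```

Because the assignment is a softmax rather than a hard thresholded decision, gradients flow through the clustering step, which is what allows such behavior to be learned jointly with the rest of the model.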
Compared with prior work, the system is competitive with state-of-the-art methods in the two-speaker conversational telephone speech (CTS) domain.
Beyond its standalone performance, we introduce a fusion strategy that combines the proposed system with a traditional clustering method, improving the trade-off between performance and robustness to clustering errors in the wild.
This hybrid approach leverages the complementary strengths of learned and heuristic components, yielding a more stable diarization pipeline that can generalize effectively across datasets.
Collectively, these contributions aim to advance the frontier of online speaker diarization, offering practical and theoretically grounded solutions for real-time speech technologies.