PhD defense Arturo Castellanos: Data Metrics with Statistical Depths and Kernel Methods
Wednesday 25 March, 2026, at 14:00 (Paris time) at Télécom Paris
Télécom Paris, 19 place Marguerite Perey, F-91120 Palaiseau, amphi Estaunié, and by videoconference
Jury
Bharath Sriperumbudur, Professor, Pennsylvania State University (Reviewer)
Eustasio del Barrio, Professor, Universidad de Valladolid (Reviewer)
Cristina Butucea, Professor, CREST, ENSAE, IP Paris (Examiner)
Alain Célisse, Professor, Université Paris 1 Panthéon-Sorbonne (Examiner)
Abdelaati Daouia, Professor, Toulouse School of Economics (Examiner)
Pavlo Mozharovskyi, Professor, Télécom Paris (LTCI), IP Paris (Thesis supervisor)
Florence d’Alché-Buc, Professor, Télécom Paris (LTCI), IP Paris (Guest)
Anna Korba, Assistant Professor, CREST, ENSAE, IP Paris (Guest)
Abstract
Data is ubiquitous nowadays, as more and more of the real world becomes virtualized. This raises the question: how well can we understand and picture data? “Mathematics and the Picturing of Data” was the title of John Tukey's seminal 1975 work that started the field of statistical data depth, and it suggests, of course, relying on mathematical techniques while keeping the simplicity of a picture.
The challenge is to keep the balance between fidelity to the complexity of the data and the simplicity needed to be understandable by the human mind. In computational statistics, this challenge often translates into using computational power to analyze a finite number of samples that should approximate an underlying distribution well, in order to compute relevant characteristics of that distribution. Because computational complexity is bound to increase with the number of samples, a trade-off appears in which the statistical approximation error is hoped to decrease fast enough with the number of samples. In this work, we design tools to better understand data under the constraint that their sample complexity should not grow too high. First, we focus on designing new data depth functions through two kinds of extensions of the Tukey depth. One extension changes its geometric nature, drawing on kernel methods to replace the classical Euclidean inner product. The other views the Tukey depth as a classification risk and changes the space of classifiers as well as the loss function. In both cases, we guarantee parametric rates for these non-parametric methods. In another part, we consider understanding distributions by looking at distances that discriminate between them. The Wasserstein distance from optimal transport is a well-known distance for quantifying the difference between two distributions; however, its sample complexity suffers from the curse of dimensionality. On the other hand, we show that the Maximum Mean Discrepancy (MMD), a kernel-based distance enjoying good sample complexity, is sometimes too lenient for discriminating distributions, and we propose a distance in between these two that has higher discriminating power than the MMD at the cost of a slight increase in sample complexity.
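To fix ideas on the classical notion being extended, here is a minimal sketch of the empirical Tukey (halfspace) depth, approximated over random projection directions; this is the textbook definition, not the kernelized or classification-risk variants proposed in the thesis, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # sample from a 2-D standard normal

def tukey_depth(x, X, n_dirs=500, rng=rng):
    """Approximate the halfspace (Tukey) depth of point x w.r.t. sample X:
    the minimum, over directions u, of the fraction of sample points lying
    in the closed halfspace {y : <u, y> >= <u, x>}."""
    d = X.shape[1]
    U = rng.normal(size=(n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj_X = X @ U.T          # shape (n, n_dirs): projections of the sample
    proj_x = x @ U.T          # shape (n_dirs,): projections of the query point
    frac = (proj_X >= proj_x).mean(axis=0)
    return frac.min()

# A central point is deep (depth near 1/2); a far-away point is shallow (near 0).
depth_center = tukey_depth(X.mean(axis=0), X)
depth_far = tukey_depth(np.array([5.0, 5.0]), X)
print(depth_center, depth_far)
```

The minimum over finitely many random directions only upper-bounds the true depth, but in low dimension a few hundred directions already separate central from peripheral points clearly.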
All of these methods can be used to prevent data contamination from ruining our statistical analyses, whether by detecting outliers through data depth scores or by enhancing the discrimination of distributions via distances.