PhD defense Arturo Castellanos: Data Metrics with Statistical Depths and Kernel Methods
Wednesday 25 March, 2026, at 14:00 (Paris time) at Télécom Paris
Télécom Paris, 19 place Marguerite Perey, F-91120 Palaiseau, amphi Estaunié, and by videoconference
Jury
Bharath Sriperumbudur, Professor, Pennsylvania State University (Reviewer)
Eustasio del Barrio, Professor, Universidad de Valladolid (Reviewer)
Cristina Butucea, Professor, CREST, ENSAE, IP Paris (Examiner)
Alain Célisse, Professor, Université Paris 1 Panthéon-Sorbonne (Examiner)
Abdelaati Daouia, Professor, Toulouse School of Economics (Examiner)
Pavlo Mozharovskyi, Professor, Télécom Paris (LTCI), IP Paris (Thesis supervisor)
Florence d’Alché-Buc, Professor, Télécom Paris (LTCI), IP Paris (Guest)
Anna Korba, Assistant Professor, CREST, ENSAE, IP Paris (Guest)
Abstract
Data is ubiquitous nowadays, as more and more of the real world becomes virtualized. This raises the question: how well can we understand and picture data? “Mathematics and the Picturing of Data” was the title of John Tukey's seminal 1975 work that started the field of statistical data depth, and it suggests, of course, relying on mathematical techniques while keeping the simplicity of a picture.
The challenge is to keep the balance between fidelity to the complexity of the data and the simplicity needed to be understandable by the human mind. In computational statistics, this challenge often translates into using computational power to analyze a finite number of samples that should approximate an underlying distribution well, in order to compute relevant characteristics of that distribution. Because computational complexity is bound to increase with the number of samples, a trade-off appears in which the statistical approximation error is hoped to decrease fast enough with the number of samples. In this work, we design tools to better understand data under the constraint that their sample complexity should not grow too high. First, we focus on designing new data depth functions through two kinds of extensions of the Tukey depth. One extension changes its geometric nature, drawing on kernel methods to replace the classical Euclidean inner product. The other views the Tukey depth as a classification risk and changes the space of classifiers as well as the loss function. In both cases, we guarantee parametric rates for these non-parametric methods. In another part, we consider understanding distributions by looking at distances that discriminate between them. The Wasserstein distance from optimal transport is a well-known distance for quantifying the difference between two distributions; however, its sample complexity suffers from the curse of dimensionality. On the other hand, we show that the Maximum Mean Discrepancy (MMD), a kernel-based distance enjoying good sample complexity, is sometimes too lenient for discriminating distributions, and we propose a distance in between these two that has higher discriminating power than the MMD at the cost of a slight increase in sample complexity.
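To fix ideas on the classical notion being extended, here is a minimal sketch of the empirical Tukey (halfspace) depth, approximated over random projection directions; this is the textbook definition, not the kernelized or classification-risk variants proposed in the thesis, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # sample from a 2-D standard normal

def tukey_depth(x, X, n_dirs=500, rng=rng):
    """Approximate the halfspace (Tukey) depth of point x w.r.t. sample X:
    the minimum, over directions u, of the fraction of sample points lying
    in the closed halfspace {y : <u, y> >= <u, x>}."""
    d = X.shape[1]
    U = rng.normal(size=(n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # unit directions
    proj_X = X @ U.T          # shape (n, n_dirs): projections of the sample
    proj_x = x @ U.T          # shape (n_dirs,): projections of the query point
    frac = (proj_X >= proj_x).mean(axis=0)
    return frac.min()

# A central point is deep (depth near 1/2); a far-away point is shallow (near 0).
depth_center = tukey_depth(X.mean(axis=0), X)
depth_far = tukey_depth(np.array([5.0, 5.0]), X)
print(depth_center, depth_far)
```

The minimum over finitely many random directions only upper-bounds the true depth, but in low dimension a few hundred directions already separate central from peripheral points clearly.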
All of these methods can be used to prevent data contamination from ruining our statistical analyses, whether by detecting outliers through data depth scores or by enhancing the discrimination of distributions via distances.