Data Science Seminar

The LTCI Data Science Seminar is a joint research seminar between the DIG and S2A teams. It focuses on machine learning and data science topics.

All talks of the seminar are given in English and take place at Télécom Paris (46 rue Barrault, Paris 13e).


June 2019

June 6, 2019

The seminar took place from 2PM to 4PM (room C49), and featured two talks:

Talk 1: Pavlo Mozharovskyi (Télécom Paris): On algorithms for computation of the Tukey depth

You can download the slides of this talk.

Abstract: We consider the representation of objects of the real world in a Euclidean space using their quantitative properties as coordinates, which is referred to as multivariate data. In the presence of several coordinates (or variables), no natural way to compare the objects (or observations) exists. In 1975, John W. Tukey suggested a novel way of ordering multivariate data according to their centrality. To achieve this, he defined what is nowadays known as the Tukey depth (also called halfspace or location depth), which measures how deep an observation lies in the data cloud, thus ranking the data by their degree of centrality, and yields a family of trimmed convex regions in Euclidean space. Being visual, affine invariant/equivariant and robust, Tukey depth and regions are useful tools in nonparametric multivariate analysis. However, their practical use in applications has so far been impeded by the lack of efficient computational procedures when the number of variables exceeds two.

First, we suggest a theoretical framework for computing the exact value of the halfspace depth of a point w.r.t. a data cloud in the Euclidean space of arbitrary dimension. Based on this framework, a whole class of algorithms can be derived while three variants of this class are studied in more detail. All of these algorithms are capable of dealing with data that are not in general position and even with data that contain ties. As our simulations show, all proposed algorithms prove to be very efficient.

Second, using similar ideas, we construct an algorithm for computing a Tukey trimmed region that is much faster than the known ones. Also, a strict bound on the number of facets of a Tukey region is derived. We explore both the speed and the precision of the algorithm in a large simulation study. Finally, the approach is extended to an algorithm that calculates the innermost Tukey region and its barycenter, the Tukey median.
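As an illustration of the definition above, here is a minimal Python sketch (not one of the speaker's algorithms) that approximates the Tukey depth of a point by sampling random projection directions: the exact depth is the minimum, over all closed halfspaces whose boundary passes through the point, of the fraction of data points they contain, and sampling directions yields an upper bound that tightens as the number of directions grows.

    import numpy as np

    def approx_tukey_depth(x, data, n_dirs=10_000, seed=0):
        """Upper bound on the Tukey depth of x w.r.t. the rows of `data`."""
        rng = np.random.default_rng(seed)
        u = rng.standard_normal((n_dirs, data.shape[1]))
        u /= np.linalg.norm(u, axis=1, keepdims=True)   # random unit directions
        proj = (data - x) @ u.T                          # shape (n_points, n_dirs)
        counts = (proj >= 0).sum(axis=0)                 # points in each halfspace through x
        return counts.min() / data.shape[0]

    rng = np.random.default_rng(1)
    cloud = rng.standard_normal((500, 3))
    print(approx_tukey_depth(np.zeros(3), cloud))        # near 1/2: a deep, central point
    print(approx_tukey_depth(np.full(3, 3.0), cloud))    # near 0: an outlying point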

Bio: Pavlo Mozharovskyi joined Télécom Paris as an Assistant Professor in 2018. After finishing his studies in automation control and informatics at the Kyiv Polytechnic Institute, he obtained a PhD degree at the University of Cologne in 2014, where he conducted research in nonparametric and computational statistics and classification. He then spent a year as a postdoc at Agrocampus Ouest in Rennes with the Centre Henri Lebesgue, working on imputation of missing values, before joining the CREST laboratory at the National School of Statistics and Information Analysis. His main research interests lie in the areas of statistical data depth functions, classification, computational statistics, robust statistics, missing values, and data envelopment analysis.

Talk 2: Silviu Maniu (Université Paris-Sud): An Experimental Study of the Treewidth of Real-World Data

You can download the slides of this talk.

Abstract: Treewidth is a parameter that measures how tree-like a data instance is, and whether it can reasonably be decomposed into a data structure resembling a tree. Many computation tasks are known to be tractable on data having small treewidth, but computing the treewidth of a given instance is intractable. This talk presents the first large-scale experimental study of treewidth and tree decompositions of real-world data, with a focus on graph data. We aim to find out which data, if any, can benefit from the wealth of algorithms for data of small treewidth. For each dataset, we obtain upper and lower bounds on its treewidth, and study the properties of its tree decompositions. We show in particular that, even when treewidth is high, using partial tree decompositions can result in data structures that can assist algorithms.
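As a rough illustration of how such upper bounds can be obtained in practice, the following sketch uses NetworkX's min-degree elimination heuristic, a standard heuristic for bounding treewidth from above; it is not the experimental pipeline used in the study.

    import networkx as nx
    from networkx.algorithms.approximation import treewidth_min_degree

    G = nx.karate_club_graph()           # small stand-in for a real-world dataset
    width, decomposition = treewidth_min_degree(G)
    print(f"treewidth upper bound: {width}")
    print(f"number of bags: {decomposition.number_of_nodes()}")
    largest_bag = max(decomposition.nodes, key=len)
    print(f"largest bag size: {len(largest_bag)}")   # width = largest bag size - 1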

Bio: Silviu Maniu has been an associate professor of computer science in the LaHDAK team of LRI at Université Paris-Sud in Orsay, France, since September 2015. Before that, he was a researcher in Noah’s Ark Lab of Huawei in Hong Kong, and, between 2012 and 2014, a Postdoctoral Fellow in the Department of Computer Science of the University of Hong Kong. He received his Ph.D. degree in computer science from Télécom Paris in 2012, and his Dipl.-Ing. degree from the Politehnica University of Timisoara, Romania, in 2005. His research interests are centered on data mining on the Web, with a focus on uncertain and social data.


March 2019

March 14, 2019

The seminar took place from 2PM to 4PM (room C49), and featured two talks:

Talk 1: Thomas Bonald (Télécom Paris): Hierarchical graph clustering: quality metrics and algorithms

You can download the slides of this talk.

Abstract: This talk is about hierarchical clustering for graph-structured data. We present a novel agglomerative algorithm as well as a quality metric based on the sampling distribution of node pairs induced by the graph. A fundamental property of the proposed metric is that it is interpretable in terms of graph reconstruction: it is exactly the information loss induced by the representation of the graph as a tree. It is applicable to any tree and, in particular, to a tree of height 2 corresponding to a usual graph clustering, i.e., a partition of the set of nodes. It can also be used to compress the binary tree returned by any hierarchical clustering algorithm so as to get a compact, meaningful representation of the hierarchical structure of the graph. The results are illustrated on both synthetic and real datasets.
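To make the starting point concrete, here is a small illustrative sketch (the helper names are ours, not the speaker's) of the node-pair sampling distribution induced by a weighted graph, p(i, j) proportional to the edge weight A_ij, together with the probability mass it assigns to intra-cluster pairs for a height-2 tree, i.e., a flat clustering.

    import numpy as np

    def pair_distribution(adjacency):
        """Probability of sampling the node pair (i, j), proportional to edge weight."""
        return adjacency / adjacency.sum()

    def intra_cluster_mass(adjacency, labels):
        """Total pair-sampling probability of pairs falling inside the same cluster."""
        p = pair_distribution(adjacency)
        same = labels[:, None] == labels[None, :]
        return p[same].sum()

    # Two triangles joined by a single edge: a natural 2-cluster graph.
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1.0
    labels = np.array([0, 0, 0, 1, 1, 1])
    print(intra_cluster_mass(A, labels))   # ~0.86: most of the mass is intra-cluster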

Talk 2: Alessandro Rudi (Inria Sierra): Structured prediction via implicit embeddings

You can download the slides of this talk.

Abstract: In this talk we analyze a regularization approach for structured prediction problems. We characterize a large class of loss functions that allows structured outputs to be naturally embedded in a linear space. We exploit this fact to design learning algorithms using a surrogate loss approach and regularization techniques. We prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed methods. Experimental results are provided to demonstrate the practical usefulness of the proposed approach.
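A minimal sketch of the surrogate approach, under illustrative assumptions (a Gaussian kernel on inputs, a small finite candidate set of outputs, and the Hamming loss): training reduces to kernel ridge regression, and prediction decodes by minimizing the weighted empirical loss, f(x) = argmin_y sum_i alpha_i(x) L(y, y_i). The model choices here are ours, not the speaker's exact setup.

    import numpy as np

    def gaussian_kernel(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def fit_alpha(X_train, lam=1e-3):
        # Kernel ridge regression weights: alpha(x) = k(x, X) (K + n*lam*I)^{-1}.
        K = gaussian_kernel(X_train, X_train)
        n = len(X_train)
        M = np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))
        return lambda x: gaussian_kernel(x[None, :], X_train) @ M

    def predict(x, alpha_fn, Y_train, candidates, loss):
        a = alpha_fn(x).ravel()
        # Decode: pick the candidate output minimizing the weighted empirical loss.
        scores = [sum(ai * loss(y, yi) for ai, yi in zip(a, Y_train)) for y in candidates]
        return candidates[int(np.argmin(scores))]

    # Toy structured outputs: pairs of bits, compared with the Hamming loss.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 2))
    Y = [(x[0] > 0, x[1] > 0) for x in X]
    hamming = lambda y, z: sum(a != b for a, b in zip(y, z))
    candidates = [(a, b) for a in (False, True) for b in (False, True)]
    alpha = fit_alpha(X)
    print(predict(np.array([1.0, -1.0]), alpha, Y, candidates, hamming))  # likely (True, False)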


December 2018

December 11, 2018

The seminar took place from 2PM to 3PM (room C48), and featured one talk:

Talk: Shai Ben-David (University of Waterloo): Settling the sample complexity of learning mixtures of Gaussians via sample compression schemes

You can download the slides of this talk.

Abstract: Estimating distributions from observed data is a fundamental task in statistics that has been studied for more than a century. This task also arises in applied machine learning. In many scenarios it is common to assume that the distribution can be modelled using a mixture of Gaussians.

We prove that Θ(kd^2/ε^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error ε in total variation distance. This improves both the known upper bounds and lower bounds for this problem. For mixtures of axis-aligned Gaussians, we show that O(kd/ε^2) samples suffice, matching a known lower bound. The upper bound rests on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that admits such a sample compression scheme can be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d admits an efficient sample compression.

The talk is based on joint work with Hassan Ashtiani, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian and Yaniv Plan.

This work received a Best Paper Award at NeurIPS 2018.
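To give a feel for the statement, the sketch below instantiates the Θ(kd^2/ε^2) rate numerically (with the unspecified constant set to 1) and fits a mixture with scikit-learn; it illustrates the learning task only, not the paper's compression scheme.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    k, d, eps = 3, 2, 0.1
    n = int(k * d**2 / eps**2)          # ~ the minimax rate, constant set to 1
    rng = np.random.default_rng(0)
    means = rng.standard_normal((k, d)) * 4
    X = np.vstack([rng.multivariate_normal(m, np.eye(d), n // k) for m in means])
    gmm = GaussianMixture(n_components=k).fit(X)
    print(gmm.means_.round(2))          # recovered centers, up to permutation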


November 2018

November 29, 2018

The seminar took place from 2PM to 4PM (room C48), and featured two talks:

Talk 1: Rodrigo Mello (University of Sao Paulo): Steps, concepts and issues involved in providing learning guarantees in the Deep Learning scenario (especially Convolutional Neural Networks)

You can download the slides of this talk.

Abstract: The aim of this one-hour talk is to start from the generalization bound of Statistical Learning Theory, estimate the shattering coefficient of a single-hyperplane-based algorithm for different input-space dimensionalities, and end up questioning the complexity of Deep Learning biases and how it could be estimated.
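As a pointer to what such an estimation can look like, here is a hedged sketch: Cover's counting theorem gives the shattering coefficient of a single homogeneous hyperplane in R^d for points in general position, and plugging it into a VC-style deviation term (whose constants vary across textbooks; the form below is only illustrative) shows how the capacity term grows with the input dimensionality.

    from math import comb, log, sqrt

    def shattering_hyperplane(n, d):
        """Cover's theorem: dichotomies of n points in general position in R^d
        realizable by a homogeneous hyperplane."""
        return 2 * sum(comb(n - 1, i) for i in range(d))

    def capacity_term(n, d, delta=0.05):
        """Illustrative VC-style deviation term sqrt((log S(2n) + log(4/delta)) / n);
        exact constants differ across statements of the bound."""
        return sqrt((log(shattering_hyperplane(2 * n, d)) + log(4 / delta)) / n)

    for d in (2, 10, 100):
        print(d, round(capacity_term(10_000, d), 4))   # grows with dimensionality d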

Talk 2: Olivier Sigaud (Sorbonne University): Efficient exploration for learning continuous action policies

You can download the slides of this talk.

Abstract: This talk is based on our ICML 2018 paper with Cédric Colas and Pierre-Yves Oudeyer. I will first give a tutorial introduction to reinforcement learning (RL) and the DQN and DDPG deep RL algorithms. Then I will highlight a simple exploration issue faced by these systems, present Goal Exploration Processes (GEPs) as an efficient exploration method, and propose a way of combining GEPs with DDPG which provides a satisfactory solution to the issue. Finally, I will situate these results in the broader context of Artificial Intelligence from a developmental learning perspective.
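For intuition, here is a minimal sketch of a Goal Exploration Process on a toy problem, under strong simplifying assumptions (the "policy" is a bare parameter vector and a rollout maps it directly to an outcome); the GEP-PG architecture of the paper is considerably more elaborate.

    import numpy as np

    rng = np.random.default_rng(0)
    rollout = lambda theta: np.tanh(theta)        # toy env: params -> outcome in [-1, 1]^2

    memory = []                                   # stored (params, outcome) pairs
    for _ in range(10):                           # bootstrap with random policies
        theta = rng.standard_normal(2)
        memory.append((theta, rollout(theta)))

    for _ in range(500):
        goal = rng.uniform(-1, 1, size=2)         # sample a goal in outcome space
        # Select the stored policy whose outcome is closest to the goal, then perturb it.
        theta, _ = min(memory, key=lambda m: np.linalg.norm(m[1] - goal))
        theta = theta + 0.1 * rng.standard_normal(2)
        memory.append((theta, rollout(theta)))

    outcomes = np.array([o for _, o in memory])
    print(f"outcome space covered: [{outcomes.min():.2f}, {outcomes.max():.2f}]")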


February 2018

February 22, 2018

The seminar took place from 2PM to 4PM (room C48), and featured two talks:

Talk 1: Odalric-Ambrym Maillard (Inria Lille, Sequel team): A survey of some recent results in structured bandits

The slides for this talk were not shared by the speaker.

Abstract: I will especially focus on the use of multi-armed bandit theory in the context of linearly correlated distributions, optimization of Lipschitz functions, and the problem of sequential ranking. The talk will be based on recent results obtained by Stefan Magureanu and by Audrey Durand during their PhDs.
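As a tutorial-level baseline for readers unfamiliar with bandits, here is a sketch of unstructured UCB1 on Bernoulli arms; it is not one of the structured algorithms from the cited theses.

    import numpy as np

    def ucb1(means, horizon=10_000, seed=0):
        rng = np.random.default_rng(seed)
        K = len(means)
        counts, sums = np.zeros(K), np.zeros(K)
        for t in range(1, horizon + 1):
            if t <= K:
                arm = t - 1                        # pull each arm once to initialize
            else:
                ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
                arm = int(np.argmax(ucb))          # optimism in the face of uncertainty
            reward = float(rng.random() < means[arm])   # Bernoulli reward
            counts[arm] += 1
            sums[arm] += reward
        return horizon * max(means) - sums.sum()   # realized regret estimate

    print(ucb1([0.3, 0.5, 0.6]))   # sublinear regret: small relative to the horizon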

Talk 2: Pierre Senellart (ENS): Reinforcement learning for intensional data management

You can download the slides of this talk.

Abstract: We will review how techniques from reinforcement learning (bandits, Markov decision processes) naturally occur in a wide variety of tasks related to intensional data management, i.e., management of data whose access is associated to a cost. This covers a range of applications, from Web crawling to crowdsourcing, from query optimization to management of virtual machines. In contrast with traditional reinforcement learning applications, data management scenarios often involve a very large but heavily structured state or action space, requiring adaptations of traditional techniques.

November 2017

November 16, 2017

The seminar took place from 2PM to 4PM (room C48), and featured two talks:

Talk 1: Luis Galárraga (Inria Rennes): How to know how much we know

You can download the slides of this talk.

Abstract: Current RDF knowledge bases (KBs) are highly incomplete. Such incompleteness is a serious problem both for data consumers and data producers. Data consumers do not have any guarantee about the completeness of the results of queries run on a KB. This diminishes the practical value of the available data. On the other hand, data producers are blind about the parts of the KB that should be populated. Yet, completeness information management is poorly supported in the Semantic Web: Completeness information for KBs is scarce and no RDF storage engine can nowadays provide completeness guarantees for queries. In this talk we present a vision of a completeness-aware Semantic Web, and explain our ongoing work in this direction, namely on the tasks of automatic generation of completeness annotations, and computation of completeness guarantees for SPARQL queries.

Talk 2: Vincent Audigier (CNAM): Multiple imputation with principal component methods

You can download the slides of this talk.

Abstract: Missing data are common in the domain of statistics. They are a key problem because most statistical methods cannot be applied to incomplete data sets. Multiple imputation is a classical strategy for dealing with this issue. Although many multiple imputation methods have been suggested in the literature, they still have some weaknesses. In particular, it is difficult to impute categorical data, mixed data and data sets with many variables, or a small number of individuals with respect to the number of variables. This is due to overfitting caused by the large number of parameters to be estimated.

This work proposes new multiple imputation methods that are based on principal component methods, which were initially used for exploratory analysis and visualisation of continuous, categorical and mixed multidimensional data. The study of principal component methods for imputation, never previously attempted, offers the possibility to deal with many types and sizes of data. This is because the number of estimated parameters is limited due to dimensionality reduction.

First, we describe a single imputation method based on factor analysis of mixed data. We study its properties and focus on its ability to handle complex relationships between variables, as well as infrequent categories. Its high prediction quality is highlighted with respect to the state-of-the-art single imputation method based on random forests.

Next, a multiple imputation method for categorical data using multiple correspondence analysis (MCA) is proposed. The variability of prediction of missing values is introduced via a non-parametric bootstrap approach. This helps to tackle the combinatorial issues which arise from the large number of categories and variables. We show that multiple imputation using MCA outperforms the best current methods.
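A minimal sketch of the underlying idea, single imputation by iterative PCA (alternating a low-rank fit with refilling the missing cells), under simplifying assumptions: continuous data, a fixed rank, and no regularization. The methods presented in the talk add regularization and, for multiple imputation, bootstrap-based variability.

    import numpy as np

    def iterative_pca_impute(X, rank=2, n_iter=50):
        X = X.copy()
        missing = np.isnan(X)
        X[missing] = np.nanmean(X, axis=0)[np.nonzero(missing)[1]]  # start from column means
        for _ in range(n_iter):
            mu = X.mean(axis=0)
            U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
            low_rank = U[:, :rank] * s[:rank] @ Vt[:rank] + mu      # rank-r reconstruction
            X[missing] = low_rank[missing]        # refill only the missing cells
        return X

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((100, 2))
    data = scores @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((100, 5))
    holes = rng.random(data.shape) < 0.2
    incomplete = np.where(holes, np.nan, data)
    completed = iterative_pca_impute(incomplete)
    print(np.abs(completed[holes] - data[holes]).mean())   # small imputation error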

October 2017

October 5, 2017

The seminar took place from 2PM to 4PM in Amphi Grenat, and featured two talks:

Talk 1: Giovanna Varni (Télécom Paris): Automated analysis of socio-affective signals in interpersonal interaction

The slides for this talk were not shared by the speaker.

Abstract: The establishment of interpersonal interaction, understood as both human-human and human-machine interaction, is grounded in the ability to exchange and manage verbal and nonverbal socio-affective signals. Becoming an effective partner in interaction requires, first, perceiving and understanding these signals and, then, continuously adapting one’s own signals by predicting the future development of the interaction. Endowing systems and artificial agents with the skill to automatically analyze these signals and their dynamics, whether offline or on the fly, is not a trivial task. Still, this is a crucial challenge for improving the effectiveness and naturalness of human-artificial agent interaction in several scenarios, e.g., education, manufacturing industries, assistive technology, and so on.

In this talk, I report on my recent work in Social Signal Processing and Affective Computing, with particular reference to laughter, emotional interdependence and personality traits. This work was performed in the framework of European and national French projects.

Talk 2: James Eagan (Télécom Paris): Reinventing the mind’s bicycle

The slides for this talk were not shared by the speaker.

Abstract: Steve Jobs once described the computer as “a bicycle for the mind.” My work focuses on making this bicycle a better, more expressive tool for humans: by making richer, more expressive interactions, by reinventing how we build software so users can adapt it to their own idiosyncratic needs, and by providing richer tools to interact with and understand data. In this talk, I will present an overview of my work in these three areas.

Bio: James Eagan is an Associate Professor (maître de conférences) in the DIVA group at Télécom Paris (https://www.telecom-paris.fr/), with a research focus on Human-Computer Interaction and Information Visualization. He is also co-chair of the MS Big Data program. Before joining Télécom Paris, he was a post-doctoral researcher at LRI (CNRS & Université Paris-Sud). He holds a Ph.D. and an M.Sc. in Computer Science from the Georgia Institute of Technology and a B.A. in Mathematics/Computer Science from Lawrence University.

September 2017

September 7, 2017

The seminar took place from 2PM to 4PM in Amphi Saphir, and featured two talks:

Talk 1: Albert Bifet (Télécom Paris): Massive Online Analytics for the Internet of Things (IoT)

You can download the slides of this talk.

Abstract: Big Data and the Internet of Things (IoT) have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from the IoT has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, and I will introduce some popular open source tools for data stream mining.
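To illustrate the drift issue on a toy stream, here is a deliberately naive sketch that compares the mean of a recent window against a reference window and flags a drift when they diverge; dedicated detectors (e.g. ADWIN, implemented in tools such as MOA) are far more refined.

    import random
    from collections import deque

    def detect_drifts(stream, window=200, threshold=0.4):
        reference, recent = deque(maxlen=window), deque(maxlen=window)
        drifts = []
        for t, x in enumerate(stream):
            (recent if len(reference) == window else reference).append(x)
            if len(recent) == window:
                ref_mean, rec_mean = sum(reference) / window, sum(recent) / window
                if abs(rec_mean - ref_mean) > threshold:      # means diverged: drift
                    drifts.append(t)
                    reference = deque(recent, maxlen=window)  # restart from current concept
                    recent = deque(maxlen=window)
        return drifts

    random.seed(0)
    stream = [random.gauss(0, 1) for _ in range(3000)] + \
             [random.gauss(1, 1) for _ in range(3000)]        # concept changes at t=3000
    print(detect_drifts(stream))   # expected to flag a drift shortly after t=3000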

Talk 2: François Roueff (Télécom Paris): Prediction of weakly locally stationary processes by auto-regression

You can download the slides of this talk.

Abstract: We introduce locally stationary time series through the local approximation of the non-stationary covariance structure by a stationary one. This allows us to define autoregression coefficients in a non-stationary context, which, in the particular case of a locally stationary Time Varying Autoregressive (TVAR) process, coincide with the generating coefficients. We provide and study an estimator of the time varying autoregression coefficients in a general setting. The proposed estimator of these coefficients enjoys an optimal minimax convergence rate under limited smoothness conditions. In a second step, using a bias reduction technique, we derive a minimax-rate estimator for arbitrarily smooth time-evolving coefficients, which outperforms the previous one for large data sets. In turn, for TVAR processes, the predictor derived from the estimator exhibits an optimal minimax prediction rate.
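A minimal sketch, under simplifying assumptions, of the kind of object involved: estimating a time-varying AR(1) coefficient by kernel-weighted least squares on a simulated TVAR(1) process. The estimators in the talk come with minimax guarantees and a bias-reduction step not reproduced here.

    import numpy as np

    def local_ar1(x, bandwidth=200):
        """Kernel-weighted least-squares estimate of a time-varying AR(1) coefficient."""
        n = len(x)
        t = np.arange(1, n)
        est = np.empty(n - 1)
        for u in range(1, n):
            w = np.exp(-0.5 * ((t - u) / bandwidth) ** 2)   # Gaussian kernel weights
            est[u - 1] = np.sum(w * x[1:] * x[:-1]) / np.sum(w * x[:-1] ** 2)
        return est

    # Simulate a TVAR(1) process whose coefficient moves slowly from -0.5 to 0.5.
    rng = np.random.default_rng(0)
    n = 4000
    theta = np.linspace(-0.5, 0.5, n)
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = theta[i] * x[i - 1] + rng.standard_normal()
    est = local_ar1(x)
    print(est[1000], theta[1001])   # local estimate tracks the true coefficient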

June 2017

June 22, 2017

The seminar took place from 2PM to 4PM in Amphi Saphir, and featured two talks:

Talk 1: Pascal Bianchi (Télécom Paris): Distributed optimization on graphs using operator splitting methods

You can download the slides of this talk.

Abstract: Consider a network of N agents (computing units) having private objective functions and seeking to find a minimum of the aggregate objective. The aim is to design iterative algorithms where, at each iteration, an agent updates a local estimate of the minimizer based on the sole knowledge of its private function and the information received from its neighbors. In this talk, I will first provide an overview of standard distributed optimization methods. Then, I will explain how recent and generic results in stochastic optimization can be used to design asynchronous and adaptive distributed optimization algorithms.
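For intuition, here is a minimal sketch of synchronous distributed gradient descent on a ring of agents: each agent averages its estimate with its neighbors' (through a doubly stochastic gossip matrix W) and takes a gradient step on a private quadratic objective f_i(x) = (x - a_i)^2 / 2. The asynchronous and adaptive variants discussed in the talk are more involved.

    import numpy as np

    # Ring of N agents; uniform weights on self and both neighbors give a
    # doubly stochastic gossip matrix W.
    N = 8
    W = np.zeros((N, N))
    for i in range(N):
        for j in ((i - 1) % N, (i + 1) % N):
            W[i, j] = 1 / 3
        W[i, i] = 1 / 3

    rng = np.random.default_rng(0)
    a = rng.standard_normal(N)            # private targets; aggregate minimizer is a.mean()
    x = np.zeros(N)                       # local estimates
    gamma = 0.1                           # step size
    for _ in range(500):
        grad = x - a                      # gradient of each agent's private objective
        x = W @ x - gamma * grad          # gossip step + local gradient step
    print(x.round(3), a.mean().round(3))  # all agents end up near the global minimizer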

Talk 2: Maximilien Danisch (UPMC): Towards real-world graph algorithmics

You can download the slides of this talk.

Abstract: Real-world graphs (a.k.a. complex networks) are ubiquitous: the web, Facebook, brain networks, protein interaction networks, etc. Studying these graphs and solving computational problems on them (say maximum clique, partitioning or dense subgraph) has applications in many fields. I will first show that the structure of some real-world graphs is special. In particular, they are not adversarial, and some difficult problems (say NP-hard problems) can be solved on some huge real-world graphs (say 2 billion edges or more). I will then present two works along the lines of “understanding and leveraging the structure of real-world graphs in order to build better algorithms”: (i) enumerating all k-cliques and (ii) computing the density-friendly graph decomposition. The latter was published at WWW 2017.
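A minimal sketch in the spirit of ordering-based k-clique enumeration (orient edges along a total order on the nodes, then extend cliques only through common out-neighbors); the algorithm presented in the talk adds a degeneracy ordering and the engineering needed to scale to billions of edges.

    def k_cliques(edges, k):
        # Orient each edge from the lower to the higher node id.
        out = {}
        for u, v in edges:
            u, v = min(u, v), max(u, v)
            out.setdefault(u, set()).add(v)
            out.setdefault(v, set())

        def extend(clique, candidates, depth):
            if depth == k:
                yield clique
                return
            for v in sorted(candidates):
                # Only common out-neighbors can extend the clique further.
                yield from extend(clique + (v,), candidates & out[v], depth + 1)

        for u in out:
            yield from extend((u,), out[u], 1)

    edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (0, 3), (3, 4)]  # K4 plus a pendant edge
    print(list(k_cliques(edges, 3)))   # the four triangles of the K4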