Data Science Seminar

November 16, 2017 at Télécom Paris / Barrault. The seminar is held entirely in English.

The LTCI Data Science Seminar is a joint research seminar of the DIG and S2A teams. It focuses on machine learning and data science topics.

November 16, 2017

The seminar took place from 2PM to 4PM (room C48), and featured two talks:

Talk 1: Luis Galárraga (Inria Rennes): How to know how much we know

You can download the slides of this talk.

Abstract: Current RDF knowledge bases (KBs) are highly incomplete. Such incompleteness is a serious problem for both data consumers and data producers. Data consumers have no guarantee about the completeness of the results of queries run on a KB, which diminishes the practical value of the available data. Data producers, on the other hand, cannot tell which parts of the KB should be populated. Yet completeness information management is poorly supported in the Semantic Web: completeness information for KBs is scarce, and no RDF storage engine can currently provide completeness guarantees for queries. In this talk we present a vision of a completeness-aware Semantic Web and explain our ongoing work in this direction, namely on the tasks of automatic generation of completeness annotations and computation of completeness guarantees for SPARQL queries.

Talk 2: Vincent Audigier (CNAM): Multiple imputation with principal component methods

You can download the slides of this talk.

Abstract: Missing data are common in the domain of statistics. They are a key problem because most statistical methods cannot be applied to incomplete data sets. Multiple imputation is a classical strategy for dealing with this issue. Although many multiple imputation methods have been suggested in the literature, they still have some weaknesses. In particular, it is difficult to impute categorical data, mixed data and data sets with many variables, or a small number of individuals with respect to the number of variables. This is due to overfitting caused by the large number of parameters to be estimated. This work proposes new multiple imputation methods that are based on principal component methods, which were initially used for exploratory analysis and visualisation of continuous, categorical and mixed multidimensional data. The study of principal component methods for imputation, never previously attempted, offers the possibility to deal with many types and sizes of data. This is because the number of estimated parameters is limited due to dimensionality reduction. First, we describe a single imputation method based on factor analysis of mixed data. We study its properties and focus on its ability to handle complex relationships between variables, as well as infrequent categories. Its high prediction quality is highlighted with respect to the state-of-the-art single imputation method based on random forests. Next, a multiple imputation method for categorical data using multiple correspondence analysis (MCA) is proposed. The variability of prediction of missing values is introduced via a non-parametric bootstrap approach. This helps to tackle the combinatorial issues which arise from the large number of categories and variables. We show that multiple imputation using MCA outperforms the best current methods.
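To make the idea of imputing through dimensionality reduction concrete, here is a minimal sketch of single imputation by iterative PCA for continuous data. It is a simplified illustration, not the speaker's method: the abstract's approach relies on factor analysis of mixed data and on MCA with a bootstrap, neither of which is reproduced here. The function name `iterative_pca_impute` and its parameters `rank`, `n_iter` and `tol` are assumptions made for this example.

```python
import numpy as np


def iterative_pca_impute(X, rank=2, n_iter=200, tol=1e-8):
    """Fill NaN entries of X with a low-rank PCA reconstruction.

    Hypothetical helper for illustration: continuous data only,
    no regularisation, and every column needs at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initialise the missing cells with the mean of the observed values per column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        # Fit a PCA (via SVD) on the currently completed, centred data.
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)

        # Keep only the first `rank` components: few parameters to estimate.
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu

        # Overwrite the missing cells only; observed cells stay untouched.
        updated = np.where(missing, low_rank, X)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break

    return filled


if __name__ == "__main__":
    data = np.array([
        [1.0, 2.0, 3.0],
        [2.0, 4.1, 6.2],
        [3.0, np.nan, 9.1],
        [4.0, 8.2, 11.9],
        [5.0, 9.9, np.nan],
    ])
    print(iterative_pca_impute(data, rank=1))
```

Because only `rank` components are retained, the number of estimated parameters stays small even when there are many variables, which is the property the abstract credits for limiting overfitting. A multiple imputation scheme would additionally need to reflect the uncertainty of the fitted components, for example by repeating the fit on bootstrap resamples.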