Algorithmic fairness and de-biasing techniques in image recognition

For facial recognition, the main challenge is fairness: the algorithm should behave in the same way whatever the visible characteristics of the person are (age, sex, skin color, nationality, etc.).

DCNNs are the state-of-the-art algorithms for face recognition. They produce an embedding of the face from an image: when a DCNN encodes a face image, it projects it into an N-dimensional space. The local density in this N-dimensional space is directly linked to the FAR (false acceptance rate), a key figure for border security. It should be the same (which means a uniform distribution), whatever the visible characteristics of the person are.
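As a toy illustration of how the FAR relates to the distribution of impostor comparison scores, the sketch below (numpy only; the score distribution and threshold are entirely hypothetical) computes the FAR as the fraction of impostor comparisons that fall above a decision threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical similarity scores between templates of *different* people
# (impostor pairs), as would be produced by a DCNN face encoder.
impostor_scores = rng.normal(loc=0.1, scale=0.15, size=100_000)

def far(scores, threshold):
    """False acceptance rate: share of impostor pairs scored above threshold."""
    return float(np.mean(scores >= threshold))

# In a fairness audit, this quantity would be computed per demographic group:
# a denser region of the embedding space yields a higher local FAR.
print(f"FAR at t=0.5: {far(impostor_scores, 0.5):.5f}")
```

A uniform embedding distribution, as called for above, would make this rate (approximately) the same for every group.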

Robustness and bias are very challenging given the limitations inherent in facial recognition training data sets. Face recognition algorithms are trained on public datasets containing millions of images and hundreds of thousands of identities, but these datasets are strongly biased in their identity and age distributions. This leads to variations in performance depending on the quality of the image, and possibly also on individual characteristics such as skin color or gender.

We will develop original strategies for identifying and correcting selection bias.



Bias issues in AI have been recently reviewed in Bertail et al. (2019).

Representativeness issues do not vanish simply under the effect of the size of the training set. Selection bias issues, that is to say situations where the samples available for learning a predictive rule are not distributed as those to which the rule will be applied when deployed, are now the subject of much attention in the literature.

Data gathered in many repositories available on the Web has not been collected by means of a rigorous experimental design but rather opportunistically.

Depending on the nature of the mechanism causing the sample selection bias and on that of the statistical information available for learning a decision rule, special cases have been considered in the literature, for which dedicated approaches have been developed.

Most of these methods boil down to weighting the training observations using appropriate weights, based on the Inverse Probability Weighting technique.

For instance, these weights are the inverses of the first-order inclusion probabilities in the case where data are acquired by means of a survey sampling plan, cf. Clémençon et al. (2017).

In the context of regression under random censorship, a weighted version of the empirical risk can also be considered, weights corresponding to the inverses of estimates of the probability of not being censored, see e.g. Ausset et al. (2019).
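A minimal numpy sketch of such a weighted risk under random censorship (hypothetical data; the censoring survival function G is assumed known here, whereas in practice it would be estimated, e.g. by a Kaplan-Meier-type estimator):

```python
import numpy as np

def ipcw_risk(pred, obs_times, not_censored, G):
    """Weighted empirical risk under random censorship: each *uncensored*
    observation is weighted by the inverse of G(t), the probability of not
    being censored by time t; censored observations get weight zero."""
    obs_times = np.asarray(obs_times, dtype=float)
    delta = np.asarray(not_censored, dtype=float)   # 1 if the event is observed
    weights = delta / np.maximum(G(obs_times), 1e-12)
    loss = (np.asarray(pred, dtype=float) - obs_times) ** 2  # e.g. squared error
    return float(np.sum(weights * loss) / np.sum(weights))

# Hypothetical censoring mechanism: exponential with rate 0.1, G(t) = exp(-0.1 t).
G = lambda t: np.exp(-0.1 * t)
risk = ipcw_risk(pred=[4.0, 6.0, 5.0], obs_times=[5.0, 7.0, 3.0],
                 not_censored=[1, 1, 0], G=G)
```

With no censoring and G identically 1, this reduces to the plain empirical risk.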

In general, side information about the cause of the selection bias is crucial for deriving explicit forms for the appropriate weights and for designing ways of estimating them from the observations available.

In many situations, the selection bias mechanism is too complex to derive fully explicit forms for the appropriate weights that would make it possible to mimic the target distribution based on the observations available.


In line with the aforementioned approaches, a preliminary framework that allows tackling problems where the biasing mechanism at work is very general, if certain identifiability hypotheses are satisfied, has been developed in Clémençon and Laforgue (2020).

Promising preliminary experiments, based on synthetic and real data, have also provided empirical evidence of the relevance of the approach.

In Clémençon et al. (2020) and in Ausset et al. (2019), several situations (e.g. positive-unlabeled learning, learning under random censorship) have been exhibited where the biasing mechanism depends on a few (possibly functional) parameters that can be estimated.


The case where the biasing mechanism is only approximately known is of considerable interest in practice, and one of the main objectives is to investigate to what extent the statistical learning guarantees established are preserved.


It is our goal to understand the conditions under which plugging in estimates of the biasing functions at work does not compromise the accuracy of the methodology proposed in Clémençon and Laforgue (2020).


When a (small) sample of observations drawn from the target/test distribution is available, a natural way to check the absence of bias in the training sample is to test the homogeneity of the two samples.

The ‘two-sample problem’ arises in a wide variety of applications, ranging from bioinformatics and psychometrics to database attribute matching for instance. However, it still raises challenging questions in a (highly) multidimensional setting, where the notion of rank is far from straightforward.

A recent and original approach initiated in Clémençon et al. (2009) consists in investigating how bipartite ranking methods for multivariate data can be exploited in order to extend the rank-based test approach for testing homogeneity between two samples to a multivariate framework.

Offering an appealing alternative to parametric approaches, plug-in techniques or Maximum Mean Discrepancy methods, the idea promoted is simple and avoids the curse of dimensionality: the data are split into two subsamples (with approximately the same proportion of positive instances), so that an empirical maximizer of the chosen rank-based criterion (e.g. the popular AUC criterion) can be learned on the first subsample and then used to score the data of the second. Easy to formulate, this promising approach remains to be studied at length.
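A toy numpy sketch of this idea, in which a simple difference-of-means projection stands in (hypothetically) for the empirical AUC maximizer discussed above: a held-out AUC close to 0.5 is evidence of homogeneity, while an AUC far from 0.5 signals a distribution shift.

```python
import numpy as np

def heldout_auc(sample_a, sample_b, seed=0):
    """Rank-based two-sample check: learn a scorer on half the pooled data
    to separate the two samples, then compute the AUC of that scorer on the
    held-out half. AUC near 0.5 suggests the two samples are homogeneous."""
    rng = np.random.default_rng(seed)
    X = np.vstack([sample_a, sample_b])
    y = np.r_[np.zeros(len(sample_a)), np.ones(len(sample_b))]
    idx = rng.permutation(len(y))
    fit, held = idx[:len(y) // 2], idx[len(y) // 2:]
    # Toy linear scorer: projection on the difference of the two class means
    # (a stand-in for the empirical maximizer of the AUC criterion).
    direction = X[fit][y[fit] == 1].mean(axis=0) - X[fit][y[fit] == 0].mean(axis=0)
    scores = X[held] @ direction
    pos, neg = scores[y[held] == 1], scores[y[held] == 0]
    # Empirical AUC: P(score of a sample-b point > score of a sample-a point).
    return float(np.mean(pos[:, None] > neg[None, :]))

rng = np.random.default_rng(1)
same = heldout_auc(rng.normal(size=(500, 5)), rng.normal(size=(500, 5)))
shifted = heldout_auc(rng.normal(size=(500, 5)),
                      rng.normal(loc=1.0, size=(500, 5)))
```

On the synthetic data above, `same` lands near 0.5 while `shifted` is close to 1.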

Another objective of this work is to investigate this approach from both theoretical and experimental perspectives. If it can be successfully implemented, this approach could also offer an alternative method to debias training samples, by designing a procedure that removes the data causing the rejection of the homogeneity assumption, or weights them in an appropriate fashion.


Machine learning systems that make crucial decisions for humans, including decisions on border control, should guarantee that they do not penalize certain groups of individuals. Fairness, like other trustworthiness properties, can be imposed during learning by adding appropriate constraints. Fairness constraints are generally modeled by means of a (qualitative) sensitive variable, indicating membership in a certain group (e.g., ethnicity, gender). The vast majority of the work dedicated to algorithmic fairness in machine learning focuses on binary classification.

In this context, fairness constraints force the classifiers to have the same true positive rate (or false positive rate) across the sensitive groups.
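A minimal sketch of checking such a constraint empirically (labels, predictions and group memberships below are hypothetical):

```python
import numpy as np

def tpr_gap(y_true, y_pred, group):
    """Equal-opportunity check: largest absolute difference in true positive
    rate across the sensitive groups (0 means the constraint holds exactly)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs = []
    for g in np.unique(group):
        positives = (group == g) & (y_true == 1)
        tprs.append(np.mean(y_pred[positives]))  # TPR within group g
    return float(max(tprs) - min(tprs))

# Hypothetical predictions for two groups "a" and "b": group a has TPR 0.5,
# group b has TPR 1.0, so the fairness constraint is violated by 0.5.
gap = tpr_gap(y_true=[1, 1, 1, 1, 0, 0],
              y_pred=[1, 0, 1, 1, 0, 1],
              group=["a", "a", "b", "b", "a", "b"])
```

An analogous function on the negatives would check equality of false positive rates.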


However, a fairness constraint can degrade the performance of the algorithm in other areas: for example, it may teach the algorithm to pay particular attention to small differences in the eye shape of persons from a certain under-represented ethnic group, in order to ensure that the quality of the algorithm for that group is equivalent to its quality for the other groups of the population. But this constraint may lead to decreased performance of the algorithm for another group. Tradeoffs will thus be necessary. The intensity of the fairness constraints and the resulting tradeoffs will depend on the legal/ethical requirements for face recognition.

DCNNs for face encoding learn how to project images of faces into a high-dimensional space. Similarities between images are calculated using the cosine similarity between templates in this N-dimensional space. Ideally, this projection should be even whatever the soft biometrics of the individual (men and women should each occupy half of the space, no bias as a function of nationality, eye color, etc.).
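For concreteness, a minimal sketch of the template comparison step (the low-dimensional templates below are hypothetical; real encoders typically use a few hundred dimensions):

```python
import numpy as np

def cosine_similarity(t1, t2):
    """Cosine similarity between two face templates (embeddings) produced by
    the encoder: 1 means same direction in the N-dimensional space, 0 means
    orthogonal templates."""
    t1, t2 = np.asarray(t1, dtype=float), np.asarray(t2, dtype=float)
    return float(t1 @ t2 / (np.linalg.norm(t1) * np.linalg.norm(t2)))

# Hypothetical 4-dimensional templates.
a = np.array([0.2, 0.1, -0.4, 0.3])
same_direction = cosine_similarity(a, 2.0 * a)              # equals 1 up to rounding
orthogonal = cosine_similarity([1, 0, 0, 0], [0, 1, 0, 0])  # equals 0
```

A decision threshold on this similarity then yields the FAR/FRR operating point discussed earlier.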

Das et al. (2018) propose to mitigate these biases for soft biometrics estimation. For face encoding, losses such as the von Mises-Fisher loss of Hasnat et al. (2017) could help balance these biases if they are properly estimated.

Previous work was also devoted to algorithmic fairness, but for a different problem: learning scoring functions from binary-labeled data. This statistical learning task, referred to as bipartite ranking, is of considerable importance in applications for which fairness requirements are a major concern (credit scoring in banking, pathology scoring in medicine or recidivism scoring in criminal justice).


Evaluating performance is itself a challenge.

The gold standard, the ROC curve, is highly relevant for evaluating the fairness of face recognition algorithms, but serious computational problems come up with such a functional criterion, so most of the literature focuses on the maximization of scalar summaries, e.g. the AUC criterion. A thorough study of fairness in bipartite ranking has been proposed in Vogel et al. (2020), where the goal is to guarantee that sensitive variables (such as skin color) have little impact on the rankings induced by a scoring function. Limitations in using the AUC to measure fairness have motivated the design of richer definitions of fairness for scoring functions, related to the ROC curves themselves. These definitions have strong implications for fair classification: classifiers obtained by thresholding such fair scoring functions approximately satisfy definitions of classification fairness for a wide range of thresholds.
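As a small sketch of scalar (AUC-based) fairness evaluation per sensitive group (scores, labels and groups below are hypothetical):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Empirical AUC: probability that a positive scores above a negative."""
    sp = np.asarray(scores_pos, dtype=float)
    sn = np.asarray(scores_neg, dtype=float)
    return float(np.mean(sp[:, None] > sn[None, :]))

def groupwise_auc(scores, labels, group):
    """AUC of a scoring function restricted to each sensitive group;
    large differences across groups indicate ranking unfairness in the
    (limited) AUC sense discussed above."""
    scores, labels, group = map(np.asarray, (scores, labels, group))
    return {g: auc(scores[(group == g) & (labels == 1)],
                   scores[(group == g) & (labels == 0)])
            for g in np.unique(group)}

# Hypothetical scores: the ranking is perfect for group "a" (AUC 1.0)
# and fully reversed for group "b" (AUC 0.0).
aucs = groupwise_auc(scores=[0.9, 0.2, 0.1, 0.3, 0.8],
                     labels=[1, 0, 0, 1, 0],
                     group=["a", "a", "a", "b", "b"])
```

The ROC-based definitions mentioned above refine this scalar comparison by constraining the whole group-wise curves, not just their areas.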

We will investigate to what extent the accuracy of decision rules learned by machine learning for the related face recognition tasks can be preserved under fairness constraints.

We will also try to extend concepts and methods for fair bipartite ranking to similarity ranking, a variant of bipartite ranking covering key applications such as scoring for face recognition, see e.g. Vogel et al. (2018).