Explainability by design

The goal is to provide a general learning framework for explainability-by-design. Relying on legal and ethical requirements that decisions affecting human rights be transparent and understandable, we devise novel learning approaches to build a classifier with an explanatory model by leveraging new losses and constraints.

  • Definition of desirable properties, given the regulatory requirements and inspired by the use-cases
  • Design of losses, constraints and evaluation metrics for an interpretable system
  • General framework for joint learning of a predictive model and an explanatory model
  • Instantiation to deep convnets and test of the algorithm
  • Confronting solutions with legal/ethical requirements…

Global vs local explainability

The problem we will address is how to measure the quality of an explanation, from both a legal/ethical and technical perspective.

Interpretability and explainability sometimes have different meanings depending on the research paper. However, we will consider the two words as synonyms, and focus instead on the difference between global and local explainability.

Global explainability is about the explanation of the learning algorithm as a whole, including the training data used, appropriate uses of the algorithm, and warnings regarding its weaknesses and inappropriate uses. In a recent paper on explainability, we refer to global interpretability as a “user’s manual” approach (Beaudouin et al., 2020).

Local explainability refers to the ability of the system to tell a user why a particular decision was made. While global and local explainability are two important features of trustworthiness, we mainly focus on local explainability as the first indispensable ingredient to transparency.


We will develop explainability scenarios, addressing the needs of the relevant audiences for explanation:

  • citizens affected by the algorithmic decision;
  • system operators, who assume the operational responsibility for acting on the decisions;
  • oversight bodies in charge of guaranteeing the fairness, reliability and robustness of the system as a whole, and responsible for rendering accounts to citizens and courts.

Local explanations will occur ex post, such as when a decision is challenged or when the system is audited. However, where decisions are validated by humans, the operator of the system may also need near-real-time local explanations to support the validation decision.

The explanations provided to oversight bodies will need to demonstrate compliance with the original certification criteria of the system, as well as with legal and human rights principles, including the ability for individuals to challenge decisions.


Even though the focus of this project is on convolutional neural networks, the principles that should guide local explainability are common to all predictive models.

In particular, an explanation depends strongly on the audience it is provided to: citizens want to understand and check why they are subject to a decision; system operators want to verify the merits of the decision; and oversight bodies, with the help of technical staff, may need to dive into the system’s decision process to ensure the fairness of the final decision if it is contested. Beyond the basic atoms of explanation, typically some high-level features of the input image, we also need to specify the nature of the explanation, which may be symbolic and logical, causal and counterfactual, or geometrical and visual.

As predictive models are generally based on correlation extraction, the first two kinds of explanation are not trivial to obtain from a predictive tool, as the current state of the literature shows. We will therefore consider an explanation as a simple link between image features and the final decision (Escalante et al. 2018). In image recognition, saliency maps and perturbation-based methods (Selvaraju et al. 2017; Samek et al. 2019; Mundhenk et al. 2020) are amenable to informative visualizations.
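To make the perturbation-based idea concrete, here is a minimal occlusion-saliency sketch. The linear `score` function and all names are hypothetical stand-ins for a trained convnet’s class score; the principle — measure the score drop when each patch of the input is occluded — is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained classifier: a linear score over
# an 8x8 "image". A real convnet's class logit would replace `score`.
weights = rng.normal(size=(8, 8))

def score(image):
    """Class score for one 8x8 image."""
    return float((weights * image).sum())

def occlusion_saliency(image, patch=2):
    """Perturbation-based saliency: the score drop observed when each
    patch of the image is zeroed out."""
    base = score(image)
    sal = np.zeros_like(image)
    for i in range(0, image.shape[0], patch):
        for j in range(0, image.shape[1], patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            # Every pixel in the patch receives the patch's score drop.
            sal[i:i + patch, j:j + patch] = base - score(occluded)
    return sal

image = rng.normal(size=(8, 8))
sal = occlusion_saliency(image)
```

The resulting map can be rendered as a heatmap over the input, which is the kind of informative visualization the methods cited above produce at scale.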

Additionally, we will work on a list of properties that a “good” explanation should satisfy (Lipton, 2018), even though no consensus has yet been reached in the literature (Lundberg and Lee, 2017; Alvarez-Melis et al. 2018). A good explanation should be faithful/correct, consistent, invariant under some predefined transformations, adversarially robust, decomposable, and complete. This set of properties will be converted into evaluation metrics to assess the quality of explanations provided by our system, and into losses used during the training phase.
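As one illustration of turning such a property into a metric, faithfulness is often probed with a deletion test: remove pixels from most to least salient and watch the model’s score; a faithful explanation makes the score fall quickly. The sketch below assumes a toy linear scorer (hypothetical names throughout), not the project’s actual metric.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # toy linear "classifier" weights

def score_fn(img):
    return float((W * img).sum())

def deletion_curve(image, saliency, score_fn, steps=8):
    """Faithfulness check by deletion: zero out pixels from most to
    least salient and record the score after each step."""
    order = np.argsort(saliency.ravel())[::-1]   # most salient first
    flat = image.ravel().copy()
    chunk = max(1, flat.size // steps)
    scores = [score_fn(flat.reshape(image.shape))]
    for k in range(0, flat.size, chunk):
        flat[order[k:k + chunk]] = 0.0
        scores.append(score_fn(flat.reshape(image.shape)))
    return np.array(scores)

image = rng.normal(size=(8, 8))
saliency = W * image                  # exact contributions: fully faithful
curve = deletion_curve(image, saliency, score_fn)
```

The area under this curve gives a scalar faithfulness score that can be compared across explanation methods, or differentiably approximated to serve as a training loss.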

Design of a general framework…

…for learning a classifier with its explanatory model

The approaches developed so far to explainability decompose into two main families:

  • methods that provide post-hoc explanations (Ribeiro et al. 2016), and
  • methods that build explainable models by design (Selvaraju et al. 2017; Samek et al. 2019; Alvarez-Melis et al. 2018).

The so-called post-hoc interpretable methods have the drawback of not serving as a means to improve the quality of a decision system, but merely as a window that approximates what the system is doing. So far, most machine learning tools are uniquely performance-driven: the learning algorithms do not encourage interpretability in any way.

Another line of research, attracting growing interest, explores how to redefine the objectives of learning algorithms so as to design decision systems that explain their decisions (Zhang et al. 2018). This approach opens the door to revisiting the validity domain of a decision system, by preventing use of the system when the explanation is not satisfactory or by warning the user of a potentially misleading decision. However, this explainability-by-design approach is not without pitfalls. Early work in this direction suggests that explainability often comes at the price of lower accuracy, a crucial drawback that has to be studied carefully.


In this project, we aim to contribute to addressing these issues by developing a novel explainability-by-design approach based on two pillars: (i) an axiomatic approach to explainability, based on the properties an explanation should satisfy in general and for each use-case, both in terms of legislation and in terms of interest for the product; and (ii) a view of an explanation as a synthetic, high-level representation of the decision function implemented by the system.

Our goal is to re-visit the definition of supervised classification by considering an explainable classification system as a pair of a predictive model and an explanatory model, the two of them being strongly linked.

The predictive model provides a decision as usual, while the explanatory model outputs an explanation seen as a representation of the model’s decision process. However, the explanatory model will also provide feedback to the predictive model, which requires modifying the loss function, introducing dedicated constraints corresponding to the desirable properties of an explanation, and adapting the learning algorithm accordingly.
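One hedged sketch of how such a modified loss could look: the standard cross-entropy of the predictive model plus a penalty on the explanatory model’s output. The L1 sparsity term and the name `joint_loss` are illustrative assumptions, not the project’s actual objective; the real framework would add further terms for consistency, invariance, and the other desirable properties.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_loss(logits, label, explanation, lam=0.1):
    """Illustrative joint objective: cross-entropy for the predictive
    model plus an L1 penalty steering the explanatory model toward
    sparse, hence synthetic, explanations."""
    ce = -np.log(softmax(logits)[label])
    return ce + lam * np.abs(explanation).sum()

logits = np.array([2.0, 0.5, -1.0])   # predictive model output
dense = np.array([0.4, 0.3, 0.2, 0.1])    # explanation over 4 atoms
sparse = np.array([0.9, 0.0, 0.0, 0.0])
```

Under this toy objective, a sparser explanation of equal predictive quality yields a lower loss, which is precisely the feedback channel from the explanatory model back to the predictive one.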

As in the previous work packages, we will put emphasis on the versatility of the developed approaches that should work for convnets as well as other differentiable classifiers.

Assessment of explainability-by-design approaches…

…with respect to legal/ethical needs

Given different scenarios and metrics, including legal requirements, we will assess the quality of explanations provided by our explainability-by-design approach and by relevant baselines, and iteratively improve the approach to seek a model that achieves “explainability by design”.

Traditional approaches to software verification and validation (V&V) are poorly adapted to neural networks (Peterson 1993; Borg et al. 2019; U.S. Food and Drug Administration 2019).


The conditions under which image recognition systems are deployed must guarantee the quality of the system, the absence of discrimination, and preserve citizens’ rights to challenge decisions. In addition, image recognition can create interferences with individuals’ right to protection of personal data, and this interference will need to be analyzed under the GDPR and the proportionality test of the European Charter of Fundamental Rights and the European Convention on Human Rights. Reliability, fairness and local interpretability by design will also contribute an important brick toward the certification of machine learning systems for critical applications.

The challenges of explainability-by-design

The challenges include the non-determinism of neural network decisions, which makes it hard to demonstrate the absence of unintended functionality, and the adaptive nature of machine-learning algorithms, which makes them a moving target (Borg et al. 2019; U.S. Food and Drug Administration 2019).

Specifying a set of requirements that comprehensively describe the behavior of a neural network is considered the most difficult challenge with regard to traditional V&V and certification approaches (Borg et al. 2019; Bhattacharyya et al. 2015).

The absence of complete requirements poses a problem because one of the objectives of V&V is to compare the behavior of the software to a document that describes precisely and comprehensively the system’s intended behavior (Peterson 1993). For neural networks, there may remain a degree of uncertainty about the output for a given input.

Other barriers include the absence of detailed design documentation and the lack of interpretability of machine learning models, both of which undermine the comprehensibility and trust generally required in certification processes (Borg et al. 2019).

While not overcoming all of these challenges, the solutions developed in this project will permit developers and operators to measure and demonstrate compliance with key quality, legal and ethical parameters.

The ability to measure and demonstrate compliance is currently an important missing element in the approval of machine learning systems for critical applications. The measurement of key quality parameters will serve not only during the approval process, but also throughout the system’s life cycle, as it is regularly reviewed and tested for errors including bias.