Do machines understand what they read?

One of the challenges of natural language processing (NLP) is to get the meaning right. Subtle changes in a sentence can radically change its meaning. Computers may have difficulty detecting irony and sarcasm, or distinguishing a statement of fact from an opinion. In their recent paper, Fabian Suchanek from Télécom Paris and his coauthor Mechket Emna Mahouachi from ENSTA provide an introduction to the fascinating field of information extraction. Professor Suchanek directs a research program at Télécom Paris on extracting complex meaning from text.

Information extraction (IE) is the process of extracting machine-readable, structured information from natural language text. For example, given the sentence “Angelina Jolie stars in the superhero film The Eternals”, an IE system can extract the fact ⟨Angelina Jolie, stars, The Eternals⟩. IE is used in many areas, including search engines, science, and the digital humanities. IE techniques have been used for fact checking, to examine the “Panama Papers”, and to extract semantic information from webpages for news agencies. Most of these IE systems extract triples, i.e., facts that consist of a subject, a predicate, and an object. In our example ⟨Angelina Jolie, stars, The Eternals⟩, the subject is Angelina Jolie, the predicate is stars, and the object is The Eternals.

However, much of the information that we care about is not of this form. Consider for example the following sentence (taken from the Wikipedia article about Angelina Jolie): “Jolie applied for adoption as a single parent, because Vietnam’s adoption regulations do not allow unmarried couples to co-adopt”. This sentence does not express a simple triple. Instead, it contains a negation, a modifier (“as a single parent”), and a causal relationship. A cursory reading of any Wikipedia article, blog, piece of journalism, or even the present blog post suggests that the majority of sentences are not concerned with simple triples, but with more complex information. The question thus arises to what degree current IE systems can deal with such information.
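To make the difference concrete, here is a minimal Python sketch of how the two example sentences might be represented as data. The class and field names (Triple, ComplexFact, negated, modifier, cause) are assumptions made up for this illustration. They do not come from the paper or from any of the IE systems mentioned in this post, and the way the second sentence is split into parts is just one possible reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """A simple fact: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class ComplexFact:
    """Hypothetical wrapper that adds what a bare triple cannot express."""
    triple: Triple
    negated: bool = False                    # "do not allow"
    modifier: Optional[str] = None           # "as a single parent"
    cause: Optional["ComplexFact"] = None    # "because ..."


# The first example sentence fits a plain triple.
fact = Triple("Angelina Jolie", "stars", "The Eternals")

# The second example sentence needs a negation, a modifier, and a cause.
reason = ComplexFact(
    Triple("Vietnam's adoption regulations", "allow",
           "unmarried couples to co-adopt"),
    negated=True,
)
adoption = ComplexFact(
    Triple("Jolie", "applied for", "adoption"),
    modifier="as a single parent",
    cause=reason,
)
```

Without the extra fields, the second sentence would collapse to ⟨Jolie, applied for, adoption⟩, and the negation, the modifier, and the causal link would be lost.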

In this survey article, Fabian Suchanek and Mechket Emna Mahouachi focus on five systems that have particular provisions for dealing with more than triples: FRED, K-Parser, ClausIE, MinIE, and OpenIE. The authors systematically analyze the ability of these systems to extract complex information such as beliefs, negation, causality, anteriority, n-ary relations, cross-sentence references, and contrast. They find that, while some systems can deal with some of these types of sentences, none of them is able to deal with all of them. This means that a large part of the sentences in real-world text remains beyond the reach of current IE systems. This, in turn, has ramifications for the ability of AI systems to understand text, to converse with humans, to learn from news, or to identify fake news. In the NoRDF project, Suchanek and his colleagues aim to close this gap, and to enable IE systems to understand not just simple triples, but the full richness of human language.


Find out more about the aim of the NoRDF project in the video below.


By Fabian Suchanek, Télécom Paris, Institut Polytechnique de Paris