Do machines understand what they read?
22 October 2020
Information extraction (IE) is the process of extracting machine-readable, structured information from natural language text. For example, given the sentence “Angelina Jolie stars in the superhero film The Eternals”, an IE system can extract the fact ⟨Angelina Jolie, stars, The Eternals⟩. IE is used in many areas we are all familiar with, including search engines, science, and the digital humanities. IE techniques have been used for fact checking, to examine the “Panama Papers”, and to extract semantic information from webpages for news agencies.
Most of these IE systems extract triples, i.e., facts that consist of a subject, a predicate, and an object. In our example ⟨Angelina Jolie, stars, The Eternals⟩, the subject is Angelina Jolie, the predicate is stars, and the object is The Eternals. However, much of the information that we care about is not of this form. Consider for example the following sentence (taken from the Wikipedia article about Angelina Jolie): “Jolie applied for adoption as a single parent, because Vietnam’s adoption regulations do not allow unmarried couples to co-adopt”. This sentence does not express a simple triple. Instead, it contains a negation (“do not allow”), a modifier (“as a single parent”), and a causal relationship (“because”).
A cursory reading of any Wikipedia article, blog, journalistic text, or even just the present blog post suggests that the majority of sentences are not concerned with simple triples, but with more complex information. The question thus arises to what degree current IE systems can deal with such information.
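To make the contrast concrete, the two example sentences above can be written down as data. The following is only an illustrative sketch: the `Triple` and `Statement` classes and their fields are hypothetical names chosen for this post, not the output format of any of the IE systems discussed here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Triple:
    """A plain subject-predicate-object fact."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Statement:
    """Hypothetical richer record: a triple plus the context a bare triple drops."""
    triple: Triple
    negated: bool = False                                     # e.g. "do not allow"
    modifiers: Dict[str, str] = field(default_factory=dict)  # e.g. "as a single parent"
    cause: Optional["Statement"] = None                       # e.g. "because ..."


# The simple sentence fits into a single triple.
simple = Triple("Angelina Jolie", "stars", "The Eternals")

# The adoption sentence needs negation, a modifier, and a causal link.
regulation = Statement(
    Triple("Vietnam's adoption regulations", "allow", "unmarried couples to co-adopt"),
    negated=True,
)
adoption = Statement(
    Triple("Jolie", "applied for", "adoption"),
    modifiers={"as": "a single parent"},
    cause=regulation,
)
```

Dropping everything but `adoption.triple` would leave only ⟨Jolie, applied for, adoption⟩, which loses both the reason for the application and the fact that the regulation is stated negatively.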
In this survey article, Fabian Suchanek and Mechket Emna Mahouachi focus on five systems that have specific provisions for dealing with more than triples: FRED, K-Parser, ClausIE, MinIE, and OpenIE. The authors systematically analyze the ability of these systems to extract complex information such as beliefs, negation, causality, anteriority, n-ary relations, cross-sentence references, and contrast. They find that, while some systems can deal with some of these types of sentences, none of them is able to deal with all of them. This means that a large part of the sentences in real-world text remains beyond the reach of current IE systems. This, in turn, limits the ability of AI systems to understand text, to converse with humans, to learn from news, or to identify fake news. In the NoRDF project, Suchanek and his colleagues aim to close this gap and to enable IE systems to understand not just simple triples, but the full richness of human language.
By Fabian Suchanek, Télécom Paris, Institut Polytechnique de Paris