Our Project
The NoRDF Project is a scientific project at Télécom Paris that aims to model and extract complex information from natural language text. More precisely, we want to enrich knowledge bases with events, causation, precedence, stories, negation, and beliefs. We want to extract this type of information at scale from structured and unstructured sources, and we want to allow machines to reason on it. The project brings together research on knowledge representation, on reasoning, and on information extraction, and aims to be useful for applications such as fake news detection, the modeling of controversies, or the analysis of the e-reputation of a company.
The project is funded by the French National Research Agency (ANR), in collaboration with four industrial partners, with total funding of 1.3 million €. It runs from 2020 to 2024.
Motivation
In the past decade, information extraction has made huge progress. We can now extract facts from Web documents at large scale, and knowledge bases (KBs) such as KnowItAll, DBpedia, NELL, BabelNet, Wikidata, and our own YAGO contain many millions of entities and hundreds of millions of facts about them. And yet, all of these KBs operate on an extremely reduced fraction of knowledge: they essentially focus on binary relations between a subject and an object. For example, a KB can know that
type(autism, developmentalDisorder), or that
vaccinates(MmrVaccine, measles). This knowledge representation model is called RDF. The problem is that RDF can capture barely anything from the Wikipedia article about vaccines. Take for example this text about the supposed link between vaccines and autism:
In February 1998, Andrew Wakefield published a paper in the medical journal The Lancet, which reported on twelve children with developmental disorders. The parents were said to have linked the start of behavioral symptoms to vaccination. The resulting controversy became the biggest science story of 2002. As a result, vaccination rates dropped sharply. In 2011, the BMJ detailed how Wakefield had faked some of the data behind the 1998 Lancet article.
From this text, the current mainstream methods for KB construction would extract just “Andrew Wakefield published a paper”, and barely anything else. Of course, we could use non-symbolic methods (such as distributional methods or deep learning approaches) to decide whether Andrew Wakefield's paper is trustworthy or not. But suppose that we want to decide whether there is a causal link between autism and vaccination; why we see a lower vaccination rate; or with which arguments another blog post supports the anti-vaccine movement. For this, we need a more detailed understanding of the text. The machine would have to understand:
- The fact that something was asserted (which does not make it true)
- The fact that something is not true
- The fact that something happened before something else
- The fact that some group of facts forms an event
- The fact that one event is the reason for another event
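To make the five points above concrete, here is a minimal Python sketch of how a plain binary fact differs from a fact annotated with attribution, negation, time, event grouping, and causation. It is only an illustration under stated assumptions, not the NoRDF model: all class, field, and relation names (Triple, Statement, Event, asserted_by, linkedTo, and so on) are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Triple:
    """A plain binary fact, relation(subject, object), as in RDF."""
    relation: str
    subject: str
    obj: str


@dataclass(frozen=True)
class Statement:
    """A triple with hypothetical annotations for negation and attribution."""
    triple: Triple
    negated: bool = False               # the fact is stated to be not true
    asserted_by: Optional[str] = None   # who claimed it; a claim need not be true


@dataclass(frozen=True)
class Event:
    """A named group of statements with a coarse time stamp (a year)."""
    name: str
    year: int
    statements: Tuple[Statement, ...] = ()


def happened_before(a: Event, b: Event) -> bool:
    """Precedence between two events, judged by year only."""
    return a.year < b.year


# What a binary-relation KB can hold today.
known = Triple("vaccinates", "MmrVaccine", "measles")

# A claim that was merely asserted (illustrative encoding of the example text).
claim = Statement(
    Triple("linkedTo", "behavioralSymptoms", "vaccination"),
    asserted_by="parents",
)

# Two events from the example text, grouped and dated.
publication = Event(
    "WakefieldPaper",
    1998,
    (Statement(Triple("published", "AndrewWakefield", "paper")), claim),
)
controversy = Event("Controversy", 2002)

# Causation recorded as a relation between events, not between entities.
reasons = {(publication.name, controversy.name)}
```

In this sketch, the only part a triple store captures directly is `known`; attribution, negation, precedence, event grouping, and causation each need a structure that goes beyond a single subject-relation-object fact.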
Applications
If successful, our project should contribute to:
- The analysis of fake news: understand an article about a controversial topic, and allow reasoning on it (who said what, when, and why; what the evidence is; how others perceive the claims), with the goal of supporting journalists in fact-checking a story.
- The implementation of the “right to be forgotten”: analyze requests sent to search engines to help human reviewers in determining whether the requests meet legal criteria.
- The modeling of controversies: detect a controversial topic on the Web (e.g., in blogs, forums, or Twitter posts), extract opinions, and model different standpoints.
- The analysis of the e-reputation of a company (or its competitor): map out cases of controversy or beliefs of valuations, together with their reasons, and their support among journalists, clients, and the general public. This can include the analysis of product reviews to identify fine-grained praise or complaints.
- The flagging of potentially fraudulent activity: detect patterns in textual communication that indicate fraud, claims that are in contradiction with established knowledge, or violations of rules.
- The modeling of processes: summaries of technical interventions often contain sequences of actions performed, causal relationships, and suggestions, which could be extracted and analyzed with our model.
- The development of smarter chatbots: allow dialogues that go beyond single-shot questions, build up a mental model of the user and their beliefs, and reason on them. Several companies develop chatbots, and this field is thus a rich ground for applications.