The NoRDF Project
Fabian Suchanek
Professor at Télécom Paris
Works on: knowledge bases, reasoning, natural language processing
Past:
• Max Planck Institute for Informatics/Germany
• Microsoft Research/US
• INRIA Saclay/France
2
A Language Model (such as GPT-3, ChatGPT, or GPT-4) is a probability distribution over sequences of words. It can be used in particular to predict a likely next word in a sentence and, by extension, to converse with humans.
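A minimal sketch of this next-word prediction, assuming the HuggingFace transformers library and the publicly available GPT-2 weights as stand-ins:

```python
# Minimal sketch: next-word probabilities with GPT-2
# (assumes `pip install transformers torch`).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Elvis Presley was a famous"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the prompt
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([int(token_id)]):>12}  {prob.item():.3f}")
```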
Me: My colleagues at work keep avoiding me. They never say a single word to me.
Could I sue them for harassment?
GPT3: That's a possibility, but it might not be the best solution. You could try talking to them first.
Me: What if talking doesn't help?
GPT3: Then you could consult with a lawyer and see if you have a case.
Me: What would a lawyer say?
GPT3: They would likely say that you have a case if the avoidance is severe and impacting your work.
...
[example by Goran Frehse on OpenAI's GPT3]
3
Language Models
The NoRDF project is concerned with the weaknesses of language models,
and how they can be remedied.
How do we measure how good the output of a language model is?
• We systematized criteria for the quality of a story from the humanities
• We had stories annotated manually according to these criteria
4
What are language models actually saying?
Relevance (RE): how well the story matches its prompt
Coherence (CH): how much the story makes sense
Empathy (EM): how well the reader understood the character’s emotions
Surprise (SU): how surprising the end of the story was
...
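A toy sketch of how such per-criterion ratings (on a 1-5 scale) could be aggregated for one generated story; the annotator scores below are made up for illustration:

```python
# Toy sketch: averaging 1-5 human ratings per criterion for one generated story.
from statistics import mean

# Hypothetical ratings by three annotators
ratings = {
    "RE": [4, 5, 4],   # Relevance
    "CH": [3, 3, 4],   # Coherence
    "EM": [2, 3, 3],   # Empathy
    "SU": [1, 2, 2],   # Surprise
}

for criterion, scores in ratings.items():
    print(f"{criterion}: {mean(scores):.2f}")
```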
5
[Chart: stories generated by language models, evaluated by humans on criteria from the humanities (scale 1-5); human performance vs. best language model performance.]
[Chhun, Colombo, Suchanek, Clavel: A Benchmark of the Evaluation of Story Generation, COLING 2022]
Models cannot talk as well as humans (yet)
6
Can language models reason?
We surveyed how language models fare on reasoning tasks. Examples:
The lawyer visited the doctor. Did the doctor visit the lawyer?
GPT3: The lawyer did not visit the doctor.
Mary is happy. Mary is rich. If someone is happy and rich, he is not blond. Is Bob blond?
GPT3: Bob is not blond.
The light is on. I eat a pizza. I press the light switch. I press the light switch. I eat a pizza.
Is the light on?
Finetuned BERT: erm... (50% error rate). GPT-3 similar.
[Helwe, Clavel, Suchanek: “Deep Learning, but Shallow Reasoning”, AKBC 2021]
Transformer models have systematic limitations
when it comes to logical reasoning!
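For comparison, the light-switch question becomes trivial once the state is tracked symbolically; a toy sketch (not from the cited survey):

```python
# Toy sketch: tracking the light-switch state symbolically.
# Each press toggles the light; eating a pizza does not affect it.
def light_after(events, light_on=True):
    for event in events:
        if event == "press switch":
            light_on = not light_on
        # "eat pizza" and other events leave the light unchanged
    return light_on

events = ["eat pizza", "press switch", "press switch", "eat pizza"]
print(light_after(events, light_on=True))  # True: the light is still on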
7
Can language models remember?
Does adding two odd numbers give an odd number?
ChatGPT: Yes. This is because any odd number is of the form 2n+1, and the addition of two odd numbers is (2n+1)+(2m+1) = 2(n+m) + 2, which is also in the form 2p+1.
(In fact, 2(n+m)+2 = 2(n+m+1) is even, so the correct answer is no.)
Did Elvis Presley die?
GPT3: Yes, Elvis Presley died on August 16, 1977.
Is Elvis Presley alive?
GPT3: There is no definitive answer, but many people believe that Elvis Presley is alive and well.
There have been many sightings of Elvis over the years, but none have been definitively proven.
Language models will forget and invent
information at their own discretion!
They know how to talk, but not what to say!
[Suchanek, Varoquaux: “Language Models and Symbolic Representations”, The Conversation, 2022]
8
We need symbolic knowledge
[Diagram: Information Extraction turns the sentence "Elvis Presley died on August 16, 1977." into the symbolic fact (Elvis Presley, diedOnDate, 1977-08-16).]
Applications in:
- fake news detection
- question answering
- text summarization
- automated reasoning on text
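A toy sketch of what such a symbolic fact looks like as a subject-predicate-object triple (illustrative code, not a particular system):

```python
# Toy sketch: symbolic facts as (subject, predicate, object) triples.
facts = {
    ("Elvis Presley", "diedOnDate", "1977-08-16"),
    ("Elvis Presley", "type", "singer"),
}

def query(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known fact."""
    return {o for s, p, o in facts if s == subject and p == predicate}

print(query("Elvis Presley", "diedOnDate"))  # {'1977-08-16'}
```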
9
Contribution 1: Information Extraction Survey
[Diagram: a language model performs information extraction, turning "Elvis Presley died on August 16, 1977." into the fact (Elvis Presley, diedOnDate, 1977-08-16).]
We wrote a book (250 pages) and a survey on the state of the art in information extraction.
[Weikum, Dong, Razniewski, Suchanek: Machine Knowledge, Found. & Trends in Databases, 2021]
[Zacchary Sadeddine, Fabian Suchanek: Semantic Parsing — a survey, submitted to TACL 2023]
10
Contribution 2: Dealing with vague entities
Early generations
learned many of Elvis'
best songs by heart.
11
Contribution 2: Dealing with vague entities
Early generations
learned many of Elvis'
best songs by heart.
We studied the nature and prevalence of vague and unnamed entities, and proposed
methods to deal with them.
[Paris, El Aud, Suchanek: The Vagueness of Vagueness in Noun Phrases, AKBC 2021]
[Paris, Suchanek: Non-named entities - the silent majority, ESWC poster 2021]
Scalar vagueness (scale + threshold): 6% of noun phrases
Quantitative vagueness (number + threshold): 14%
Subjective vagueness (no agreed-upon scale): 9%
(in a sample Wikipedia corpus, including overlaps)
12
Contribution 3: Embedding unknown words
[Diagram: in the sentence "Elvis Presley was a famous perfromer.", the misspelled out-of-vocabulary word is embedded by the language model and linked to the known concept performer (type).]
[Chen, Varoquaux, Suchanek: Out-of-Vocabulary Embeddings, ACL 2022]
We devised a method to embed out-of-vocabulary words into the language model.
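A toy sketch of one common idea behind out-of-vocabulary embeddings (character n-gram composition in the style of fastText; not the paper's actual model):

```python
# Toy sketch: build a vector for an out-of-vocabulary word from character n-grams.
import numpy as np

rng = np.random.default_rng(0)

def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# Pretend these n-gram vectors were learned together with the language model
ngram_vectors = {g: rng.normal(size=50) for g in char_ngrams("performer")}

def embed(word):
    """Average the vectors of the word's known character n-grams."""
    vectors = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(50)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The misspelling "perfromer" shares most n-grams with "performer",
# so its vector lands close to the correct word's vector.
print(cosine(embed("perfromer"), embed("performer")))
```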
13
Contribution 4: Disambiguating Entity Names
[Diagram: the language model links the name "King Elvis" in "King Elvis was a famous performer." to the corresponding knowledge-base entity, of type performer.]
[Chen, Varoquaux, Suchanek: A Neural Model for Entity Linking, AAAI 2021]
[Chen, Varoquaux, Suchanek: GLADIS: A General and Large Acronym Disambiguation Benchmark, EACL 2023]
We devised methods to determine which entity of the knowledge base a name refers to.
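A toy sketch of the basic idea, scoring candidate entities by context overlap (the mini knowledge base below is invented; the cited papers use neural models):

```python
# Toy sketch: link an ambiguous name to a knowledge-base entity via context overlap.
KB = {
    "Elvis Presley":  {"aliases": {"elvis", "king elvis", "the king"},
                       "context": {"singer", "performer", "rock", "memphis"}},
    "Elvis Costello": {"aliases": {"elvis", "elvis costello"},
                       "context": {"singer", "songwriter", "london"}},
}

def link(mention, sentence_words):
    candidates = [e for e, info in KB.items() if mention.lower() in info["aliases"]]
    # Pick the candidate whose context shares the most words with the sentence
    return max(candidates,
               key=lambda e: len(KB[e]["context"] & sentence_words),
               default=None)

sentence_words = {"king", "elvis", "was", "a", "famous", "performer"}
print(link("Elvis", sentence_words))  # Elvis Presley
```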
14
Contribution 5: Textual inference with negation
[Diagram: from the sentence "King Elvis was a famous singer, but not a scientist.", the language model infers the type performer and rules out the type scientist.]
[Helwe, Clavel, Suchanek: “PyTorch library for logical reasoning”, EMNLP demo 2022]
[Helwe, Coumes, Clavel, Suchanek: “Textual inference with negation”, EMNLP Findings 2022]
Open-source library on GitHub.
We devised a method that allows a language model to perform logical inference on text
even in the presence of negation.
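A toy sketch of the kind of inference involved, at a purely symbolic level (illustrative code only, not the library's API):

```python
# Toy sketch: answering type questions from facts that may be explicitly negated.
positive = {("Elvis", "type", "singer")}
negative = {("Elvis", "type", "scientist")}   # stated NOT to hold

def entails(fact):
    """Return 'yes', 'no', or 'unknown' for a queried fact."""
    if fact in positive:
        return "yes"
    if fact in negative:
        return "no"
    return "unknown"

print(entails(("Elvis", "type", "singer")))     # yes
print(entails(("Elvis", "type", "scientist")))  # no  (the negation is respected)
print(entails(("Elvis", "type", "painter")))    # unknown (not the same as 'no')
```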
15
Contribution 6: YAGO knowledge base
[Diagram: from "King Elvis was married to Priscilla.", the language model links to YAGO facts about Elvis: type singer, born in 1935, plays the guitar, married to Priscilla, citizen of the USA.]
We built YAGO, a large knowledge base constructed from several online sources, with 2 billion facts about 60 million entities. Open code, open data.
[Tanon, Weikum, Suchanek: “YAGO 4: A reason-able knowledge base”, ESWC 2020] (Test of Time Award)
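A minimal sketch of querying such facts as RDF with SPARQL, assuming the rdflib library; the property names below are illustrative, not YAGO's actual vocabulary:

```python
# Minimal sketch: YAGO-style facts as RDF triples, queried with SPARQL
# (assumes `pip install rdflib`; illustrative property names, not YAGO's schema).
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Elvis_Presley, EX.type, EX.singer))
g.add((EX.Elvis_Presley, EX.birthYear, Literal(1935)))
g.add((EX.Elvis_Presley, EX.plays, EX.guitar))
g.add((EX.Elvis_Presley, EX.spouse, EX.Priscilla_Presley))
g.add((EX.Elvis_Presley, EX.citizenOf, EX.USA))

results = g.query("""
    SELECT ?p ?o WHERE { <http://example.org/Elvis_Presley> ?p ?o . }
""")
for predicate, obj in results:
    print(predicate, obj)
```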
16
Contribution 7: Reasoning on nested claims
[Diagram: the language model reads "An ex-employee claims Elvis Presley faked his death to join a witness protection program." What should the symbolic representation look like (???)?]
We are devising methods to represent nested claims and to reason about them.
17
Contribution 7: Reasoning on nested claims
[Diagram: the language model turns "An ex-employee claims Elvis Presley faked his death to join a witness protection program." into nested statements: the (ex-)employee claims that Elvis faked the statement "Elvis died in 1977", and that the cause was joining a witness protection program.]
[Suchanek: The need to move beyond triples, Text2Story 2020]
[Coumes, Paris, Suchanek: Quantifying over assertions and agents, in preparation]
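A toy sketch of the underlying representational idea: giving each statement an identifier so that other statements can refer to it (a form of reification; not the project's actual formalism):

```python
# Toy sketch: nested claims via statement identifiers (reification).
statements = {
    "s1": ("Elvis", "died", "1977"),
    "s2": ("Elvis", "joined", "witness protection"),
    "s3": ("Elvis", "faked", "s1"),          # s1 is used as an object here
    "s4": ("s3", "hasCause", "s2"),
    "s5": ("ex-employee", "claims", "s3"),
}

def unfold(sid, indent=0):
    """Print a statement, recursively expanding nested statement identifiers."""
    s, p, o = statements[sid]
    print("  " * indent + f"{s} {p} {o}")
    for part in (s, o):
        if part in statements:
            unfold(part, indent + 1)

unfold("s5")
# Only s1 ("Elvis died in 1977") is claimed to be faked; the system can reason
# about who asserts what without committing to the nested claim being true.
```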
18
Demo
This example reasons on beliefs.
ChatGPT incorrectly says that the text is logically coherent.
19
Demo
We can give a logical proof that the text is logically incoherent.
This is just to show one way of reasoning on natural language. We don't have the same budget as OpenAI :-)
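A toy sketch of one such way to check logical coherence symbolically: translate statements into propositional constraints and test whether any truth assignment satisfies them all (brute force; not the demo's actual reasoner):

```python
# Toy sketch: brute-force satisfiability check for a handful of propositions.
from itertools import product

variables = ["elvis_died", "elvis_alive"]

# Constraints extracted (by hand here) from a small text:
#   "Elvis died."                -> elvis_died
#   "Elvis is alive."            -> elvis_alive
#   "Dead people are not alive." -> not (elvis_died and elvis_alive)
constraints = [
    lambda v: v["elvis_died"],
    lambda v: v["elvis_alive"],
    lambda v: not (v["elvis_died"] and v["elvis_alive"]),
]

def coherent(constraints):
    """The text is coherent iff some truth assignment satisfies every constraint."""
    for values in product([True, False], repeat=len(variables)):
        v = dict(zip(variables, values))
        if all(c(v) for c in constraints):
            return True
    return False

print(coherent(constraints))  # False: no assignment satisfies all constraints
```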
20
The NoRDF Project
[Diagram: natural language text ("King Elvis was a famous singer.") is turned by the language model into a symbolic representation (Elvis: type singer).]
Our team:
We are looking for new partners to join!
Our partners: