Named Entity Recognition and Classification
CC-BY
Fabian M. Suchanek
Semantic IE
[Figure: overview of Semantic IE with the topics Knowledge representation, Entity Recognition (you are here), Entity Disambiguation, Fact Extraction, KB construction, and Entity Typing. Example labels in the figure: “singer”, “singer Elvis”.]
Overview
Introduction:
• Named Entity Recognition and Classification (NERC)
• NERC Features
Methods:
• NERC by rules (no training)
• NERC by Classification (with training)
• NERC by Conditional Random Fields (with training)
• NERC by Deep Learning (with training)
Summary
Def: NE Recognition & Classification
Given a corpus, and given a set of classes, Named Entity Recognition and Classification (NERC) is the task of (1) finding entity names in the corpus and (2) annotating each name with a class.
(NERC is often called simply “Named Entity Recognition”. We use “Named Entity Recognition and Classification” here to distinguish it from bare NER.)
The Enlightenment was a philosophical movement in Europe
between 1650 and 1800, driven by Denis Diderot with his
“Encyclopédie”, David Hume, Montesquieu, Voltaire, John Locke,
Olympe de Gouges, and others.
Wikipedia: Enlightenment
[Theobald von Oer]
classes = {Person, Location, Event, Work, Date}
Two tasks: (1) finding the entity boundaries and (2) mapping them to a class.
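The <event>Enlightenment</event> was a philosophical movement in <loc>Europe</loc> between <date>1650</date> and <date>1800</date>, driven by <per>Denis Diderot</per> with his <work>“Encyclopédie”</work>, <per>David Hume</per>, <per>Montesquieu</per>, <per>Voltaire</per>, <per>John Locke</per>, <per>Olympe de Gouges</per>, and others.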
Classes
NERC usually focuses on the classes person, location, and organization. Some systems also extract money, percent, phone number, job title, artefact, brand, product, proteins, drugs, etc.
[Sobha Lalitha Devi]
NERC Classes in Spacy.io
[Spacy.io: NER]
PERSON        People, including fictional.
NORP          Nationalities or religious or political groups.
FAC           Buildings, airports, highways, bridges, etc.
ORG           Companies, agencies, institutions, etc.
GPE           Countries, cities, states.
LOC           Non-GPE locations, mountain ranges, bodies of water.
PRODUCT       Objects, vehicles, foods, etc. (Not services.)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART   Titles of books, songs, etc.
LAW           Named documents made into laws.
LANGUAGE      Any named language.
DATE          Absolute or relative dates or periods.
TIME          Times smaller than a day.
PERCENT       Percentage, including “%”.
MONEY         Monetary values, including unit.
QUANTITY      Measurements, as of weight or distance.
ORDINAL       “first”, “second”, etc.
CARDINAL      Numerals that do not fall under another type.
NERC examples
As Immanuel Kant wrote in 1784, the Enlightenment is
mankind’s emergence from its self‐incurred immaturity.
in XML
...<per>Immanuel Kant</per> wrote in <date>1784</date>...
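The inline XML annotation can be read back into (entity, class) pairs with a regular expression. This is a minimal sketch, assuming non-nested tags of the form <class>name</class>; the helper name `extract_entities` is hypothetical and not part of any library.

```python
import re

# Matches one non-nested annotation such as <per>Immanuel Kant</per>.
# The backreference \1 makes sure the closing tag matches the opening tag.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_entities(annotated_text):
    """Return a list of (entity name, class) pairs from inline XML annotations."""
    return [(m.group(2), m.group(1)) for m in TAG_PATTERN.finditer(annotated_text)]

text = "...<per>Immanuel Kant</per> wrote in <date>1784</date>..."
print(extract_entities(text))
```

A regular expression is enough here because the annotations are flat; nested or attribute-carrying tags would call for a real XML parser.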
in TSV by token
TSV file of
• sentence number
• token
• class
42	Immanuel	PER
43	Kant	PER
44	wrote	OTH
45	in	OTH
in TSV by token with “BIO” model
The “BIO” model tags each token with “begin”, “inside”, or “outside”.
42	Immanuel	B-PER
43	Kant	I-PER
44	wrote	O
45	in	O
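The BIO scheme can be produced from entity spans and decoded back into entity names. The following is a minimal sketch, assuming spans are given as (start, end, class) token offsets with an exclusive end; `to_bio` and `from_bio` are hypothetical helper names.

```python
def to_bio(tokens, spans):
    """Turn (start, end, cls) token spans (end exclusive) into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, cls in spans:
        tags[start] = "B-" + cls          # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I-" + cls          # remaining tokens of the name
    return tags

def from_bio(tokens, tags):
    """Group BIO tags back into (entity name, class) pairs."""
    entities, current, current_cls = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_cls)
        if starts_new or tag == "O":
            if current:                    # close the name collected so far
                entities.append((" ".join(current), current_cls))
            current, current_cls = ([token], tag[2:]) if starts_new else ([], None)
        else:                              # "I-" tag that continues the current name
            current.append(token)
    if current:
        entities.append((" ".join(current), current_cls))
    return entities

tokens = ["Immanuel", "Kant", "wrote", "in", "1784"]
tags = to_bio(tokens, [(0, 2, "PER"), (4, 5, "DATE")])
print(tags)
print(from_bio(tokens, tags))
```

Compared with the plain PER/OTH tagging, the separate “begin” tag makes it possible to tell apart two names of the same class that directly follow each other.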
TSV by entity
TSV file of
• document
• entity
• class
Enlightenment	Immanuel Kant	PER
Enlightenment	1784	DATE
Enlightenment	Enlightenment	EVENT
Try this
• The first name helps: without it, the name is not recognized.
• GPE = countries, cities, states.
• Recognizing time expressions is a research field of its own.
Now do it here:
We have determined the crystal structure of a triacylglycerol lipase from Pseudomonas
cepacia (Pet) in the absence of a bound inhibitor using X-ray crystallography. The
structure shows the lipase to contain an alpha/beta-hydrolase fold and a catalytic
triad comprising of residues Ser87, His286 and Asp264. The enzyme shares ...
[Diana Maynard: Named Entity Recognition]
Deep Learning can’t do everything
Deep learning approaches can be trained to do NERC, but they do not reach 100% precision, and they cannot be corrected when they are wrong.
[Figure callouts: “Wrong. What do we do now?” “The model cannot be ‘corrected’!”]
=> It is not sufficient to say “Deep Learning can do it”. We have to understand the task and the methods.
Training Data
Training data for NERC is a corpus that is already annotated with NERC classes, and from which the NERC method can learn.
Training data may or may not be available.
What a NERC method could learn:
• <date> always numbers
• <work> often in “...”
• <loc> preceded by “in the”
Also in <date>2022</date>, we have seen a rise in profit in the <loc>UK</loc>.
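A cue such as “<loc> preceded by ‘in the’” can be read off annotated text by counting the words that precede each annotation. This is a minimal sketch, assuming inline XML annotations as in the example sentence; `left_contexts` is a hypothetical helper, and an actual NERC method would learn weights for such cues rather than merely count them.

```python
import re
from collections import Counter, defaultdict

# One non-nested annotation such as <loc>UK</loc>.
ANNOTATION = re.compile(r"<(\w+)>(.*?)</\1>")

def left_contexts(annotated_sentences, width=2):
    """Count, per class, the `width` words that precede each annotated name."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        for match in ANNOTATION.finditer(sentence):
            # Text before the annotation, with earlier tags stripped.
            before = re.sub(r"</?\w+>", "", sentence[:match.start()])
            context = " ".join(before.split()[-width:]).lower()
            counts[match.group(1)][context] += 1
    return counts

corpus = ["Also in <date>2022</date>, we have seen a rise in profit "
          "in the <loc>UK</loc>."]
counts = left_contexts(corpus)
print(counts["loc"].most_common(1))
print(counts["date"].most_common(1))
```

With a single training sentence the counts are of course anecdotal; the point is only that annotated text makes such regularities countable.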
Training Data: Examples
There exist large training data corpora for English
...but not for all languages and domains!
[Li et al: A Survey on Deep Learning for Named Entity Recognition, TKDE 2020]
Def: Token and window
A token is a sequence of characters that forms a unit, such as a word, a punctuation symbol, a number, etc.
A window of width δ of a token t in a corpus is the sequence of δ tokens before t, the token t itself, and δ tokens after t.
The Enlightenment argued for the rule of law, tolerance towards other creeds, and freedom of thought and speech.
Window of width δ=3 around “law”:
[the; rule; of; law; , ; tolerance; towards]
(Position -1: of. Position 0: law. Position +1: ,)
Window of width δ=3 around “,”:
[rule; of; law; , ; tolerance; towards; other]
(Position -1: law. Position 0: , Position +1: tolerance)
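The window definition translates directly into list slicing. This is a minimal sketch, assuming the corpus is already tokenized into a Python list; the function name `window` and the choice to simply drop positions outside the corpus are assumptions (padding with a special symbol would be another option).

```python
def window(tokens, position, delta):
    """Return the window of width delta around tokens[position].

    Positions that fall outside the corpus are left out.
    """
    start = max(0, position - delta)
    return tokens[start:position + delta + 1]

tokens = ["The", "Enlightenment", "argued", "for", "the", "rule", "of", "law",
          ",", "tolerance", "towards", "other", "creeds", ",", "and",
          "freedom", "of", "thought", "and", "speech", "."]

print(window(tokens, tokens.index("law"), 3))
print(window(tokens, tokens.index(","), 3))
```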
NERC Feature
A NERC feature is a function that takes as input a token and returns as output a real value (typically 0 or 1).
isPunctuation(“tolerance”)=0
isPunctuation(“;”)=1
The Enlightenment argued for the rule of law, tolerance towards other creeds, and freedom of thought and speech.
Feature values for the window [rule; of; law; , ; tolerance; towards; other]:
token                 rule  of  law  ,  tolerance  towards  other
is stopword           0     1   0    0  0          1        1
matches [A-Z][a-z]+   0     0   0    0  0          0        0
is punctuation        0     0   0    1  0          0        0
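The three features of this example can be written as small functions that map a token to 0 or 1. This is a minimal sketch; the stopword list is a tiny illustrative assumption, and the function names are hypothetical.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "of", "and", "for", "towards", "other"}

def is_stopword(token):
    return int(token.lower() in STOPWORDS)

def matches_capitalized(token):
    # 1 if the whole token matches [A-Z][a-z]+
    return int(re.fullmatch(r"[A-Z][a-z]+", token) is not None)

def is_punctuation(token):
    return int(len(token) > 0 and all(ch in string.punctuation for ch in token))

FEATURES = [is_stopword, matches_capitalized, is_punctuation]

def feature_vector(token):
    """Apply every feature to the token and collect the values."""
    return [feature(token) for feature in FEATURES]

window_tokens = ["rule", "of", "law", ",", "tolerance", "towards", "other"]
for token in window_tokens:
    print(token, feature_vector(token))
```

Stacking the vectors of all tokens in a window gives the 0/1 matrix that a learning method takes as input.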
Syntactic Features
• capitalized word: Kant
• all upper case word: UN
• lowercase word: world
• mixed letters/numbers: HG2G
• number: 1789
• special symbol: $
• space
• punctuation: .,;:?!
• regular expression: [A-Z][a-z]+
The Stanford NERC system used string patterns: [Jenny Rose Finkel]
Paris -> Xxxx
M2 DataAI -> X# XxxxXX
+33 1234 -> +## ####
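A string pattern of this kind can be computed by replacing each character with a symbol for its character class. This is a minimal sketch and not the Stanford system's actual code; the function name `word_shape` is hypothetical. It maps every character one by one and does not shorten runs of repeated symbols.

```python
def word_shape(text):
    """Map upper case letters to X, lower case letters to x, digits to #.

    All other characters (spaces, symbols) are kept as they are.
    """
    shape = []
    for ch in text:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("#")
        else:
            shape.append(ch)
    return "".join(shape)

print(word_shape("M2 DataAI"))
print(word_shape("+33 1234"))
```

The shape can be used as a feature value itself, or compared against a fixed shape to obtain a 0/1 feature.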
Dictionary Features
• cities: Paris
• countries: France
• titles: Dr.
• common names: David
• airport codes: CDG
• words that identify a company: Inc, Corp, ...
• common nouns: car, president, ...
• hard‐coded words: M2
... if you have a dictionary.
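A dictionary feature is a set lookup. This is a minimal sketch, assuming each dictionary is available as a Python set; the entries are just the examples from the list, and the names `DICTIONARIES` and `dictionary_features` are hypothetical.

```python
# Toy dictionaries holding only the example entries from the slide.
# Real dictionaries would be loaded from files or a knowledge base.
DICTIONARIES = {
    "city": {"Paris"},
    "country": {"France"},
    "title": {"Dr."},
    "common_name": {"David"},
    "airport_code": {"CDG"},
    "company_word": {"Inc", "Corp"},
}

def dictionary_features(token):
    """Return one 0/1 feature per dictionary: is the token listed in it?"""
    return {name: int(token in entries) for name, entries in DICTIONARIES.items()}

print(dictionary_features("Paris"))
print(dictionary_features("Kant"))
```

Sets give constant-time lookup on average, which matters once the dictionaries contain many thousands of names.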
Def: POS
A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is the grammatical role that a word takes.
Alice wrote a really great book by herself
Alice     Proper Name
wrote     Verb
a         Determiner
really    Adverb
great     Adjective
book      Noun
by        Preposition
herself   Pronoun
POS Tag Features for NERC
[Penn Treebank symbols]
DT    Determiner
IN    Preposition or subordinating conjunction
JJ    Adjective
NN    Noun, singular or mass
NNP   Proper noun, singular
PRP   Personal pronoun
RB    Adverb
SYM   Symbol
VBZ   Verb, 3rd person singular present
...
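POS Tagging in Python
NLTK:
import nltk
# Assumes the NLTK tokenizer and tagger models are installed (see nltk.download).
sentence = 'Time flies like an arrow. Fruit flies like a banana.'
tokens = nltk.word_tokenize(sentence)  # split the text into tokens
tagged = nltk.pos_tag(tokens)          # list of (token, Penn Treebank tag) pairs
# -> [('Time', 'NN'), ('flies', 'VBZ'), ...]
Spacy.io:
import spacy
# The model en_core_web_sm has to be installed separately.
nlp = spacy.load('en_core_web_sm')
sentence = 'Time flies like an arrow.'
for token in nlp(sentence):
    # pos_ is the coarse-grained tag; token.tag_ gives the fine-grained tag
    print(token, token.pos_)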
Morphological features
• word endings: -ish, -ist, ...
• word contains an n‐gram: Par, ari, ris
• word contains n‐grams at boundaries: #Par, ari, ris#
Intuition: quite often, the morphology of the word gives a hint about its type. Examples:
London Bank -> ORG
Cotramoxazole -> DRUG
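The n-gram examples for “Paris” can be reproduced with a few lines. This is a minimal sketch, assuming character trigrams and “#” as the boundary marker as in the example; the function names and the default suffix list are hypothetical.

```python
def char_ngrams(word, n=3):
    """All character n-grams of the word, from left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def boundary_ngrams(word, n=3):
    """Like char_ngrams, but the first and last n-gram carry a '#' boundary mark."""
    grams = char_ngrams(word, n)
    if grams:
        grams[0] = "#" + grams[0]
        grams[-1] = grams[-1] + "#"
    return grams

def has_suffix(word, suffixes=("ish", "ist")):
    """1 if the word ends in one of the given endings, else 0."""
    return int(word.lower().endswith(suffixes))

print(char_ngrams("Paris"))
print(boundary_ngrams("Paris"))
print(has_suffix("British"))
```

Each distinct n-gram or ending then becomes one 0/1 feature, so that a learning method can pick up endings that are typical for a class.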
Pattern features
The phrases in which a word occurs can give a hint about its class:
married X => person
bought X => product
X’s capital => country
headquarters of X => organization
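The four patterns can be sketched as regular expressions in which X stands for a candidate name. This is a minimal sketch under the assumption that X is a sequence of capitalized words, which will miss lowercase product names; the names `PATTERNS` and `pattern_hints` are hypothetical, and the example sentences are made up for illustration.

```python
import re

# Assumption: a candidate name X is one or more capitalized words.
X = r"(?P<x>[A-Z]\w*(?: [A-Z]\w*)*)"

PATTERNS = [
    (re.compile(r"\bmarried " + X), "person"),
    (re.compile(r"\bbought " + X), "product"),
    (re.compile(X + r" ?['’]s capital\b"), "country"),
    (re.compile(r"\bheadquarters of " + X), "organization"),
]

def pattern_hints(sentence):
    """Return (name, class) hints for every pattern that matches the sentence."""
    hints = []
    for pattern, cls in PATTERNS:
        for match in pattern.finditer(sentence):
            hints.append((match.group("x"), cls))
    return hints

print(pattern_hints("Elvis married Priscilla Presley."))
print(pattern_hints("Paris is France's capital."))
```

Such a hint is only one feature among others: a pattern match makes a class more likely, but does not decide it.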
Overview
Introduction:
• Named Entity Recognition and Classification (NERC)
• NERC Features
Methods:
• NERC by rules (without training)
• NERC by Classification (with training)
• NERC by Conditional Random Fields (with training)
• NERC by Deep Learning (with training)
• NERC by Generative Language Models (LLMs) (without training)
Summary
Summary: NERC
NERC (named entity recognition and classification) finds entity names and annotates them with predefined classes.
The <event>Enlightenment</event> was a philosophical movement in <loc>Europe</loc>.
• Rule-based NERC: [CapWord] says => pers
• NERC by Machine Learning
• NERC by Conditional Random Fields
• NERC by Deep Learning
• NERC by Generative Language Models: “Extract entities from the following text: ...”
Existing NERC tools
[Figure: existing NERC tools from academia and industry] [Li et al: A Survey on Deep Learning for Named Entity Recognition, TKDE 2020]
Datasets: PapersWithCode.com
References
• Sunita Sarawagi: Information Extraction
• Diana Maynard: Named Entity Recognition
• Li et al: A Survey on Deep Learning for Named Entity Recognition, TKDE 2020