Fabian M. Suchanek
Named Entity Recognition
59
>intro
Semantic IE
You
are
here
2
Source Selection and Preparation
Entity Recognition
Entity Disambiguation
singer
Fact Extraction
Reasoning
Instance Extraction
singer Elvis
Overview
3
• Named Entity Recognition
•
...by dictionary
•
...by regex
Named Entity
A
named entity
is an instance that has a name.
This typically concerns people, organizations, and artifacts.
Hitchhiker’s
guide to
the galaxy
4
Douglas Adams
Télécom Paris
Def: Named Entity Recognition
Named entity recognition
(NER) is the task of finding entity names
in a corpus.
5
(In its basic form, NER does not care to which entity the name belongs. It just finds names.)
In homage to Douglas Adams’ book “Hitchhiker’s Guide to the Galaxy”
Google replies to the question of “life, the universe, and everything”
with the number “42”.
Try it out
NER is difficult
6
Alexander the Great
?
?
Douglas Adam wrote the book The Hitchhiker’s guide to the galaxy in...
Cable and Wireless, a major company...
Microsoft and Dell both agreed...
In the beginning the Universe was created. This has made a lot of
people very angry and been widely regarded as a bad move.
Quotes form Douglas Adams.
Capital Letters Were Always The Best Way Of Dealing With Things
You Didn't Have A Good Answer To.
?
Def: Dictionary
A dictionary (also: gazetteer, lexicon) is a set of entity names.
NER by dictionary finds only names of the dictionary.
It can be used if the entities are known upfront.
• US states
• countries
• Dow Jones companies
• Actors of a
given movie
• ...
7
...lived in Los Angeles, California, while...
US states: {Alabama, Alaska, California, ...}
Books by Douglas Adams: {
Life, the Universe and Everything
Hitchhiker’s Guide to the Galaxy
Mostly Harmless
... }
Naive Dictionary NER is slow
The Hitchhiker’s Guide to the
Galaxy has “Don't Panic” on it,
in large, mostly friendly letters.
8
?
Books by Douglas Adams: {
Life, the Universe and Everything
Hitchhiker’s Guide to the Galaxy
Mostly Harmless
... }
Naive Dictionary NER is slow
The Hitchhiker’s Guide to the
Galaxy has “Don't Panic” on it,
in large, mostly friendly letters.
9
?
O(textLength × dictSize × maxWordLength)
Def: Trie
A trie is a tree, where nodes are labeled with
booleans and edges are labeled with characters.
10
A
D
A
O
M
S
R
A
A trie contains strings
A trie contains a string, if the string denotes a
path from the root to a node marked with “true”.
11
A
D
A
O
M
S
R
A
{ADA, ADAMS, ADORA}
Adding strings (1)
To add a string that is a prefix of an existing string,
switch node to “true”.
12
A
D
A
O
M
S
R
A
{ADA, ADAMS, ADORA}
+ ADAM
Adding strings (1)
{ADA, ADAM,
ADAMS, ADORA}
13
A
D
A
O
M
S
R
A
To add a string that is a prefix of an existing string,
switch node to “true”.
Adding strings (2)
To add another string, make a new branch.
{ADA, ADAM,
ADAMS, ADORA} + ART
14
A
D
A
O
M
S
R
A
Adding strings (2)
{ADA, ADAM,
ADAMS, ADORA, ART}
15
A
D
A
O
M
S
R
A
R
T
To add another string, make a new branch.
>task
Task: Tries
Start with an empty trie.
Add
• bon
• bonbon
• on
16
A
D
A
O
M
S
R
A
Tries can be used for NER
For every character in the doc
• advance as far as possible
in the trie and
• report match whenever
you meet a “true” node
Adams adores Adora.
17
A
D
A
M
S
R
A
O
...
Tries have good runtime
Adams adores Adora.
O(textLength × maxWordLength)
18
A
D
A
M
S
R
A
O
...
>tasks
Task: NER with tries
Do NER with the trie from the last task
on the document
on aime un bon bonbon.
19
>tasks
One more example
20
kofi anan eats an ananas in panama, ana!
a
n
a
n
i
Tricks with Tries
In practice, you can
• force tries to respect
word boundaries
• ignore upcase/lowcase
• preprocess the string
• ignore nested words
• ...
21
kofi anan eats an ananas in panama, ana!
a
n
a
n
i
Dictionary NER
Advantages:
• very efficient,
Disadvantages: dictionaries
• have to be given upfront
• have to be maintened to accommodate new names
• cannot deal with name variants
• cannot deal with infinite or unknown sets of names
(e.g., people’s names)
22
Overview
23
• Named Entity Recognition
•
...by dictionary
•
...by regex
Some names follow patterns
The trilogy consist of 5 books,
written in 1979, 1980, 1982,
1984, and 1992, respectively.
Dr. Frankie and Dr. Benjy
have no doctoral title.
Main street 42
West Country
People
with
titles
24
Years
Addresses
>alphabet
Def: Alphabet, Word
An
alphabet
is a set of symbols.
We will use as alphabet always implicitly the set of all unicode
characters.
25
A={a,b,c,d,e,f,0,1,2,3,4,5,6,7,8,9}
A={0,...9,a...,z,A...Z,!,?,...}
A
word
over an alphabet A is a sequence of symbols from A.
Since we use the alphabet of unicode characters, words are just strings.
hello!, 42, 3.141592, Douglas and Sally
also with spaces!
Def: Language
A
language
over an alphabet S is a set of words over S.
26
Set of strings
we want to describe
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = {ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
27
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = {ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
28
ca*
a*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Our description
(much shorter than the original set!)
Set of strings
we want to describe
Def: Language
A language over an alphabet S is a set of words over S.
29
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
30
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
31
a+ | b+
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Def: Language
A language over an alphabet S is a set of words over S.
32
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
(a | b)+
a+ | b+
Def: Language
A language over an alphabet S is a set of words over S.
33
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, Marvin}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
(a | b)+
a+ | b+
Def: Language
A language over an alphabet S is a set of words over S.
34
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
(a | b)+
a+ | b+
Def: Language
A language over an alphabet S is a set of words over S.
35
[A-Z][a-z]+ [A-Z][a-z]+
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
Our description
(much shorter than the original set!)
Set of strings
we want to describe
(ab)*
L1 = {Arthur Dent, Ford Prefect, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
Regular expressions
A regular expression (regex) describes a language.
36
The language of a regular expression
[0-9][0-9][0-9][0-9] = [0-9]{4}
(0 | 1 | ... | 9) = [0-9]
(a | b)+
a+ | b+
aa* = a+
ca*
a*
[A-Z][a-z]+ [A-Z][a-z]+
L1 = {Arthur Dent, Ford Prefect, ...}
L2 = {1900, 1901, 1982, 2013, 2017, ...}
L3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
L4 = {a, ab, abb, bbba, aaabbab, ababa, ...}
L5 = {a, b, aa, bb, aaa, bbb, ...}
L6 = {a, aa, aaa, ...}
L7 = { , ab, abab, ababab, ...}
L8 = {c, ca, caa, caaa, ...}
L9 = { , a, aa, aaa, ...}
L(E |F) = L(E) ∪ L(F)
L(x) = {x} for symbols x
Def: Regular expressions
A
regular expression
(regex) over an alphabet S is one of the following:
• the empty string
• an element of S
• a string of the form XY
• a string of the form (X|Y)
• a string of the form (X)*
where X and Y are regexes
A regex is associated
to a language:
37
its language
the regular
expression
(0 is a special regex that cannot appear as part of other regexes. It represents the empty language and does not match any string.)
L(x) = {x} for symbols x
L(E |F) = L(E) ∪ L(F)
->formal-grammars
Real-world example
38
Example from Wipolo via Dhouha Bouamor
Regular expressions cheat sheet
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
39
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Task: Regexes
L(a) = {a}
L(ab) = {ab}
L(a | b) = {a, b}
L(a*) = {, a, aa, aaa, ...}
L(a+) := L(aa*)
L([a-z]) := L(a|b|...|z)
L(a{2,4}) := L(aa|aaa|aaaa)
L(a{3}) := L(aaa)
L(a?) := L( | a)
L(.) := {a|b|..|A|...|0|...|!|$|...}
L(\.) := {.}
40
Simple symbol
Concatenation
Disjunction
Kleene star
shorthand for “one or more”
shorthand for a range
shortand for a given number
shortand for a given number
shorthand for “optional”
shorthand for “any symbol”
escape sequence for special symbols
Define regexes for
• numbers
• phone numbers
Try it
• HTML tags
• Names of the form “Dr. Blah Blub”
Helpful Web page
41
https://regex101.com/
Named regexes
When using regular expressions in a
program, it is a good idea to name them:
String digits = "[0-9]+";
String separator = "( |-)";
String pattern = digits+separator+digits;
But: Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable for
their apparent disinclination to do so. (Douglas Adams)
42
Regex groups
A regex group is a sequence of the form (...) in a regex.
the answer to ([a-z]+), the (([a-z]+) and ([a-z]+))
43
1st group
Regex groups
44
4th group
2nd group
3rd group
A regex group is a sequence of the form (...) in a regex.
the answer to ([a-z]+), the (([a-z]+) and ([a-z]+))
1st group
Groups are numbered by the order of their opening paranthesis.
Regex groups
He found the answer to life,
the universe, and everything
1st group: life
2nd group: universe and everything
3rd group: universe
4th group: everything
Try it out!
45
FSM>90
the answer to ([a-z]+), the (([a-z]+) and ([a-z]+))
How can we match regexes?
Douglas Adams fictional character names:
[A-Z][a-z]*ch[a-z]*
The Hitchhiker meets the Golgafrinchan civilisation,
but falls in love with a girl named Fenchurch.
46
?
Def: Finite State Machine
A
finite state machine
(FSM) is a directed multi-graph,
where each edge is labeled with a symbol or the empty symbol ε .
One node is labeled “start” (typically by an incoming arrow).
Zero or more nodes are labeled “final” (typically by a double circle).
47
a
b
Start
Final
d
c
See a more formal definition in
Wikipedia
>details
Def: Acceptance
An FSM
accepts
(also: generates) a string, if there is a path from the
start node to a final node whose edge labels are the string.
48
ad
bccd
bb
a
b
Start
Final
d
c
>details
Task: FSM
Find strings generated by the following FSM (Betelgeuse 7 language):
49
Final
l
Start
c
x
Final
j
>details
Notation
• start node
• final node
• empty transition
(can be walked without
accepting a symbol)
• multiple edges
ε
50
a, b
a
b
:=
>task
>trafo
>details
Task: FSM
Draw an FSM that accepts the following strings
Draw an FSM that accepts the following strings
51
br
kbr
brbr
kbrbr
brbrbr
kbrbrbr ...
ling
lingping
long
pong
lingpong
ping
pingpong
>trafo
Transforming a regex to a FSM (1)
1. Simplify the regex to contain only concatenation, alternation, kleene star
52
2. Start with this configuration:
REGEX
3. Handle alternation:
X|Y
X
Y
>details
4. Handle concatenation:
XY
X
Y
5. Handle Kleene star:
ε
(X)*
X
6. Proceed recursively, until
all edges are single symbols
FSM = Regex
For every regex, there is an FSM that accepts exactly the words
of the language of the regex (and vice versa).
53
Examples:
• k?(br)+
• ((l|p)(i|o)ng)*
• f{2,3}
>details
Runtime of regexes
Given a word of length l and given an FSM with n states,
determining whether the FSM accepts the word
• takes O(l) time if no state has several outgoing edges
with the same label
•
else (by making it deterministic on the fly)
54
There is a looooot more to say about FSMs, e.g.:
• making them deterministic
• compressing them
• making them more powerful
• learning FSMs from examples
Here, we only use them for IE
Regexes in programming
Regex
Simplified regex
FSM
Matcher
His favorite numbers
are 42, 4200, and 19.
your
programming
language
55
42(0)+
420(0)*
4
2
0
0
you
Example: Regexes in Python
import re
pattern = re.compile("42(0)+")
match = pattern.search(line)
if match!=None:
print(match.group())
56
4200
His favorite numbers
are 42, 4200, and 19.
Entity Recognition
We have seen 2 methods to do entity recognition:
• Tries (if the set of names is known)
• Regexes (if the names follow a pattern)
Douglas N. Adams had the idea for the “Hitchhiker’s Guide”
while lying drunk in a field near Innsbruck.
57
->evaluation
->named-entity-annotation
->disambiguation
->dipre
Sunita Sarawagi: Information Extraction
References: