Fabian M. Suchanek
Evaluation
Semantic IE: You are here
[Pipeline: Source Selection and Preparation → Entity Recognition → Entity Disambiguation → Instance Extraction → Fact Extraction → Reasoning; running example: "singer Elvis"]
Detect members of the Simpsons

Corpus: "In The Simpsons, Homer Simpson is the father of Bart Simpson and Lisa Simpson. The M above his ear is for Matt Groening."

1. [A-Z][a-z]+ Simpson           4 matches (1 wrong)
2. [A-Z][a-z]+ [A-Z][a-z]+       5 matches (2 wrong)
3. Homer Simpson                 1 match
Def: Gold Standard
The gold standard (also: ground truth) for an IE task is the set of
desired results of the task on a given corpus.
Task: Detect Simpson members
Corpus: "In The Simpsons, Homer Simpson is the father of Bart Simpson and Lisa Simpson. The M above his ear is for Matt Groening."
Gold Standard: {Homer Simpson, Bart Simpson, Lisa Simpson}
Def: Precision
The precision of an IE algorithm is the ratio of its outputs that are in the respective gold standard.

Output: {Homer, Bart, Groening}
Gold Standard: {Homer, Bart, Lisa, Marge}
=> Precision: 2/3 ≈ 66%
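A minimal sketch of this definition (illustrative, not from the slides), with output and gold standard as Python sets:

    def precision(output, gold):
        # ratio of the output that is in the gold standard
        return len(output & gold) / len(output)

    output = {"Homer", "Bart", "Groening"}
    gold   = {"Homer", "Bart", "Lisa", "Marge"}
    print(precision(output, gold))   # 2/3 ≈ 0.67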
Def: Recall
The recall (also: sensitivity, true positive rate, hit rate) of an IE algorithm is the ratio of the gold standard that is output.

Output: {Homer, Bart, Groening}
Gold Standard: {Homer, Bart, Lisa, Marge}
=> Recall: 2/4 = 50%
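The corresponding sketch for recall (same illustrative sets as above):

    def recall(output, gold):
        # ratio of the gold standard that is output
        return len(output & gold) / len(gold)

    output = {"Homer", "Bart", "Groening"}
    gold   = {"Homer", "Bart", "Lisa", "Marge"}
    print(recall(output, gold))   # 2/4 = 0.5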
Precision-Recall Tradeoff
It is very hard to get both good precision and good recall. Algorithms usually allow varying one at the expense of the other (e.g., by choosing different threshold values). This usually yields:

[Plot: precision vs. recall. At one end: very good results, but too few of them (high precision, low recall). At the other end: all good results, but many wrong ones, too (high recall, low precision). What we want: both high.]
Def: F1

Gold Standard: {Homer, Bart, Lisa, Snowball_4, ..., Snowball_100}
Output: {Homer Simpson}
Precision: 1/1 = 100%, Recall: 1/100 = 1%

To trade off precision and recall, we could use the average:
Average: (100% + 1%) / 2 ≈ 50%
Outputting just a single result already gives a score of about 50%!

The F1 measure is the harmonic mean of precision and recall:
F1 = 2 · precision · recall / (precision + recall)
   = 2 · 100% · 1% / (100% + 1%) ≈ 2%
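A minimal sketch of why the harmonic mean is the better tradeoff (illustrative):

    def f1(p, r):
        # harmonic mean of precision and recall
        return 2 * p * r / (p + r)

    p, r = 1.0, 0.01        # precision 100%, recall 1%
    print((p + r) / 2)      # arithmetic mean: 0.505 (looks decent)
    print(f1(p, r))         # harmonic mean: ~0.02 (exposes the low recall)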
Task: Precision & Recall
What is the algorithm output, the gold standard, the precision, and the recall in the following cases?
1. Nostradamus predicts a trip to the moon for every century from the 15th to the 20th incl.
2. The weather forecast predicts that the next 3 days will be sunny. It does not say anything about the 2 days that follow. In reality, it is sunny during all 5 days.
3. On Elvis Radio™, 90% of the songs are by Elvis. An algorithm learns to detect Elvis songs. Out of 100 songs on Elvis Radio, the algorithm says that 20 are by Elvis (and says nothing about the other 80). Out of these 20 songs, 15 were by Elvis and 5 were not.
4. How can you improve the algorithm?
Imbalanced classes

Population: {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output: {Snowball_1, ..., Snowball_99, Snowball_100}

Precision: 99/100 = 99%
Recall: 99/99 = 100%

If there are very few negatives, just outputting all elements gives great scores.

Def: Problem of imbalanced classes
The negatives are the elements of the population that are not in the gold standard. The problem of imbalanced classes appears when only very few of the items of the population are not in the gold standard: an approach that outputs the entire population has a very high precision and a perfect recall.
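A minimal sketch of the problem (illustrative): the trivial "output everything" approach already scores almost perfectly here.

    population = {f"Snowball_{i}" for i in range(1, 101)}   # 100 items
    gold       = {f"Snowball_{i}" for i in range(1, 100)}   # 99 positives
    output     = population                                 # output everything

    print(len(output & gold) / len(output))   # precision: 0.99
    print(len(output & gold) / len(gold))     # recall:    1.0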
Def: Confusion Matrix

Population: {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output: {Snowball_1, ..., Snowball_99, Snowball_100}

The confusion matrix for the output of an algorithm looks as follows:

                           Gold standard
                           Positive            Negative
    Output    Positive     True Positives      False Positives
              Negative     False Negatives     True Negatives

The columns are the (Gold) Positives and the (Gold) Negatives (the items of the population that are not in the gold standard); the rows are the Predicted Positives and the Predicted Negatives (the items of the population that are not output). In "True Negative", "Negative" means that the item was not output, and "True" means that this was correct.

On the Snowball example:

                           Gold standard
                           Positive     Negative
    Output    Positive     99           1            (100 predicted positives)
              Negative     0            0            (0 predicted negatives)
                           (99 gold     (1 gold
                           positives)   negative)

1 item was output as positive, but was negative in the gold standard.

Precision = true positives / predicted positives = 99/100 = 99%
Recall    = true positives / gold positives      = 99/99  = 100%
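A minimal sketch that computes these four counts from sets (illustrative):

    def confusion(population, gold, output):
        tp = len(output & gold)                # output and in gold standard
        fp = len(output - gold)                # output but not in gold standard
        fn = len(gold - output)                # in gold standard but not output
        tn = len(population - gold - output)   # neither output nor in gold standard
        return tp, fp, fn, tn

    population = {f"Snowball_{i}" for i in range(1, 101)}
    gold       = {f"Snowball_{i}" for i in range(1, 100)}
    output     = population

    tp, fp, fn, tn = confusion(population, gold, output)
    print(tp, fp, fn, tn)     # 99 1 0 0
    print(tp / (tp + fp))     # precision: 99/100
    print(tp / (tp + fn))     # recall:    99/99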
Confusion with confusion matrixes

A confusion matrix does not always make sense in an information extraction scenario:

Population: {H, Ho, Hom, ..., o, om, ome, ..., r Sim, r Simps, ...} (all substrings of the corpus)
Gold Standard: {Homer}
Output: {Homer}

                           Gold standard
                           Positive     Negative
    Output    Positive     1            0
              Negative     0            39462440205

A confusion matrix makes sense only when the population is limited (e.g., in classification tasks)!
31
Our problem
The problem is that the algorithm did not catch the negatives,
it has a “low recall” on the negatives.
Population:
Gold Standard:
Output:
{Snowball_1,..., Snowball_99, Snowball_100}
{Snowball_1,..., Snowball_99}
{Snowball_1,..., Snowball_99, Snowball_100}
Gold standard
Positive
Negative
Positive
99
0
1
0
Positive
Negative
Output
Def: True Negative Rate & FPR

The true negative rate (also: TNR, specificity, selectivity) is the ratio of negatives that are output as negatives (= the recall on the negatives):

TNR = true negatives / gold negatives = 0/1 = 0%

The False Positive Rate (also: FPR, fall-out) is 1 - TNR.
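A minimal sketch of TNR and FPR on the Snowball example (illustrative):

    def tnr(population, gold, output):
        negatives = population - gold          # gold negatives
        true_negatives = negatives - output    # negatives that were not output
        return len(true_negatives) / len(negatives)

    population = {f"Snowball_{i}" for i in range(1, 101)}
    gold       = {f"Snowball_{i}" for i in range(1, 100)}
    output     = population

    t = tnr(population, gold, output)
    print(t)       # TNR: 0.0
    print(1 - t)   # FPR: 1.0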
TNR & Precision

On the Snowball example: Precision: 99/100 = 99%, Recall: 99/99 = 100%, TNR: 0/1 = 0%.

TNR and precision both measure the "correctness" of the output.

Precision:
• measures wrt. the output
• suffers from imbalanced classes
• works if the population is infinite (e.g., the set of all extractable entities)

TNR:
• measures wrt. the population
• guards against imbalance
• works if the population is limited (e.g., in classification)
Informedness

On the Snowball example: Precision: 99/100 = 99%, Recall: 99/99 = 100%, TNR: 0/1 = 0%.

The informedness (also: Youden's J statistic, Youden's index) combines TNR and recall as follows:

informedness = recall + TNR - 1 = recall - FPR
             = 100% + 0% - 100% = 0%

(It's kind of the F1 measure in the case of a limited population.)
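As a quick check on the Snowball example (illustrative values from above):

    recall_value, tnr_value = 1.0, 0.0
    informedness = recall_value + tnr_value - 1
    print(informedness)   # 0.0: uninformative output, despite 99% precision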
Def: ROC

The ROC (receiver operating characteristic) curve plots recall against the FPR for different thresholds of the algorithm. It guards against imbalanced classes, and is applicable when the population is finite.

[Plot: recall on the y-axis, false positive rate (FPR) on the x-axis. The bottom-left corner means no bad results, but also no good ones; the top-right corner means many good results, but also many bad ones; what we want is the top-left corner. The diagonal is what an algorithm would get if it randomly chose the positive items from the population; the vertical distance of a point above the diagonal is the informedness.]

If an algorithm has no threshold to tune, we can always simulate a curve: by randomly adding items from the population to the output, and by randomly removing items from the output.
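A minimal sketch of how ROC points could be computed, assuming the algorithm assigns a confidence score to every item of the population (the scores dictionary is a hypothetical example):

    def roc_points(scores, gold):
        # scores: confidence score per item; gold: set of positive items
        # (assumes at least one positive and one negative in scores)
        num_pos = len(gold)
        num_neg = len(scores) - num_pos
        points = [(0.0, 0.0)]   # threshold above all scores: empty output
        for threshold in sorted(set(scores.values()), reverse=True):
            output = {item for item, s in scores.items() if s >= threshold}
            tp = len(output & gold)
            fp = len(output - gold)
            points.append((fp / num_neg, tp / num_pos))   # (FPR, recall)
        return points

    scores = {"Snowball_1": 0.9, "Snowball_2": 0.8, "Homer": 0.7, "Snowball_3": 0.6}
    print(roc_points(scores, gold={"Snowball_1", "Snowball_2", "Snowball_3"}))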
Def: AUC

The AUC (area under curve) is the area under the ROC curve. It corresponds to the probability that the classifier ranks a random positive item over a random negative item. (It's kind of the F1 for a limited population and a varying threshold.)
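A minimal sketch of this ranking interpretation (illustrative): the AUC is the fraction of (positive, negative) pairs in which the positive item gets the higher score, counting ties as half.

    def auc(scores, gold):
        positives = [s for item, s in scores.items() if item in gold]
        negatives = [s for item, s in scores.items() if item not in gold]
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
                   for p in positives for n in negatives)
        return wins / (len(positives) * len(negatives))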
Caution with AUC

[Plot: an ROC curve with a very large area under it.]

The following algorithm has a great AUC value, but it is still not a great algorithm.
How not to design an IE algorithm

Task: Find Simpson pets
Corpus: [...]

Algorithm: Regex "Snowball I*"
Output: {Snowball I, Snowball II}

The regex misses Snowball IV, so we tweak it:

Algorithm: Regex "Snowball (I|V)*"
Output: {Snowball I, Snowball II, Snowball IV}

Is this algorithm good?
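A minimal sketch of the trap, on an invented toy corpus (the real corpus is not reproduced here): the tweaked regex merely memorizes the pets seen so far and still misses any unseen pet name.

    import re

    corpus = "Snowball I ... Snowball II ... Snowball IV ... Coltrane"

    print(set(re.findall(r"Snowball (?:I|V)*", corpus)))
    # {'Snowball I', 'Snowball II', 'Snowball IV'} -- Coltrane is never found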
How to design an IE algorithm

Task: Find Simpson pets

1. Take only a sample of the corpus, e.g.:

   "Lisa decides to play music on her saxophone for Coltrane, but the noise frightens him and he commits suicide. As Gil swerves to avoid hitting Snowball V, his car hits a tree and bursts into flames. Since the cat is unhurt, Lisa takes it as a sign of good luck and adopts her. [...]"

2. Consider only the sample corpus.
3. Manually make a gold standard, e.g., {Coltrane, Snowball I, ...}.
4. Run the algorithm on the sample corpus, yielding an output {...}.
5. Feed the output and the gold standard to an evaluator, which computes precision and recall.
6. Improve the algorithm, and repeat.
Evaluation on a Sample

Corpus:
1: ...A...B...
2: ...C....
3: ..D.....E...
4: .....H.....
5: ..I...J...K...

Sample (documents 3 and 4):
3: ..D.....E...
4: .....H.....

Gold Standard on the sample: {D, E, H}

Algorithm output: {1: A, Z; 2: C; 3: D, E, K; 4: L; 5: I, K, X}
Output on the sample: {3: D, E, K; 4: L}

Precision: 2/4
Recall: 2/3

A, B, etc. can be entities, but also facts. Note that document 5 is not considered for computing the recall!
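A minimal sketch of this computation (illustrative), with one output set per document:

    gold   = {3: {"D", "E"}, 4: {"H"}}                    # gold standard on the sample
    output = {1: {"A", "Z"}, 2: {"C"}, 3: {"D", "E", "K"},
              4: {"L"}, 5: {"I", "K", "X"}}               # output on the whole corpus

    # restrict the output to the sampled documents
    sample_output = {doc: output.get(doc, set()) for doc in gold}
    tp = sum(len(sample_output[doc] & gold[doc]) for doc in gold)

    print(tp / sum(len(s) for s in sample_output.values()))  # precision: 2/4
    print(tp / sum(len(s) for s in gold.values()))           # recall:    2/3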
Simple case: 1 target per document

Corpus:
A: ...A'...
B: ...B'...
C: ...C'...
D: ...D'...
E: ...E'...

Sample (documents C, D, E):
C: ...C'...
D: ...D'...
E: ...E'...

Gold Standard on the sample: {C: C', D: D', E: E'}

Case 1: the algorithm outputs nothing for E.
Algorithm output: {A: A', B: X, C: Z, D: D'}
Sample output: {C: Z, D: D'}
Precision: 1/2, Recall: 1/3

Case 2: the algorithm produces one output per input.
Algorithm output: {A: A', B: X, C: Z, D: D', E: K}
Sample output: {C: Z, D: D', E: K}
Precision: 1/3, Recall: 1/3

If the algorithm produces one output per input, precision = recall.
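A minimal sketch of why the two coincide (illustrative): with exactly one output per sampled document, the number of outputs equals the number of gold entries, so both ratios share the same denominator.

    gold   = {"C": "C'", "D": "D'", "E": "E'"}   # one target per document
    output = {"C": "Z", "D": "D'", "E": "K"}     # one output per document

    correct = sum(output[doc] == target for doc, target in gold.items())
    print(correct / len(output))   # precision: 1/3
    print(correct / len(gold))     # recall:    1/3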
Semantic IE: You are here
[Pipeline: Source Selection and Preparation → Entity Recognition → Entity Disambiguation → Instance Extraction → Fact Extraction → Reasoning; running example: "singer Elvis"]