CC-BY
Fabian M. Suchanek
Evaluation
Def: Information Extraction

Information Extraction (IE) is the task of extracting structured data
(such as entities, facts, or relations) from natural language text.

In The Simpsons, Homer Simpson is the
father of Bart Simpson and Lisa Simpson.
The M above his ear is for Matt Groening.

=> Homer Simpson, Bart Simpson, Lisa Simpson
Detect members of the Simpsons

In The Simpsons, Homer Simpson is the
father of Bart Simpson and Lisa Simpson.
The M above his ear is for Matt Groening.

1. [A-Z][a-z]+ Simpson              4 matches (1 wrong)
2. [A-Z][a-z]+ [A-Z][a-z]+          5 matches (2 wrong)
3. Homer Simpson                    1 match
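The three regex attempts can be reproduced with a short script (a sketch using Python's re module; the corpus string is the slide's example sentence):

```python
import re

corpus = ("In The Simpsons, Homer Simpson is the father of Bart Simpson "
          "and Lisa Simpson. The M above his ear is for Matt Groening.")

# Attempt 1: also matches "The Simpson" inside "The Simpsons"
print(re.findall(r"[A-Z][a-z]+ Simpson", corpus))      # 4 matches (1 wrong)

# Attempt 2: also matches non-Simpsons like "Matt Groening"
print(re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", corpus))  # 5 matches (2 wrong)

# Attempt 3: perfectly precise, but finds only one of the three members
print(re.findall(r"Homer Simpson", corpus))            # 1 match
```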
Def: Gold Standard

The gold standard (also: ground truth) for an information extraction
task is the set of desired results of the task on a given corpus.

Task: Detect Simpson members
Corpus: In The Simpsons, Homer Simpson is the
        father of Bart Simpson and Lisa Simpson.
        The M above his ear is for Matt Groening.
Gold Standard: {Homer Simpson, Bart Simpson, Lisa Simpson}
Def: Precision

The precision of an information extraction algorithm is the ratio of
its outputs that are in the respective gold standard.

Gold Standard: {Homer, Bart, Lisa, Marge}
Output: {Homer, Bart, Groening}   (Groening is not in the gold standard)

=> Precision: 2/3 = 66%
Def: Recall

The recall (also: sensitivity, true positive rate, hit rate) of an
information extraction algorithm is the ratio of the gold standard
that is output.

Gold Standard: {Homer, Bart, Lisa, Marge}   (Lisa and Marge are not output)
Output: {Homer, Bart, Groening}

=> Recall: 2/4 = 50%
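Both measures reduce to simple set operations; a minimal sketch with the slide's example (the first names stand in for the entities):

```python
gold   = {"Homer", "Bart", "Lisa", "Marge"}
output = {"Homer", "Bart", "Groening"}

correct = output & gold                 # {"Homer", "Bart"}

precision = len(correct) / len(output)  # 2/3, about 66%
recall    = len(correct) / len(gold)    # 2/4 = 50%
```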
Precision-Recall Tradeoff

It is very hard to get both good precision and good recall. Algorithms
usually allow varying one at the expense of the other (e.g., by
choosing different threshold values). This usually yields:

[Plot: Precision vs. Recall. High precision, low recall: very good
results, but too few of them. High recall, low precision: all good
results, but many wrong ones, too. What we want is both.]
Def: F1

To trade off precision and recall, we could use the average:

Gold Standard: {Homer, Bart, Lisa, Snowball_4, ..., Snowball_100}
Output: {Homer}

Precision: 1/1 = 100%
Recall: 1/100 = 1%
Average: (100% + 1%)/2 = 50%

Outputting just a single result already gives a score of 50%!

The F1 measure is the harmonic mean of precision and recall:

F1 = 2 × precision × recall / (precision + recall)
   = 2 × 100% × 1% / (100% + 1%) ≈ 2%
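The harmonic mean punishes the degenerate single-output strategy that the plain average rewards; a small sketch:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.0, 0.01    # Output = {Homer}: 100% precision, 1% recall

average = (precision + recall) / 2    # 0.505: deceptively high
score   = f1(precision, recall)       # about 0.02: exposes the poor recall
```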
Task: Precision & Recall

What is the algorithm output, the gold standard, the precision,
and the recall in the following cases?

1. Nostradamus predicts a trip to the moon for every century
   from the 15th to the 20th inclusive.
2. The weather forecast predicts that the next 3 days will be sunny.
   It does not say anything about the 2 days that follow. In reality,
   it is sunny during all 5 days.
3. On Elvis Radio™, 90% of the songs are by Elvis. An algorithm learns
   to detect Elvis songs. Out of 100 songs on Elvis Radio, the algorithm
   says that 20 are by Elvis (and says nothing about the other 80).
   Out of these 20 songs, 15 were by Elvis and 5 were not.
4. How can you improve the algorithm?
Def: Problem of imbalanced classes

Population:    {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output:        {Snowball_1, ..., Snowball_99, Snowball_100}

Precision: 99/100 = 99%
Recall:    99/99 = 100%

If there are very few negatives, just outputting all elements gives
great scores.

The negatives are the elements of the population that are not in the
gold standard. The problem of imbalanced classes appears when only
very few of the items of the population are not in the gold standard:
an approach that outputs the entire population has a very high
precision and a perfect recall.
Def: Confusion Matrix

The confusion matrix for the output of an algorithm looks as follows:

                          Gold standard
                      Positive          Negative
Output   Positive   True Positives    False Positives   ∑ = Predicted Positives
         Negative   False Negatives   True Negatives    ∑ = Predicted Negatives
                    ∑ = (Gold)        ∑ = (Gold)
                        Positives         Negatives

The “Negative” row holds the items of the population that are not
output; the “Negative” column holds the items of the population that
are not in the gold standard. A True Negative is “Negative” because it
was not output, and “True” because that was correct.

Example:

Population:    {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output:        {Snowball_1, ..., Snowball_99, Snowball_100}

                          Gold standard
                      Positive   Negative
Output   Positive        99          1     (1 item was output as positive,
         Negative         0          0      but was negative in the gold standard)

Precision = true positives / predicted positives = 99/100 = 99%
Recall    = true positives / gold positives      = 99/99  = 100%
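The four cells of the matrix follow directly from the population, gold standard, and output as sets; a sketch of the Snowball example:

```python
population = {f"Snowball_{i}" for i in range(1, 101)}      # 100 items
gold       = {f"Snowball_{i}" for i in range(1, 100)}      # Snowball_1..99
output     = population                                    # output everything

tp = len(output & gold)               # 99 true positives
fp = len(output - gold)               # 1 false positive (Snowball_100)
fn = len(gold - output)               # 0 false negatives
tn = len(population - output - gold)  # 0 true negatives

precision = tp / (tp + fp)            # 99/100 = 99%
recall    = tp / (tp + fn)            # 99/99 = 100%
```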
Confusion with confusion matrices

A confusion matrix does not always make sense in an information
extraction scenario:

Population:    {H, Ho, Hom, ..., o, om, ome, ..., r Sim, r Simps, ...}
               (all substrings of the corpus)
Gold Standard: {Homer}
Output:        {Homer}

                          Gold standard
                      Positive      Negative
Output   Positive        1              0
         Negative        0         39462440205

A confusion matrix makes sense only when the population is limited
(e.g., in classification tasks)!
Our problem

Population:    {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output:        {Snowball_1, ..., Snowball_99, Snowball_100}

                          Gold standard
                      Positive   Negative
Output   Positive        99          1
         Negative         0          0

The problem is that the algorithm did not catch the negatives:
it has a “low recall” on the negatives.
Def: True Negative Rate & FPR

Population:    {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output:        {Snowball_1, ..., Snowball_99, Snowball_100}

The true negative rate (also: TNR, specificity, selectivity) is the
ratio of negatives that are output as negatives (= the recall on the
negatives):

TNR = true negatives / gold negatives = 0/1 = 0%

The false positive rate (also: FPR, fall-out) is 1 - TNR.
TNR & Precision

Population:    {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output:        {Snowball_1, ..., Snowball_99, Snowball_100}

Precision: 99/100 = 99%    Recall: 99/99 = 100%    TNR: 0/1 = 0%

TNR and precision both measure the “correctness” of the output.

Precision:
• measures wrt. the output
• suffers from imbalanced classes
• works if the population is infinite
  (e.g., the set of all extractable entities)

TNR:
• measures wrt. the population
• guards against imbalance
• works if the population is limited
  (e.g., in classification)
Informedness

The informedness (also: Youden’s J statistic, Youden’s index) combines
TNR and recall as follows:

informedness = recall + TNR - 1 = recall - FPR
             = 100% + 0% - 100% = 0%

(It’s kind of the F1 measure for the case of a limited population.)

Population:    {Snowball_1, ..., Snowball_99, Snowball_100}
Gold Standard: {Snowball_1, ..., Snowball_99}
Output:        {Snowball_1, ..., Snowball_99, Snowball_100}
Precision: 99/100 = 99%    Recall: 99/99 = 100%    TNR: 0/1 = 0%
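TNR, FPR, and informedness all follow from the same confusion-matrix counts; a sketch continuing the Snowball example:

```python
tp, fp, fn, tn = 99, 1, 0, 0          # counts from the Snowball example

recall = tp / (tp + fn)               # 99/99 = 100%
tnr    = tn / (tn + fp)               # 0/1 = 0%: no negative was caught
fpr    = 1 - tnr                      # 100%

informedness = recall + tnr - 1       # 0: no better than outputting everything
```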
Def: ROC

The ROC (receiver operating characteristic) curve plots recall against
the FPR for different thresholds of the algorithm. It guards against
imbalanced classes, and is applicable when the population is finite.

[Plot: Recall vs. False Positive Rate (FPR). Near the origin: no bad
results, but also no good ones. Near the top right: many good results,
but also many bad ones. What we want is the top left. The diagonal is
what an algorithm would get if it randomly chose the positive items
from the population; the height of the curve above the diagonal is the
informedness.]

If an algorithm has no threshold to tune, we can always simulate a
curve: by randomly adding items from the population to the output, and
by randomly removing items from the output.
Def: AUC

The AUC (area under curve) is the area under the ROC curve.
It corresponds to the probability that the classifier ranks a random
positive item over a random negative item. (It’s kind of the F1 for a
limited population and a varying threshold.)
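The ranking interpretation of AUC can be computed directly, without plotting the curve; a sketch in plain Python (the scores and labels are made-up illustration data):

```python
def auc(scored):
    """AUC as the probability that a random positive item is scored
    higher than a random negative item (ties count half)."""
    pos = [score for score, positive in scored if positive]
    neg = [score for score, positive in scored if not positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# (score, is_positive) pairs: 5 of the 6 positive/negative pairs
# are ranked correctly, so the AUC is 5/6.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
print(auc(scored))
```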
Caution with AUC

The following algorithm has a great AUC value, but it is still not great.
How not to design an IE algorithm

Task: Find Simpson pets
Corpus: [...]

Algorithm: Regex "Snowball I*"
Output: {Snowball I, Snowball II}

Algorithm: Regex "Snowball (I|V)*"
Output: {Snowball I, Snowball II, Snowball IV}

Is this algorithm good?
How to design an IE algorithm

Task: Find Simpson pets

1. Take only a sample of the corpus, and consider only that sample:

   Lisa decides to play music on her saxophone for Coltrane, but the
   noise frightens him and he commits suicide. As Gil swerves to avoid
   hitting Snowball V, his car hits a tree and bursts into flames.
   Since the cat is unhurt, Lisa takes it as a sign of good luck and
   adopts her. [...]

2. Manually make a gold standard for the sample:
   Gold Standard: {Coltrane, Snowball I, ...}

3. Run the algorithm on the sample corpus:
   Output: {...}

4. Let an evaluator compare the output to the gold standard and
   compute precision and recall.

5. Improve the algorithm, and repeat.
Evaluation on a Sample

Corpus:              Sample:
1: ...A...B...       3: ..D.....E...
2: ...C....          4: .....H.....
3: ..D.....E...
4: .....H.....       Gold Standard (on the sample):
5: ..I...J...K...    {D, E, H}

Algorithm output:    Output on the sample:
{1: A, Z             {
 2: C                 3: D, E, K
 3: D, E, K           4: L
 4: L                }
 5: I, K, X}

Precision: 2/4
Recall: 2/3

A, B, etc. can be entities, but also facts.
Document 5 is not considered for computing recall!
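This sample-based evaluation can be sketched as follows (a hypothetical data layout: per-document output lists, and a gold standard of (document, item) pairs, as on the slide):

```python
def evaluate_on_sample(output, gold, sample_docs):
    """Precision/recall of per-document outputs, restricted to a sample."""
    predicted = {(doc, item) for doc, items in output.items()
                 if doc in sample_docs for item in items}
    correct = predicted & gold
    return len(correct) / len(predicted), len(correct) / len(gold)

output = {1: ["A", "Z"], 2: ["C"], 3: ["D", "E", "K"],
          4: ["L"], 5: ["I", "K", "X"]}
gold   = {(3, "D"), (3, "E"), (4, "H")}       # gold standard on the sample

# Document 5 is ignored: it is not in the sample.
precision, recall = evaluate_on_sample(output, gold, sample_docs={3, 4})
# precision = 2/4, recall = 2/3
```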
Simple case: 1 target per document

Corpus:          Sample:
A: ...A’...      C: ...C’...
B: ...B’...      D: ...D’...
C: ...C’...      E: ...E’...
D: ...D’...
E: ...E’...      Gold Standard on the sample:
                 {C: C’, D: D’, E: E’}

Algorithm output:    Output on the sample:
{A: A’               {C: Z,
 B: X                 D: D’}
 C: Z
 D: D’}              Precision: 1/2    Recall: 1/3

If the algorithm also produces an (incorrect) answer for document E:

Algorithm output:    Output on the sample:
{A: A’               {C: Z,
 B: X                 D: D’,
 C: Z                 E: K}
 D: D’
 E: K}               Precision: 1/3    Recall: 1/3

If the algorithm produces one output per input, then precision
equals recall.
Summary

Every information extraction algorithm has to be evaluated to ensure
it does the right thing.
• To evaluate, we need a gold standard (usually constructed manually)
• Precision is the ratio of extracted items that are in the gold standard
• Recall is the ratio of gold standard items that are extracted
• Usually, algorithms have either a high recall or a high precision
• F1 is the harmonic mean of precision and recall