From Big Data to Good Data
Fabian M. Suchanek
Keynote at the WoLe 2012 Workshop
at ISWC 2012 in Boston, Mass., USA
Tim Berners-Lee at TED 2009:
Tim Berners-Lee encourages the audience to chant "Raw Data Now!"
2
Linked Open Data
• Represent data in RDF
• Re-use existing URIs
• Interlink with sameAs
• Serve data over accessible URIs
[Figure: example graph whose edges are labeled loves, plays, and sameAs]
6
xml/rdf
Linked Open Data 2007
7
Linked Open Data 2012
8
What's in there?
• Entities (>3b?)
• Predicates (>140m)
• Statements (>32b)
• Links (>500m)
9
[Wikipedia: "Linked Data"]
[gromgull: "(still) nothing clever", http://gromgull.net/blog/]
10
What's in there?
• Entities (>3b?)
• Predicates (>140m)
• Statements (>32b)
• Links (>500m)
11
URIs
The Pedantic Web Group crawled 13m RDF statements from the Semantic Web.
They contain 1.6m URIs. Out of these:
• 8% return an error code
• 37% return HTML documents
• 6% return mp3
• 49% return RDF documents :-)
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
15
Literals
How life should be:
:Elvis :born "1935-01-08"^^xsd:date
...
YYYY-MM-DD (ISO 8601 standard)
74% of xsd:date literals
How life is:
:Elvis :born "January 8th, 1935"^^xsd:date
:Elvis :died "never"^^xsd:date
26% of xsd:date literals
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
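Whether a literal respects the date format can be checked mechanically. The following is a minimal sketch, assuming a simplified reading of the xsd:date lexical form (YYYY-MM-DD with an optional timezone suffix, no negative or extended years). The name `is_valid_xsd_date` is made up for this illustration and does not come from the cited paper.

```python
import re
from datetime import date

# Simplified xsd:date pattern: YYYY-MM-DD plus an optional timezone suffix.
# Assumption: negative years and years with more than four digits are ignored.
XSD_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})(Z|[+-]\d{2}:\d{2})?")

def is_valid_xsd_date(lexical: str) -> bool:
    """Return True if the lexical form looks like a valid xsd:date."""
    match = XSD_DATE.fullmatch(lexical)
    if match is None:
        return False
    year, month, day = (int(g) for g in match.groups()[:3])
    try:
        date(year, month, day)  # rejects impossible dates such as 1935-02-30
    except ValueError:
        return False
    return True

# The three literals shown on the slide.
literals = ["1935-01-08", "January 8th, 1935", "never"]
results = {lit: is_valid_xsd_date(lit) for lit in literals}
```

Only the first literal passes; the other two are the kind of values that make up the malformed share.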
18
Disjoint Classes
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
OWL allows defining disjoint classes.
If two classes are disjoint, they can have no instance in common.
foaf:Person owl:disjointWith foaf:Document
19
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
Disjoint Classes
owl:notSoDisjointWith
In reality, disjoint classes often overlap.
foaf:Person
foaf:Document
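Overlaps between disjoint classes can be found by using the data to check itself. The sketch below assumes triples are plain (subject, predicate, object) tuples and looks only at explicit rdf:type statements, with no subclass reasoning. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

def disjointness_violations(triples, disjoint_pairs):
    """Return (entity, class_a, class_b) for entities typed with both classes."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
    violations = []
    for entity, classes in sorted(types.items()):
        for a, b in disjoint_pairs:
            if a in classes and b in classes:
                violations.append((entity, a, b))
    return violations

triples = [
    (":Elvis", "rdf:type", "foaf:Person"),
    (":Elvis", "rdf:type", "foaf:Document"),  # hypothetical faulty typing
    (":Priscilla", "rdf:type", "foaf:Person"),
]
violations = disjointness_violations(triples, [("foaf:Person", "foaf:Document")])
```

Here `violations` contains the single entity that is both a person and a document.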
20
What's in there?
• Entities (>3b?)
• Predicates (>140m)
• Statements (>32b)
• Links (>500m)
22
Duplicate Predicates
... has more than 1000 predicates in its crowdcrafted ontology.
DBpedia contains inverse relationships.
These are not always synchronized.
[Figure: entities connected by hasParent and hasChild; one of them "has no parent!"]
[Hoffart &al: "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia", AI Journal 2012]
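A synchronization gap between two inverse predicates is easy to detect. The following sketch assumes triples as plain tuples; the function name and the entities are hypothetical and only illustrate the idea.

```python
def missing_inverses(triples, predicate, inverse):
    """Return statements (s, predicate, o) that lack the matching (o, inverse, s)."""
    present = set(triples)
    return sorted(
        (s, p, o)
        for s, p, o in present
        if p == predicate and (o, inverse, s) not in present
    )

triples = [
    (":Anna", "hasChild", ":Bob"),
    (":Bob", "hasParent", ":Anna"),
    (":Bob", "hasChild", ":Carl"),  # hypothetical: :Carl has no hasParent statement
]
gaps = missing_inverses(triples, "hasChild", "hasParent")
```

`gaps` lists the hasChild statement whose child "has no parent".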
25
Hairsplitting Predicates
[Hoffart &al: "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia", AI Journal 2012]
Only 50% of predicates appear more than 500 times.
shoeNumber
distanceToCardiff
thirdDriverCountry
26
Orphan Predicates: Publish the Rubbish!
Many relationships on the Semantic Web appear only a few times.
The predicates in the Billion Triple Challenge 2009 are as follows:
• 17% occur only once
• 36% occur < 10 times
• 47% occur more often
• 600 properties start with "file://..."
[gromgull: "(still) nothing clever", http://gromgull.net/blog/]
In the DERI sample, 14% of triples use undefined properties
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
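Such a breakdown can be computed from any triple dump. The sketch below is one possible reading of the buckets, assuming that "< 10 times" means 2 to 9 occurrences (so that the three shares add up to 100%). The function name and sample triples are hypothetical.

```python
from collections import Counter

def predicate_usage_buckets(triples):
    """Share of distinct predicates used once, 2-9 times, and 10 or more times."""
    counts = Counter(p for _, p, _ in triples)
    buckets = {"once": 0, "fewer_than_10": 0, "more_often": 0}
    for n in counts.values():
        if n == 1:
            buckets["once"] += 1
        elif n < 10:
            buckets["fewer_than_10"] += 1
        else:
            buckets["more_often"] += 1
    total = len(counts)
    if total == 0:
        return {name: 0.0 for name in buckets}
    return {name: value / total for name, value in buckets.items()}

# Hypothetical sample: one frequent, one rare, and two orphan predicates.
triples = (
    [(f":e{i}", "rdf:type", ":Thing") for i in range(12)]
    + [(f":e{i}", "foaf:knows", ":e0") for i in range(3)]
    + [(":e1", "shoeNumber", "42"), (":e2", "distanceToCardiff", "100")]
)
shares = predicate_usage_buckets(triples)
```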
32
What's in there?
• Entities (>3b?)
• Predicates (>140m)
• Statements (>32b)
• Links (>500m)
33
Inverse Functional Relations
OWL allows defining inverse functional relations.
Every object of such a relation can have only one subject.
Obama foaf:mbox "barack@obama.org"
Romney foaf:mbox "mitt@romney.org"
34
Inverse Functional Relations
Obama foaf:mbox "null"
Romney foaf:mbox "null"
35
Inverse Functional Relations
If several subjects have the same object in an inverse functional relation, they are identical.
Robama foaf:mbox "null"
1000 cases in the sample crawl
=> 1000000 new identity links!
with 10 statements per entity
=> 10000000 new statements!
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
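The blow-up can be reproduced in a few lines. This is a sketch under stated assumptions: 1000 hypothetical subjects share the mailbox "null", identity links are counted as ordered pairs (including each entity paired with itself), and every entity carries 10 statements. Counting unordered distinct pairs instead would give 499,500 links.

```python
from collections import defaultdict

def merge_by_inverse_functional(triples, predicate):
    """Group subjects that share an object of an inverse functional predicate."""
    subjects_by_object = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            subjects_by_object[o].add(s)
    # Only objects shared by several subjects force an identification.
    return {o: subs for o, subs in subjects_by_object.items() if len(subs) > 1}

# Hypothetical crawl: 1000 people whose mailbox was exported as "null".
triples = [(f":person{i}", "foaf:mbox", "null") for i in range(1000)]
triples.append((":Obama", "foaf:mbox", "barack@obama.org"))

clusters = merge_by_inverse_functional(triples, "foaf:mbox")
n = len(clusters["null"])
identity_links = n * n               # ordered pairs, as in the round figure
new_statements = identity_links * 10  # each link transfers 10 statements
```

The correctly filled mailbox causes no merge; the placeholder value alone produces the million links.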
39
How can we get even more statements?
The Linked Open Numbers Project aims to add
the natural numbers as entities to the Semantic Web:
[Figure: the number entities n41, n42, n43, n44, n45, ... connected by lon:succ and lon:smallerThan edges, with rdf:type edges to evenNumber]
[Vrandecic et al: "Leveraging Non-Lexical Knowledge for the Linked Open Data Web"]
40
How can we get even more statements?
The Linked Open Numbers Project would multiply
the size of the Semantic Web by a factor of 3.5!
[Vrandecic et al: "Leveraging Non-Lexical Knowledge for the Linked Open Data Web"]
Linked
Open
Numbers
(This paper is not aiming to be serious)
41
How can we get more statements?
Historians at the Versailles Castle in France build
a biographical database for the Semantic Web.
The predicates are:
name, otherName, birthDate, deathDate, countryHist, countryNow, profession,
saisipar, datsai, cor, datcor, web, sources, citation, mc, inst,
theme, doc1, leg1, acces1, doc2, leg2, acces2, doc3, leg3,
acces3, doc4, leg4, acces4, liens, legurl, obs, auteur, not, accueil, search
70% of predicates are meta-information
46
How does it look on the Web?
Most frequent predicates of the Billion Triple Challenge 2009:
156,448,093  http://dbpedia.org/property/wikilink
143,293,758  rdf:type
 53,869,968  rdfs:seeAlso
 35,811,115  foaf:knows
 32,895,374  foaf:nick
 23,266,469  foaf:weblog
 22,326,441  dc:title
 19,565,730  akt:has-author
 19,157,120  sioc:links_to
 18,257,337  skos:subject
  5,443,549  Nepomuk Ontology
[gromgull: "(still) nothing clever", http://gromgull.net/blog/]
A large portion of triples on the Web is meta data
52
Example: YAGO
[http://yago-knowledge.org]
YAGO is a large semantic knowledge base. YAGO contains more than 10m
entities and 100m statements.
YAGO blah blah blah. Don't read this, listen to the talk! blub fasel
blah blah blub laber fasel suelz blah You are still reading this! blah
blub fasel blub blub blah
100m statements, each with
• 1 source statement
• 1 extraction technique statement
= 300m statements, of which 66% are meta-statements
(numbers are approximated)
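The arithmetic behind these figures, as a small sketch using the slide's rounded numbers (the variable names are made up for illustration):

```python
facts = 100_000_000        # statements about entities
meta_per_fact = 2          # one source statement + one extraction technique statement

total = facts * (1 + meta_per_fact)           # facts plus their meta-statements
meta_share = facts * meta_per_fact / total    # fraction that is meta-information
```

The exact share is two thirds, which the slide rounds to 66%.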
55
marriedTo
Wrong Statements: DBpedia
56
Wrong Statements: YAGO
[http://yago-knowledge.org]
YAGO contains more than 10m entities and 100m statements.
Manual evaluation confirmed that 95% of its statements are correct.
YAGO blah blah blah. Ah, you are reading this again! blub fasel blah
blah blub laber fasel suelz blah blah
If you're bored, go check out YAGO! blub fasel blub blub blah
Main advantage of YAGO: high quality data.
But: 5% of 100m statements is still 5m wrong statements
e.g.: 200 unintended gay marriages
62
What's in there?
• Entities (>3b?)
• Predicates (>140m)
• Statements (>32b)
• Links (>500m)
63
Linking Ontologies: PARIS
[Suchanek &al: "PARIS - Probabilistic Alignment of Relations, Instances, and Schema", VLDB 2012]
The PARIS system automatically aligns classes, instances,
and predicates across RDF ontologies.
[Figure: two ontologies with the classes singer and rockSinger, type edges to their instances, and the predicates knows and marriedTo]
• subClassOf: 84% accuracy on classes
• sameAs: 90% accuracy on instances
• subPropertyOf: 90% accuracy on predicates
67
Linking Ontologies: PARIS
[Suchanek &al: "PARIS - Probabilistic Alignment of Relations, Instances, and Schema", VLDB 2012]
Among the 16% of false class alignments, PARIS yields:
People convicted of Murder in Utah subClassOf SoccerPlayer
... because 12% of people convicted of Murder in Utah
are soccer players in YAGO/DBpedia.
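How an instance overlap can turn into a subclass score is sketched below. This is a simplified illustration, assuming the score is just the fraction of one class's instances that also belong to the other class; it is not the probabilistic model of the PARIS paper. The function name and the instance sets are hypothetical and chosen to reproduce the 12%.

```python
def subclass_score(instances_a, instances_b):
    """Fraction of instances of class A that are also instances of class B."""
    a = set(instances_a)
    if not a:
        return 0.0
    return len(a & set(instances_b)) / len(a)

# Hypothetical data: 25 matched instances of the first class, 3 of them
# are also typed as soccer players in the other ontology.
convicted_in_utah = {f":convict{i}" for i in range(25)}
soccer_players = {":convict0", ":convict1", ":convict2", ":some_other_player"}

score = subclass_score(convicted_in_utah, soccer_players)
```

With few instances per class, a handful of coincidences is enough to produce a noticeable score.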
68
Identity Crisis on the Web
On the Semantic Web, "sameAs" has developed 5 meanings:
• Identical to
• Same in different context
• Same but referentially opaque
• Represents
• Very similar to
Thanks to transitivity, errors multiply exponentially
[Halpin & Hayes: "When owl:sameAs isn’t the Same", LDOW, 2010]
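The effect of transitivity can be made concrete with a union-find closure over sameAs links. This is a minimal sketch; the entity names are hypothetical, and the single "very similar to" style link stands for any loose use of sameAs.

```python
def same_as_clusters(links):
    """Compute the equivalence classes implied by symmetric, transitive sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

links = [
    ("dbpedia:Paris", "yago:Paris"),
    ("yago:Paris", "geo:Paris"),
    ("dbpedia:Paris_Hilton", "imdb:Paris_Hilton"),
]
before = same_as_clusters(links)

# One loose link is enough to merge both groups into a single "entity".
after = same_as_clusters(links + [("geo:Paris", "imdb:Paris_Hilton")])
```

Before the loose link there are two clusters; afterwards all five URIs are declared identical.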
70
Identity Crisis on the Web
[Weikum: "For a few Triples More", keynote at ISWC 2011]
Methylenedioxymethamphetamine
is sameAs (or similarTo)
212 entities
72
Problem
The Semantic Web aims to be the 'clean version' of the HTML Web.
Most people publish Semantic Web data with good intentions,
but the pedantic details of semantics are of interest only to a few.
=> Much like the HTML Web,
the Semantic Web contains a lot of junk.
73
Big
Let us be aware that more data is not automatically good data.
Good
Wow, such a
big building!
"...leading to the creation of a global data space
containing billions of assertions - the Web of Data."
74
Big
Let us not just focus on size, but also on
• the readability,
• the truthfulness
• and the usefulness
of our data!
As researchers:
Good
Our task is not to produce large data, but to produce truthful data
75
Make it good! - Syntax
http://pedantic-web.org/
provide free syntax checkers
and introductions to common mistakes
[Hogan et al: "Weaving the Pedantic Web", LDOW, 2010]
76
Make it good! - Semantics
• use the data you have to check its own consistency
• use systematic quality assessments
• attach quality estimators to your data
• use human feedback
as, e.g., in Sig.ma [Tummarello et al: "Sig.ma: Live views on the Web of Data", JWS, 2010]
and NELL [Mitchell et al: "Populating the Semantic Web by Macro-Reading Internet Text", ISWC 2009]
[Suchanek &al: "YAGO - A large Ontology from Wikipedia & Wordnet", WWW 2007]
78
Make it good! - Provenance
• attach provenance designators to your data
Discussion of quality in [Bizer et al: "Linked Data - The Story So Far", IJSWIS]
Named Graphs [Carroll et al: "Named Graphs, Provenance and Trust",WWW, 2005]
http://yago-knowledge.org
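What attaching provenance and a quality estimate to a statement could look like is sketched below. All names (the `AnnotatedStatement` class, its fields, the example URLs and confidence values) are hypothetical; this is neither the YAGO format nor the Named Graphs proposal, only an illustration of the idea that consumers can filter on such annotations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedStatement:
    subject: str
    predicate: str
    obj: str
    source: str        # where the statement was extracted from
    technique: str     # how it was extracted
    confidence: float  # estimated probability that the statement is correct

def trusted(statements, threshold):
    """Keep only statements whose confidence reaches the threshold."""
    return [s for s in statements if s.confidence >= threshold]

statements = [
    AnnotatedStatement(":Elvis", ":born", "1935-01-08",
                       "http://example.org/page1", "infobox extractor", 0.98),
    AnnotatedStatement(":Elvis", ":died", "never",
                       "http://example.org/page2", "text pattern", 0.20),
]
kept = trusted(statements, 0.9)
```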
79
Good
Data
Now!
I would like to thank Gerhard Weikum, Gerard de Melo,
and Christoph Boehm for their ideas and inspirations.
Slides done with PowerLine,
the free SVG slide editor with Latex support.