Watermarking for Ontologies
Fabian M. Suchanek
David Gross-Amblard
Serge Abiteboul
(INRIA Saclay, France)
Outline
• Motivation
• Watermarking Ontologies
• Experiments
• Conclusion
RDF Ontologies
An RDF ontology can be seen as a graph of entities
[Figure: example graph in which an instance has type Singer (a subclassOf Person), was born in 1935, and sang "All shook up", ..., "Wooden Heart". Labels mark the classes, relations / properties, instances, literals, and facts.]
RDF Ontologies
Creating an ontology usually involves a major effort
[Figure: the example graph (type Singer, born 1935, sang "All shook up", ..., "Wooden Heart") next to a code snippet, public extract() { ... }]
Licenses of Ontologies
Ontologies are often available on the Web under a license.
[Figure: the example graph, published as ElvisPedia under a cc license]
Theft of Ontologies
Ontological data can be re-used without consent of the owner
[Figure: the cc-licensed ElvisPedia graph next to an EvilPedia graph with the same facts, marked with "?"]
Problem of Ownership Proof
Given a suspect ontology, we want to prove that the attacker took the data from the source ontology. This can be hard.
Owner: "EvilPedia = ElvisPedia!"
Attacker: "EvilPedia = Wikipedia!"
The attacker can claim that he collected the data from public sources.
Problem of Ownership Proof
Given a suspect ontology, we want to prove that the attacker took the data from the source ontology. This can be hard.
Owner: "ElvisPedia has a timestamp 2011-11-11T11:11:11"
Attacker: "EvilPedia does not, but I collected the data already in 1890!" (and he may even be right)
Even in the presence of a timestamp, the attacker can claim that he collected the data independently.
Major Ontologies
Many major ontologies are available under a license that requires giving credit when the data is re-published:
[Figure: major ontologies, among them YAGO and SUMO, with their licenses (CC-BY, CC-BY-ND, GFDL, GPL). All of these require attribution when re-used.]
We focus here on RDF ontologies with many instances
Watermarking by Modification
Existing approaches introduce small modifications into the data.
[Figure: the example graph (type Singer, sang "All shook up", ..., "Wooden Heart") with the birth year 1935 replaced by 1936]
If these modifications appear in the suspect ontology, they act as a proof of ownership.
Watermarking by Modification
Modifications have a number of disadvantages:
• Modifications may be acceptable for numeric data, but less so for categorical data
• Modifications can be spotted by comparison to another source
• Modifications decrease the precision of the ontology
[Figure: the example graph with 1936, Plumber, and "La ci darem la mano" shown as modified values next to 1935, Singer, and the original songs]
Subtractive Watermarking
We propose to remove a certain number of randomly chosen facts from the ontology.
[Figure: the example graph with the facts type Singer, born 1935, died 1977, and sang "All shook up", ..., "Wooden Heart"]
• Does not decrease precision
• Is OK under the Open World Assumption
• Does not disturb RDF semantics (all RDF rules are positive)
Publication
The owner keeps the removed facts secret and publishes only the other facts.
[Figure: the ontology split into two parts, labeled "publish this" and "keep this secret"]
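The removal and publication steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: facts are modeled as (subject, relation, object) tuples, and the function name `split_ontology`, the `seed` parameter, and the entity name used in the sample data are assumptions made for the example.

```python
import random

def split_ontology(facts, num_secret, seed):
    """Split facts into (published, secret) by removing random facts.

    facts: a set of (subject, relation, object) tuples (assumed encoding).
    num_secret: how many facts to remove and keep secret.
    seed: hypothetical owner-chosen seed that makes the choice repeatable.
    """
    rng = random.Random(seed)
    # Sort first so that the same seed always selects the same facts.
    secret = set(rng.sample(sorted(facts), num_secret))
    published = set(facts) - secret
    return published, secret

# Illustrative data in the spirit of the running example.
facts = {
    ("Elvis", "type", "Singer"),
    ("Elvis", "born", "1935"),
    ("Elvis", "died", "1977"),
    ("Elvis", "sang", "All shook up"),
    ("Elvis", "sang", "Wooden Heart"),
}
published, secret = split_ontology(facts, num_secret=1, seed=42)
```

Only `published` would be released. `secret` stays with the owner and is needed later to test a suspect ontology.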
Innocent Ontologies
An innocent ontology, which contains the same data by chance, will contain the same portion of secret and published facts:
[Figure: an innocent ontology covering 50% of our published data, 50% of our secret data, and other data]
Stolen Ontologies
A stolen ontology will contain only published facts:
[Figure: a stolen ontology covering 50% of our published data, 0% of our secret data, and other data]
How likely is it that this just happened by chance?
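The contrast between an innocent and a stolen ontology can be made concrete with a small sketch. It assumes facts are hashable items held in Python sets. The function name `coverage` and the integer stand-ins for facts are illustrative, not part of the talk.

```python
def coverage(suspect, published, secret):
    """Return the shares of published and secret facts found in suspect."""
    published_share = len(suspect & published) / len(published)
    secret_share = len(suspect & secret) / len(secret)
    return published_share, secret_share

# Integers stand in for facts (hypothetical data).
published = set(range(0, 8))   # facts the owner released
secret = set(range(8, 12))     # facts the owner removed and kept

innocent = {0, 1, 2, 3, 8, 9, 100, 101}  # overlaps both parts alike
stolen = {0, 1, 2, 3, 100, 101}          # overlaps published facts only

innocent_shares = coverage(innocent, published, secret)
stolen_shares = coverage(stolen, published, secret)
```

A suspect ontology that covers a large share of the published facts but none of the secret ones is the pattern the slide describes as stolen.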
The Chance of Chance
[Figure: a suspect ontology covering 50% of our published data and 0% of our secret data]
Our secret facts are chosen randomly in the original ontology.
The probability that all the secret facts end up in the non-intersection just by chance can be bounded by a statistical test.
The Chance of Chance
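The probability that the suspect ontology has no secret facts by chance is less than the prob. of error (the significance) if the # secret facts is large enough.
[Formula: a condition on the # secret facts in terms of the total # of facts, the % of facts copied, and the prob. of error]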
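To give a feeling for how such a bound behaves, here is a simplified sketch. It is not the statistical test used in the talk, so its results need not match the numbers reported on the later slides. It assumes only that the secret facts are drawn uniformly at random from the original ontology and that an innocent suspect's overlap with the original is independent of that draw, which gives a hypergeometric probability. The function names are hypothetical.

```python
def prob_no_secret_fact(total, overlap, num_secret):
    """P(no secret fact lies in the overlap) under the assumed model.

    total: number of facts in the original ontology.
    overlap: number of original facts that the suspect also contains.
    num_secret: number of facts removed uniformly at random.
    """
    p = 1.0
    for i in range(num_secret):
        # The (i+1)-th secret fact must also miss the overlap.
        p *= max(total - overlap - i, 0) / (total - i)
    return p

def min_secret_facts(total, overlap, alpha):
    """Smallest number of secret facts with chance probability < alpha."""
    p = 1.0
    for n in range(1, total + 1):
        p *= max(total - overlap - (n - 1), 0) / (total - (n - 1))
        if p < alpha:
            return n
    return None  # unreachable target, e.g. when overlap == 0
```

For example, `min_secret_facts(100, 50, 0.3)` asks how many of 100 facts must be removed so that a suspect sharing 50 of them misses every secret fact by chance with probability below 0.3.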
Identifying Theft
If we delete enough facts, then the chance that an innocent ontology contains some of our secret facts is 99.9999%.
If the ontology contains no secret facts, it was stolen with a probability of 99.99999%.
What can the Attacker try?
The attacker can steal only a very small portion of the source ontology.
If we remove more facts, we can still guarantee detection.
What can the Attacker try?
The attacker can merge the data with other ontologies to fill up the removed facts.
[Figure: data from other ontologies fills up secret facts]
If we remove more facts, we can detect theft even if some facts have been reconstructed.
What can the Attacker try?
The attacker can rename entities and relations.
[Figure: the fact Elvis born 1935, with the relation born renamed to birthDate and the entity Elvis renamed to Obama]
Our detection algorithm does not depend on relation names.
Renaming entities beyond recognizability makes the ontology unusable.
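The talk does not spell out how detection avoids relation names. One conceivable way, shown here purely as an assumption, is to compare facts by their (subject, object) pairs and ignore the relation in between. The helper names are hypothetical.

```python
def entity_pairs(facts):
    """Drop the relation name from each fact, keeping (subject, object)."""
    return {(s, o) for s, _, o in facts}

def count_secret_hits(suspect, secret):
    """Count secret (subject, object) pairs that occur in suspect."""
    return len(entity_pairs(secret) & entity_pairs(suspect))

secret = {("Elvis", "born", "1935")}

# The relation was renamed, but the entity pair is unchanged.
renamed_relation = {("Elvis", "birthDate", "1935")}
# The entity itself was renamed, so the pair no longer matches.
renamed_entity = {("Obama", "birthDate", "1935")}

hits_relation = count_secret_hits(renamed_relation, secret)
hits_entity = count_secret_hits(renamed_entity, secret)
```

The sketch treats two different relations between the same pair of entities as one match, which is a simplification. Renamed entities would first have to be mapped back to the originals, and that is only possible as long as they remain recognizable.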
What can the Attacker try?
The attacker can try to guess which facts were removed.
[Figure: an instance with born 1935 and type singer, and two further facts marked "?"]
Analyses on real-world ontologies show that 90% of instances lack a relation that others in their class have.
=> nearly every entity might have had facts removed.
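The kind of analysis mentioned above can be sketched as follows. This is an illustration under assumptions, not the authors' measurement code: class membership is read from a relation named "type", the function name is hypothetical, and the sample facts (including the instance "Singer2") are made up for the example.

```python
from collections import defaultdict

def share_with_missing_relation(facts):
    """Share of instances lacking a relation that a classmate has."""
    classes = defaultdict(set)    # class -> its instances
    relations = defaultdict(set)  # instance -> relations it uses
    for s, r, o in facts:
        if r == "type":
            classes[o].add(s)
        else:
            relations[s].add(r)
    instances = set().union(*classes.values()) if classes else set()
    lacking = set()
    for members in classes.values():
        # All relations used by at least one instance of this class.
        used = set().union(*(relations[m] for m in members))
        lacking |= {m for m in members if relations[m] != used}
    return len(lacking) / len(instances) if instances else 0.0

facts = {
    ("Elvis", "type", "Singer"),
    ("Elvis", "born", "1935"),
    ("Elvis", "died", "1977"),
    ("Singer2", "type", "Singer"),   # hypothetical second instance
    ("Singer2", "born", "1940"),     # hypothetical value
}
share = share_with_missing_relation(facts)
```

Here "Singer2" has no died fact although another singer does, so an attacker cannot tell whether that fact was removed on purpose or was never known.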
How many facts to remove?
[Table: four ontologies, one per column (YAGO and three others)]
total # of facts                                             19m    18m    6k     35m
# facts to remove:
  if 50% is stolen                                           24     24     24     24
  if 5% is stolen                                            456    456    424    456
  if 50% is stolen and 10% of removed facts reconstructed    37     37     37     37
  if certain themes are stolen (albums, people, cities,...)  ca. 1000
How many facts to remove?
What if YAGO steals from DBpedia?
Natural overlap:
• 50% of instances
• 8% of facts
[Figure: the overlap with YAGO, with a stolen theme (e.g., villages) highlighted]
By removing around 1500 facts in total (out of 19m), we can still trace theft of the major themes.
Conclusion
We have shown an approach that can
• prove the ownership of an ontology
• maintain the precision of the ontology
• withstand different types of attacks
But most importantly: Elvis is safe!
Special thanks to Chris Welty for moving my talk
Particular thanks to Mauro Sozio for the visual model of the statistical test
Slides done with PowerLine, the free graphical SVG slide editor with LaTeX support