Wrapper Induction
Fabian M. Suchanek
Semantic IE (overview figure, "You are here" marks the current step): Source Selection and Preparation, Entity Recognition, Entity Disambiguation, Instance Extraction, Fact Extraction, Reasoning. Figure labels: "singer", "singer Elvis".
Generated Web pages
Web page generation is the process of producing several similar
Web pages from a KB.
Example: Generated Web pages
KB: "Shrek" has author WilliamSteig and price 10 USD.
Generated page: "Shrek" by W. Steig. Only 10 USD! Buy it!
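A minimal sketch of page generation, assuming a KB stored as a list of Python dicts and a hypothetical HTML template (neither is from the slides). Every page is produced by filling the same template, so the information ends up in the same place on each page.

```python
from string import Template

# Hypothetical KB: one record per book. The second record is invented
# purely for illustration.
KB = [
    {"title": "Shrek", "author": "W. Steig", "price": "10 USD"},
    {"title": "Example Book", "author": "A. Author", "price": "5 USD"},
]

# Hypothetical page template shared by all generated pages.
PAGE_TEMPLATE = Template(
    "<html><body>"
    "<h1>$title</h1>"
    "<p>by $author</p>"
    "<p>only $price!</p>"
    "</body></html>"
)


def generate_pages(kb):
    """Fill the same template once per KB record."""
    return [PAGE_TEMPLATE.substitute(record) for record in kb]


pages = generate_pages(KB)
for page in pages:
    print(page)
```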
IE aims to reconstruct the KB
Web page: "Shrek" by W. Steig. Only 10 USD! Buy it!
Reconstructed KB: "Shrek" has writer W.Steig and price 10 USD.
Def: Wrapper
A wrapper for a set of pages generated from the same KB is a function
that extracts strings from such a page.
(Technically, it is the inverse of the function that generated the page.
The strings still have to be disambiguated and related to each other to yield facts.
Different applications use different, more specific definitions of the "strings".)
[Kushmerick: Wrapper Induction]
Example of extracted strings: "Shrek...", "90 min", "7.9"
Information is always in the same place
If we understand one generated page, then we understand the other pages generated from the same KB.
Def: XPath
XPath is a formal language for selecting nodes in an XML document. [Wikipedia/XPath]
/ identifies the root node
K/T[i] identifies the i-th child with tag T of the node identified by K
K/T is K/T[1] if K has exactly one T child (in general, K/T selects all T children of K)
(XPath has many more expressions, allowing the union of node sets, the selection by attributes, etc.)
<html>
<body>
<h1>Aloha from Hawaii</h1>
<p>This is a really great movie</p>
<p>Stars:<i>Elvis Presley</i></p>
</body>
</html>
Example: /html/body/p[2]/i identifies the node <i>Elvis Presley</i>.
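The path expressions above can be tried out with a short script. This is a sketch that assumes the page is well-formed XML and uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The helper `select` is a hypothetical name. It strips the leading root step because ElementTree evaluates paths relative to the root element.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
<body>
<h1>Aloha from Hawaii</h1>
<p>This is a really great movie</p>
<p>Stars:<i>Elvis Presley</i></p>
</body>
</html>"""


def select(page, xpath):
    """Return the first element matching a simple absolute path, or None."""
    root = ET.fromstring(page)
    steps = [s for s in xpath.split("/") if s]
    if not steps:
        return root  # "/" is treated here as the root element
    if steps[0].split("[")[0] != root.tag:
        return None  # the first step must name the root element
    if len(steps) == 1:
        return root
    # ElementTree paths are relative to the root element.
    return root.find("/".join(steps[1:]))


print(select(PAGE, "/html/body/p[2]/i").text)
print(select(PAGE, "/html/body/h1").text)
```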
Task: XPath
<html>
<body>
<b>Shrek</b>
<ul>
<li>Creator: <b>W. Steig</b>
<li>Duration: <i>84 min</i>
</ul>
Write XPath expressions that identify nodes whose text is "Shrek",
"W. Steig", and "84 min".
Def: Wrapper induction
Wrapper induction is the process of generating a wrapper from a set of Web pages with strings to be extracted.
Web page + strings to be extracted ("Shrek", "7.9") = wrapper (/html/body/h1, /html/body/p[2]/i)
Wrapper induction
Wrapper induction requires as input Web pages together with the strings to
be extracted. These strings can come, e.g.,
• from a KB
• from manual extraction
• from manual annotation in a GUI
Figure: KB fact hasTitle(ShrekMovie, "Shrek") + Web page = string "Shrek"; string "Shrek" + Web page = wrapper
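A minimal sketch of wrapper induction, assuming a wrapper is a list of XPath expressions and the page is well-formed XML. The function `induce_wrapper` is a hypothetical name and learns from a single page only. A real system would generalize over several pages. For every element whose text is one of the strings to be extracted, it records the path from the root, with explicit child indices.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
<body>
<h1>Aloha from Hawaii</h1>
<p>This is a really great movie</p>
<p>Stars:<i>Elvis Presley</i></p>
</body>
</html>"""


def induce_wrapper(page, targets):
    """Return XPath expressions for the elements whose text is a target string."""
    root = ET.fromstring(page)
    wrapper = []

    def walk(node, path):
        if (node.text or "").strip() in targets:
            wrapper.append(path)
        counts = {}  # number of children seen so far, per tag
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    return wrapper


wrapper = induce_wrapper(PAGE, {"Aloha from Hawaii", "Elvis Presley"})
print(wrapper)
```

The resulting paths carry an index on every step, so /html/body[1]/h1[1] denotes the same node as /html/body/h1.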
Def: Wrapper Application
Wrapper application is the process of applying a wrapper to a Web page in order to extract its strings.
Web page + wrapper (/html/body/h1, /html/body/p[2]/i) = strings ("Elvis", "11")
Strings ("Elvis", "11") + disambiguation + relation = facts: hasActor(e42, ElvisPresley), hasRating(e42, "11.0")
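A sketch of wrapper application followed by fact construction. The page, the lookup table `ENTITIES`, and the helper names are hypothetical. The wrapper paths, the strings, and the resulting facts are those of the example. Real disambiguation is far more involved than a table lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical page on which the wrapper from the example yields "Elvis" and "11".
PAGE = (
    "<html><body><h1>Elvis</h1>"
    "<p>A concert film</p>"
    "<p>Rating: <i>11</i></p></body></html>"
)

WRAPPER = ["/html/body/h1", "/html/body/p[2]/i"]

# Hypothetical disambiguation table: string -> entity.
ENTITIES = {"Elvis": "ElvisPresley"}


def apply_wrapper(page, wrapper):
    """Extract one string per XPath expression of the wrapper."""
    root = ET.fromstring(page)
    prefix = "/" + root.tag + "/"
    strings = []
    for xpath in wrapper:
        # ElementTree paths are relative to the root element, so drop "/html/".
        node = root.find(xpath[len(prefix):])
        strings.append(node.text if node is not None else None)
    return strings


def to_facts(subject, strings):
    """Disambiguate the strings and put them in relation to the subject."""
    actor, rating = strings
    return [
        ("hasActor", subject, ENTITIES[actor]),
        ("hasRating", subject, str(float(rating))),
    ]


strings = apply_wrapper(PAGE, WRAPPER)
facts = to_facts("e42", strings)
print(strings)
print(facts)
```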
Detail Pages & List Pages
Wrappers can be learned across several detail pages.
Wrappers can also be learned across items in a list.
Data may exhibit structure
Dronkeys:
<ul>
<li>Eclair: female
<li>Bananas: flexible
</ul>
Shrek's kids:
<ul>
<li>Farkle: male
<li>Fergus: male
</ul>
family: tuple (
    name: string,
    children: set (
        child: tuple (name: string,
                      gender: string)))
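The nested type can be written down as Python dataclasses. This is only one possible encoding, and the class names `Child` and `Family` are assumptions. The instances use the data from the two lists above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Child:
    """child: tuple (name: string, gender: string)"""
    name: str
    gender: str


@dataclass(frozen=True)
class Family:
    """family: tuple (name: string, children: set (child))"""
    name: str
    children: FrozenSet[Child]


dronkeys = Family(
    "Dronkeys",
    frozenset({Child("Eclair", "female"), Child("Bananas", "flexible")}),
)
shreks_kids = Family(
    "Shrek's kids",
    frozenset({Child("Farkle", "male"), Child("Fergus", "male")}),
)
print(sorted(child.name for child in dronkeys.children))
```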
ROADRUNNER: Learn types
Page 1:
<ul>
<li>Peanut
</ul>
Page 2:
<ul>
<li>Charles
</ul>
Wrapper:
<ul>
<li>[FIELD]
</ul>
ROADRUNNER is a system that can learn the Web page structure by comparing pages from the same source: text that differs between the pages becomes a [FIELD].
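A strongly simplified sketch of the comparison step, not the actual ROADRUNNER algorithm: both pages are split into tags and text chunks, and wherever the two pages show different text at the same position, the wrapper gets a [FIELD]. The function names are hypothetical, and pages with a different number of tokens are deliberately not handled here.

```python
import re


def tokenize(page):
    """Split a page into tags and text chunks, dropping whitespace-only text."""
    tokens = re.findall(r"<[^>]+>|[^<]+", page)
    return [t.strip() for t in tokens if t.strip()]


def align(page1, page2):
    """Generalize two pages of equal token length into one wrapper."""
    tokens1, tokens2 = tokenize(page1), tokenize(page2)
    if len(tokens1) != len(tokens2):
        raise ValueError("this sketch only handles pages with equally many tokens")
    wrapper = []
    for a, b in zip(tokens1, tokens2):
        if a == b:
            wrapper.append(a)  # identical token: part of the template
        elif not a.startswith("<") and not b.startswith("<"):
            wrapper.append("[FIELD]")  # differing text: a data field
        else:
            raise ValueError("tag mismatch: not handled in this sketch")
    return wrapper


PAGE1 = "<ul>\n<li>Peanut\n</ul>"
PAGE2 = "<ul>\n<li>Charles\n</ul>"
print(align(PAGE1, PAGE2))
```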
ROADRUNNER: Learn types
Page 1:
<ul>
<li>Peanut
</ul>
Page 2:
<ul>
<li>Charles
<li>Anne
</ul>
Wrapper:
<ul>
(<li>[FIELD])+
</ul>
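ROADRUNNER is a system that can learn the Web page structure. When the number of list items differs between pages, it generalizes the repeated item to an iterator, here (<li>[FIELD])+.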
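The learned wrapper with its iterator can be read as a regular expression. The sketch below does not implement ROADRUNNER's induction. It only applies the wrapper, translated by hand into Python's `re` syntax, to both pages. The names `WRAPPER`, `ITEM`, and `apply_wrapper` are hypothetical.

```python
import re

# Hand translation of:  <ul> (<li>[FIELD])+ </ul>
WRAPPER = re.compile(r"\s*<ul>\s*(?:<li>[^<]+)+</ul>\s*")
# One list item; the captured group is the [FIELD].
ITEM = re.compile(r"<li>\s*([^<]+?)\s*(?=<)")


def apply_wrapper(page):
    """Return the field values, or None if the page does not fit the wrapper."""
    if WRAPPER.fullmatch(page) is None:
        return None
    return ITEM.findall(page)


PAGE1 = "<ul>\n<li>Peanut\n</ul>"
PAGE2 = "<ul>\n<li>Charles\n<li>Anne\n</ul>"
print(apply_wrapper(PAGE1))
print(apply_wrapper(PAGE2))
```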
Data might not be separated by tags
<html>
<body>
The Dronkeys<br>
<p>
Eclair: female<br>
Bananas: male<br>
Peanut: unknown<br>
<hr>
For the WIEN system, a wrapper is of the form
<head, tail, left_1, right_1, left_2, right_2, ...>
Example: < "<p>", "<hr>", "\n", ":", " ", "<br>", ... >
Running WIEN
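1. Scroll to the head delimiter
2. Scroll to the next left delimiter
3. Extract until the matching right delimiter
4. Repeat steps 2 and 3 for every field and every row until the tail delimiter is reached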
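The steps above can be sketched as follows. This is a simplified reading of a head/tail/left/right wrapper, not WIEN's actual code. The function name `wien_extract` is hypothetical, and the stopping rule (stop as soon as a delimiter no longer occurs before the tail) is an assumption.

```python
PAGE = """<html>
<body>
The Dronkeys<br>
<p>
Eclair: female<br>
Bananas: male<br>
Peanut: unknown<br>
<hr>"""


def wien_extract(page, head, tail, delimiters):
    """Extract rows of fields; delimiters is a list of (left, right) pairs."""
    # 1. Scroll to head; only the text between head and tail is considered.
    start = page.find(head)
    if start < 0:
        return []
    start += len(head)
    end = page.find(tail, start)
    body = page[start:end] if end >= 0 else page[start:]

    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            # 2. Scroll to left.
            begin = body.find(left, pos)
            if begin < 0:
                return rows  # 4. no more fields before the tail
            begin += len(left)
            # 3. Extract until right.
            stop = body.find(right, begin)
            if stop < 0:
                return rows
            row.append(body[begin:stop])
            pos = stop + len(right)
        rows.append(tuple(row))


rows = wien_extract(PAGE, "<p>", "<hr>", [("\n", ":"), (" ", "<br>")])
print(rows)
```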
Delimiters may differ
<html>
<body>
The Dronkeys<br>
<p>
Eclair: female<br>
Bananas (male)<br>
Peanut is hybrid<br>
<hr>
The STALKER system allows disjunctions in wrappers, such as:
The start of the gender is either “:” or “(”.
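A small sketch of such a disjunction. It only illustrates the idea and is not STALKER's landmark-based rule language. The start delimiters are the ones named above. The end delimiters `)` and `<br>` and the function name `extract_gender` are assumptions.

```python
STARTS = [":", "("]    # the start of the gender is either ":" or "("
ENDS = [")", "<br>"]   # assumed end delimiters, not from the slides


def extract_gender(line):
    """Return the text after the first matching start delimiter, or None."""
    for start in STARTS:
        i = line.find(start)
        if i < 0:
            continue  # try the next alternative of the disjunction
        begin = i + len(start)
        stops = [j for j in (line.find(e, begin) for e in ENDS) if j >= 0]
        if stops:
            return line[begin:min(stops)].strip()
    return None


for line in ["Eclair: female<br>", "Bananas (male)<br>", "Peanut is hybrid<br>"]:
    print(line, "->", extract_gender(line))
```

The third line matches neither alternative, so the rule would need one more disjunct to cover it.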
References
Kushmerick: Wrapper Induction
ROADRUNNER
Web data mining class
Muslea: STALKER