Fabian M. Suchanek
Course material adapted from Antoine Amarilli, which is based on the slides of the Web Data Management book.
Web Search
>introduction
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
Overview
2
• Trump is a chump
• Dump Trump!
• Elvis is alive
Def: Search engine
A (web) search engine is a software system that, given a user query, returns a list of Web documents.
3
User query:
Search Engine
Result
Is Trump a chump?
>details
• In the very beginning: a centralized list of Web servers (example: DMOZ)
• Later: only links (and URL exchange) to find information
• Early efforts: Wanderer, W3Catalog (1993)
• First search engine with crawling, indexing, and searching: JumpStation (December 1993)
• First full-text search engine: WebCrawler (April 1994; brand still active)
• Lycos (1994)
• Search engines proved more successful than hand-curated portals.
• Nowadays: 93% of Internet traffic comes from search engines. (Forrester Research, 2006)
Search engines
4
>details
• Yahoo! (1994): Initially a directory.
• Excite (1995).
• AltaVista (1995): In 1998, 20 machines, 130 GB RAM, 500 GB HDD, 13M queries/day (150 queries/second).
• Netscape (1996): Yahoo!, Magellan, Lycos, Infoseek, Excite.
• Ask Jeeves (1996): Tentative natural language support.
• Yandex (1997): Popular in Russia (52% market share as of 2018).
• Google Search (1997): Invented PageRank.
• GoTo (1998): Pay to appear for certain keywords.
• MSN Search (1998): Microsoft’s answer to Google.
• Baidu (2000): Popular in China (76% market share).
• DuckDuckGo (2008): Privacy-aware search engine.
• Startpage (2009): Anonymizes Google searches.
• Qwant (2013): French privacy-aware answer to Google.
Important search engines
5
>details
As of 2018, according to
NetMarketShare.com
Search engine market share
6
Google (72%)
Baidu (13%)
Bing (7%)
Yahoo (4%)
Yandex (1%)
DuckDuckGo (0.23%), Ask, AOL,...
>details
Overview
7
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
Web
page
1
Web
page
2
Web
page
3
Web
page
4
Web
page
1000
Web
page
17
Web
page
42
A Search engine
8
Is Trump a chump?
User query:
Search Engine
Result
...
Web Crawler
Donald John Trump (born June 14, 1946) is the 45th President of the United States.
The United States unites lots of states: Some of the cooler ones are California and New York.
A Web crawler is a system that follows hyperlinks, collecting all pages on the way.
9
*click*
*click*
*click*
BFS>12
>details
A crawler does BFS on URLs
10
1. Start with queue of important URLs
BFS = Breadth First Search
http://...
http://...
http://...
A crawler does BFS on URLs
11
2. Download
http://...
http://...
http://...
A crawler does BFS on URLs
3. If page is “good”, add it to corpus
12
?
http://...
http://...
A crawler does BFS on URLs
4. Find URLs in page, enqueue them
13
<a href=XYZ></a>
http://...
http://...
http://...
A crawler does BFS on URLs
5. Repeat the process until you have covered all pages
• within a certain depth
• in a certain domain
• with certain topics
(A minimal crawler sketch follows below.)
14
http://...
>details
http://...
http://...
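Below is a minimal Python sketch of this BFS crawling loop. It is illustrative only: links are extracted with a naive regular expression, and politeness, robots.txt, and duplicate detection (discussed next) are ignored.

```python
from collections import deque
from urllib.parse import urljoin
import re
import urllib.request

def crawl(seed_urls, max_pages=100):
    """Minimal BFS crawler sketch: dequeue a URL, download it,
    keep the page, and enqueue the URLs found in it."""
    queue = deque(seed_urls)          # 1. start with a queue of important URLs
    seen = set(seed_urls)
    corpus = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:                          # 2. download
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        corpus[url] = html            # 3. if the page is "good", add it to the corpus
        for link in re.findall(r'href="([^"]+)"', html):   # 4. find URLs in the page
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)                      # ... and enqueue them
    return corpus                     # 5. repeat until the frontier is exhausted
```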
• In an HTML page:
  • Hyperlinks: <a href="...">
  • Media: <img src="...">, <audio src="...">, <video src="...">, <source src="...">
  • Frames: <iframe src="...">
  • JavaScript: window.open("...") (undecidable in general)
  • Page text, via regular expressions.
• In other kinds of files (PDFs...).
• In sitemaps provided specifically to crawlers.
• In server logs available online (could include Referer).
• From existing URLs: parent folder.
=> Also possible: change GET parameters...
Finding new URLs
15
>details
Freshness Problem
• Content on the Web changes
• Different change rates:
  online newspaper main page: every hour or so
  published article: virtually no change
• Continuous crawling, and identification of change rates for adaptive crawling:
  If-Modified-Since / Last-Modified HTTP headers (not reliable)
  Identification of duplicates in successive requests
16
>details
• Prevent multiple indexing and penalize content farms.
• Prevent duplicate URLs by canonicalization:
  http://example.com:80/foo = http://example.com/bar/../foo = http://www.example.com/foo
• Detect duplicate pages by using a hash function.
• Detect near-duplicates (dates, etc.) by using a similarity function
  (e.g., Broder's MinHash from 1997, used in AltaVista).
If you are designing a Web site with multiple legitimate copies:
=> Use 301 redirects to remove unneeded copies.
=> Specify the canonical copy using: <link rel="canonical" href="http://example.com/">
Duplicate pages
17
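As an illustration of URL canonicalization, here is a small Python sketch. The rules are assumptions chosen to cover the slide's example (lowercase the host, drop the default port 80, resolve ".." in the path, strip a leading "www."); real engines apply many more rules.

```python
from urllib.parse import urlsplit, urlunsplit
from posixpath import normpath

def canonicalize(url):
    """Illustrative URL canonicalization (not a complete rule set)."""
    scheme, netloc, path, query, _ = urlsplit(url)
    host = netloc.lower()
    if host.startswith("www."):
        host = host[4:]               # strip "www."
    if host.endswith(":80"):
        host = host[:-3]              # drop the default HTTP port
    path = normpath(path) or "/"      # resolve "." and ".."
    return urlunsplit((scheme.lower(), host, path, query, ""))

# All three variants from the slide map to the same canonical URL:
print(canonicalize("http://example.com:80/foo"))
print(canonicalize("http://example.com/bar/../foo"))
print(canonicalize("http://www.example.com/foo"))
```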
>details
• Wait a minimal delay between requests to the same server.
=> Depends on the server (wikipedia.org vs your laptop).
=> Depends on the resource (large files...).
=> Generally, waiting at least one second is preferable.
• Requests to different servers can be parallelized.
• Requests should be run asynchronously.
• The HTTP connection should remain open.
• Requests can be distributed across multiple machines.
• Crawlers represent about 20% of Web traffic.
Crawl scheduling
18
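A tiny sketch of per-server politeness in Python. The one-second minimum delay is the rule of thumb from the slide; a real crawler would combine this with asynchronous, parallel requests to different servers.

```python
import time
from urllib.parse import urlsplit

last_request = {}   # host -> time of the last request to that host
MIN_DELAY = 1.0     # wait at least one second per host (rule of thumb)

def polite_wait(url):
    """Sleep until MIN_DELAY has passed since the previous request
    to the same host, then record the current request time."""
    host = urlsplit(url).netloc
    wait = last_request.get(host, 0.0) + MIN_DELAY - time.time()
    if wait > 0:
        time.sleep(wait)
    last_request[host] = time.time()
```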
>details
Crawler traffic
Traffic on a3nm.net as of September 2013 (out of 36593 requests).
19
>details
• Robot Exclusion Standard: http://example.com/robots.txt
=> Only at root level (not available for subfolders).
=> Filtering by User-agent.
=> Disallow directive to forbid certain pages.
=> Also: Allow, Crawl-delay, Host, Sitemap.
• HTTP header: X-Robots-Tag (less support):
=> X-Robots-Tag: noindex
• Meta tag: <meta name="robots" content="noindex">
=> Also nofollow, nosnippet, noarchive...
• Links: <a href="secret/" rel="nofollow">
• Engine-specific interfaces (e.g., Google Webmaster Tools).
=> No guarantees!
Robot control (honor-based)
20
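The Robot Exclusion Standard can be checked with Python's standard library; a small sketch (the URLs and the user-agent string are just examples):

```python
import urllib.robotparser

# Fetch and parse robots.txt, then ask whether a given URL may be crawled.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", "http://example.com/secret/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```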
>details
• Completely Automated Public Turing test to tell Computers and Humans Apart (trademarked by CMU, but patented by AltaVista).
• Making a computer able to recognize humans.
• Accessibility problems for the blind (mitigated by alternatives).
• Can be any AI problem: add two numbers, listen to a word, recognize an animal in an image, etc.
Robot control with CAPTCHAs
How can we discriminate against
robots
?
21
>details
CAPTCHAs can be used to
• digitize books
Show one word that we know (to validate the user),
and one word that we want to digitize (to digitize the book)
• Show ads
Ask the user to type a slogan
• Do recognition of street numbers in
Google street view images
ReCAPTCHAs
22
>details
• Employ humans to remotely solve CAPTCHAs
(“sweatshops”, hundreds per hour)
• Sometimes there may be no ground truth
• Optical character recognition has improved and can
solve some CAPTCHAs
Breaking CAPTCHAs
23
>details
A spider trap (also: crawler trap, robot trap) is a set of web pages that cause a web crawler to make an infinite number of requests or cause a poorly constructed crawler to crash. [Wikipedia/Spider trap]
(Example of an unintentional trap: an online calendar widget, “January 1st: no meetings”, with endless next>/<prev links.)
Spider traps can be intentional or unintentional.
Can be used to trap spiders that do not follow robots.txt :-)
“Robot Control” by Spider Traps
24
Example: http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
>details
We can use an existing Web crawl:
                   pages   size
ClueWeb            1b      25 TB
CommonCrawl        6b      100 TB
Internet Archive   2b      80 TB
25
Overview
26
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
Web
page
1
Web
page
2
Web
page
3
Web
page
4
Web
page
1000
Web
page
17
Web
page
42
A Search engine
27
Will Trump slump?
User query:
Search Engine
Result
...
How does this work?
Web
page
17
Web
page
42
Trump: page 1, page 17, page 42, page 13, ...
slump: page 5, page 17, page 42, page 300, ...
music: page 2, page 20, page 500, ...
...
A Search engine
28
User query:
Search Engine
Result
Will Trump slump?
The index of a search engine (also: inverted index) is a map from words to Web documents (and potentially more information).
Search Engine Index
29
Search Engine
>details
>whichWords
Trump: page 1, page 17, page 42, page 13, ...
slump: page 5, page 17, page 42, page 300, ...
music: page 2, page 20, page 500, ...
...
<html>
<head>
<title>Words that rhyme with Trump</title>
<meta charset=UTF-8>
<script>
...
</script>
</head>
<body>
<h1>Words that rhyme with Trump</h1>
There are lots of words that rhyme with “Trump”, for example
...
Which words go into the index?
30
>details
>whichWords
• The content of the page.
  (Importance could be weighted: headers, emphasized text...)
• The <title>.
• The page URL.
• Meta tags:
  • <meta name="description" content="My website"> (used by Google snippets).
  • <meta name="keywords" content="great,website"> (not used by Google).
  • Engine-specific: <meta name="google" ...>.
• Semantic annotations with RDFa/JSON-LD.
• Other file formats.
31
Which words go into the index?
>details
>whichWords
• Use meta-information (may not be reliable).
• Look at characters (not specific enough).
• Look at the frequent k-grams and classify.
=> Surprisingly reliable.
Language identification
Compare:
th awarting problecon inge pria notobe inguarguarew for andes so col
th anciped the cons ors lessolat reatencle behas nuarocialso by in as
the re laturocalis com ad ble of th the in congto res wortionnetesers a
with:
encer desemps our parcellecon le de grarts is bale diatincyclossibles
inats res ne de le catictiont pectionne cofon les des sour dand communes
mation parchivell pedique de dontanduismes ce sour les pages ent inse
32
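A toy sketch of k-gram-based language identification in Python. The training samples and the similarity measure are simplistic placeholders; real systems use much larger character k-gram profiles and better statistics.

```python
from collections import Counter

def kgram_profile(text, k=3):
    """Frequency profile of character k-grams (here: trigrams)."""
    text = " " + text.lower() + " "
    return Counter(text[i:i+k] for i in range(len(text) - k + 1))

def similarity(profile_a, profile_b):
    """Number of shared k-grams, weighted by their counts (toy measure)."""
    return sum(min(profile_a[g], profile_b[g]) for g in profile_a)

# One profile per language, trained on tiny placeholder samples;
# a new document is assigned to the most similar profile.
profiles = {"en": kgram_profile("the of and to in is that it for on"),
            "fr": kgram_profile("le de et la les des en du que pour")}
doc = kgram_profile("les pages de ce site sont en français")
print(max(profiles, key=lambda lang: similarity(profiles[lang], doc)))  # fr
```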
>details
doc1: "This Web page explains why Trump is so grumpy."
doc2: "Is Trump a crump or a frump?"
doc3: "Words that rhyme are fine"
We clean up our Web documents so that they consist of only the parts that we want to index. This yields the document collection:
Document collection
33
>details
>Tokenization
A token is a string of one or more characters that is significant as a group. The process of splitting a string into tokens is called tokenization. [Wikipedia/Token]
Tokenization
34
doc1: This, Web, page, explains, why, Trump, is, so, grumpy, .
doc2: Is, Trump, a, crump, or, a, frump, ?
doc3: Words, that, rhyme, are, fine
>details
>Tokenization
Challenges:
• Absence of whitespace to separate words in some languages
• Entities such as acronyms, numbers, URLs, emails, units, etc.
• Word boundaries (“host name” or “hostname”?)
35
>Tokenization
doc1: This, Web, page, explains, why, Trump, is, so, grumpy, .
doc2: Is, Trump, a, crump, or, a, frump, ?
doc3: Words, that, rhyme, are, fine
>details
A token is a string of one or more characters that is significant as a group. The process of splitting a string into tokens is called tokenization.
Tokenization
• We remove punctuation
• We remove capitalization
Normalization
36
doc1: this, web, page, explains, why, trump, is, so, grumpy
doc2: is, trump, a, crump, or, a, frump
doc3: words, that, rhyme, are, fine
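A minimal tokenization and normalization sketch in Python, assuming English text and treating every maximal run of letters and digits as a token:

```python
import re

def tokenize(text):
    """Naive tokenizer: every maximal run of letters/digits is a token
    (punctuation is dropped; real engines handle URLs, numbers, etc.)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(tokens):
    """Normalization as on the slide: lowercase all tokens."""
    return [t.lower() for t in tokens]

print(normalize(tokenize("This Web page explains why Trump is so grumpy.")))
# ['this', 'web', 'page', 'explains', 'why', 'trump', 'is', 'so', 'grumpy']
```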
>details
production, produced, produce, producer -> produc
The stem of a word is the part of the word that is common to all its inflected variants. [Paul Kroeger (2005). Analyzing grammar. Cambridge University Press. p. 248.]
The lemma of a word is its canonical form, as it appears in a dictionary. [Wikipedia/Lemma]
go, goes, went -> go
Stem, Lemma
37
>details
>Stemming
• Morphological stemming:
  • Remove: plural, gender, tense, mood...
  • Irregularities: “mouse”/“mice”.
  • Problem: ambiguities: “stocking”, “rose”, “couvent” (French).
• Lexical stemming:
  • Derived terms: “policy”, “politics”, “political”...
  => The most popular algorithm is the Porter stemmer.
• Semantic stemming:
  • Merge synonyms or near-synonyms. (Not always relevant.)
• Phonetic stemming:
  • Merge words that sound similar, to account for spelling mistakes.
    E.g., for English, Soundex stems Robert and Rupert to R163.
Stemming is the process of replacing a word by a normalized form (the stem or the lemma, depending on the application).
Stemming
38
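A short stemming sketch using the Porter stemmer from the NLTK library (assuming nltk is installed; Porter stems need not be dictionary words, and the exact stems may differ slightly from the illustration above):

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["production", "produced", "produce", "producer", "explains"]:
    print(word, "->", stemmer.stem(word))
# e.g. production -> product, explains -> explain
```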
>details
>Stemming
Stemming Example
39
doc1: this, web, page, explains, why, trump, is, so, grumpy
doc2: is, trump, a, crump, or, a, frump
doc3: words, that, rhyme, are, fine
>stopwords
Stemming
is the process of replacing a word by a normalized form
(the stem or the lemma, depending on the application).
Stemming Example
40
doc1: this, web, page, explain, why, trump, is, so, grump
doc2: is, trump, a, crump, or, a, frump
doc3: words, that, rhyme, are, fine
>stopwords
Stemming
is the process of replacing a word by a normalized form
(the stem or the lemma, depending on the application).
Stopword
A stopword is a word that is common in a corpus but has little value for searching.
(The definition of stopwords is application-dependent. This one is from Postgres.)
Example: Is Trump a crump or a slump? This question makes people jump.
41
Usually all words are stopwords, except
• nouns,
• adjectives
• non‐auxiliary verbs
>details
>stopword details
Stopword Rationale
Imagine we search for
How many cats do the Simpsons have?
List of Simpson cats:
...
Here we do explain
how many teeth the
chicken have.
42
>details
Stopword Rationale
Imagine we search for
How many cats do the Simpsons have?
List of Simpson cats:
...
Here we do explain
how many teeth the
chicken have.
43
Overlap: 5
Overlap: 2
Stopword Rationale
Imagine we search for
cats Simpsons?
List of Simpson cats:
...
Here we do explain
how many teeth the
chicken have.
44
Overlap: 0
Overlap: 2
>details
Stopword removal
We remove stopwords from our document collection.
45
doc1: this, web, page, explain, why, trump, is, so, grump
doc2: is, trump, a, crump, or, a, frump
doc3: words, that, rhyme, are, fine
>details
Stopword removal
=>
Storing the index takes
less space
.
=>
Processing is
faster
.
46
doc1: web, page, explain, trump, grump
doc2: trump, crump, frump
doc3: words, rhyme, fine
We remove stopwords from our document collection.
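A minimal stopword-removal sketch in Python. The stopword list is a small illustrative set chosen to reproduce the example above; real lists are application-dependent.

```python
STOPWORDS = {"this", "is", "so", "why", "a", "or", "that", "are"}
# ^ illustrative list only; see the definition of stopwords above

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["this", "web", "page", "explain", "why",
                        "trump", "is", "so", "grump"]))
# ['web', 'page', 'explain', 'trump', 'grump']
```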
Overview
47
>motiv TF-IDF
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
Ideas for Relevance: TF-IDF
48
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
Query: Will Trump slump?
Ideas for Relevance: TF-IDF
49
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
Query: Will Trump slump?
The term frequency tf(t, d) is the proportion of terms in document d that are equal to t.
(The more t appears in d, the more d is relevant for t.)
The inverse document frequency idf(t, D) is the logarithm of the inverse proportion of documents of the collection D where t occurs.
(The rarer t is, the more informative it is.)
Ideas for Relevance: TF-IDF
50
The term frequency tf(t, d) is the proportion of terms in document d that are equal to t.
(The more t appears in d, the more d is relevant for t.)
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
Query: Will Trump slump?
TF-IDF is the following measure for the importance of a document given a query term:

  tfidf(t, d, D) = tf(t, d) · idf(t, D)

with
  tf(t, d) = n(t, d) / |d|
  idf(t, D) = log( |D| / |{d' in D : t occurs in d'}| )

  t: term
  d: document
  D: document collection
  n(t, d): number of occurrences of t in d
TF-IDF
51
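A small Python sketch of these formulas. The base-10 logarithm used here reproduces the example index values shown a few slides later; the base of the logarithm is only a convention.

```python
import math

def tf(term, doc):
    """Term frequency: proportion of tokens in doc that equal term."""
    return doc.count(term) / len(doc)

def idf(term, collection):
    """Inverse document frequency: log of |D| / #documents containing term."""
    containing = sum(1 for doc in collection if term in doc)
    return math.log10(len(collection) / containing) if containing else 0.0

def tf_idf(term, doc, collection):
    return tf(term, doc) * idf(term, collection)

docs = [["trump", "slump", "garden", "flower", "water", "sun"],
        ["trump", "trump", "slump"],
        ["slump", "slump", "trump"],
        ["dog", "slump", "cat", "slump", "cow", "slump"]]
print(round(tf_idf("trump", docs[0], docs), 2))   # 0.02
print(round(tf_idf("trump", docs[1], docs), 2))   # 0.08
print(round(tf_idf("slump", docs[0], docs), 2))   # 0.0 (slump occurs in every document)
```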
>task
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
Query: Will Trump slump?
Compute TF and IDF values for this document collection:
Task: TF-IDF
52
        trump               slump
        TF      TF-IDF      TF      TF-IDF
doc1    ...     ...         ...     ...
doc2    ...     ...         ...     ...
IDF     ...                 ...
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
Compute TF and IDF values for this document collection:
Task: TF-IDF
53
        trump               slump
        TF      TF-IDF      TF      TF-IDF
doc1    ...     ...         ...     ...
doc2    ...     ...         ...     ...
IDF     ...                 ...
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
trump
: doc1/0.02, doc2/0.08, ...
slump
: doc1/0, doc2/0, ...
...
54
TF-IDF in the index
The TF-IDF values are commonly stored in the index.
>Fagin & Map Reduce
>Fagin
doc1: trump, slump, garden, flower, water, sun
doc2: trump, trump, slump
doc3: slump, slump, trump
doc4: dog, slump, cat, slump, cow, slump
elvis: doc2/0.15, doc1/0.11, ...
pelvis: doc1/0.05, doc2/0.03, ...
...
=>
Finding the top-k documents for a single term query is very easy!
55
Index stores terms by decr. TF-IDF
The documents are sorted by decreasing TF-IDF in each list.
Query: elvis
find top-k documents
>Fagin & Map Reduce
>Fagin
If a query consists of multiple keywords, we need an aggregation function, which computes the score of a document from the tf-idf scores of the keywords.
The aggregation function should be monotonic. Common choices are sum, prod, min, max.
E.g.: score(doc1, <elvis, pelvis>) = 0.11 + 0.05
56
Multi‐keyword queries
Query: elvis, pelvis
elvis: doc2/0.15, doc1/0.11, ...
pelvis: doc1/0.05, doc2/0.03, ...
...
>Fagin & Map Reduce
>Fagin
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
doc1 OK
this document has been seen
in all lists.
57
Fagin’s Algorithm
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
doc1 OK
doc2
doc4
58
Fagin’s Algorithm
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
doc1 OK
doc2
doc4
doc3 OK
59
Fagin’s Algorithm
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
doc1 OK
doc2 OK
doc4
doc3 OK
doc7
60
Fagin’s Algorithm
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
doc1 OK
doc2 OK
doc4
doc3 OK
doc7
k documents have been
seen in all lists => stop
61
Fagin’s Algorithm
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
• For all documents that you have seen in any list so far,
write down its tf-idf for each keyword, and its score
doc1 OK
doc2 OK
doc4
doc3 OK
doc7
62
Fagin’s Algorithm
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
• For all documents that you have seen in any list so far,
write down its tf-idf for each keyword, and its score
doc1 OK
doc2 OK
doc4
doc3 OK
doc7
We saw these values
anyway already
during our scan.
63
Fagin’s Algorithm
        Obama   drama   score
doc1    0.10    0.08    ...
doc2    0.08    0.05    ...
doc3    0.05    0.06    ...
doc4            0.07    ...
doc7    0.05            ...
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,...
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,...
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
• For all documents that you have seen in any list so far,
write down its tf-idf for each keyword, and its score
doc1 OK
doc2 OK
doc4
doc3 OK
doc7
64
Fagin’s Algorithm
        Obama   drama   score
doc1    0.10    0.08    0.18
doc2    0.08    0.05    0.13
doc3    0.05    0.06    0.11
doc4    0.05    0.07    0.12
doc7    0.05    0.02    0.07
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,..., doc4/0.05
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,..., doc7/0.02
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
• For all documents that you have seen in any list so far,
write down its tf-idf for each keyword, and its score
• Output the top k documents with highest score
doc1 OK
doc2 OK
doc4
doc3 OK
doc7
65
Fagin’s Algorithm
        Obama   drama   score
doc1    0.10    0.08    0.18
doc2    0.08    0.05    0.13
doc3    0.05    0.06    0.11
doc4    0.05    0.07    0.12
doc7    0.05    0.02    0.07
>Fagin & Map Reduce
>Fagin
Fagin’s algorithm scans the lists only until we have seen k documents in all lists. Why is this sufficient?
See also: http://alumni.cs.ucr.edu/~skulhari/Top-k-Query.pdf
66
Fagin’s Algorithm
Query: Obama drama, k=3
Obama: doc1/0.10, doc2/0.08, doc3/0.05, doc7/0.05,..., doc9/0.04
drama: doc1/0.08, doc4/0.07, doc3/0.06, doc2/0.05,..., doc9/0.04
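A compact Python sketch of Fagin's algorithm, using sum as the aggregation function and the Obama/drama toy index from the previous slides (random access is simulated by a dictionary lookup, with 0.0 for a document missing from a list):

```python
def fagin(index, keywords, k):
    """index maps each keyword to a list of (doc, score) sorted by decreasing score."""
    lists = [index[w] for w in keywords]
    seen_in = {}            # doc -> set of keywords in whose list the doc was seen
    complete = set()        # docs seen in all lists
    depth = 0
    while len(complete) < k and depth < max(len(l) for l in lists):
        for w, lst in zip(keywords, lists):          # scan all lists in parallel
            if depth < len(lst):
                doc, _ = lst[depth]
                seen_in.setdefault(doc, set()).add(w)
                if len(seen_in[doc]) == len(keywords):
                    complete.add(doc)
        depth += 1
    lookup = {w: dict(index[w]) for w in keywords}   # "random access" for missing scores
    scores = {doc: sum(lookup[w].get(doc, 0.0) for w in keywords) for doc in seen_in}
    return sorted(scores, key=scores.get, reverse=True)[:k]

index = {"Obama": [("doc1", 0.10), ("doc2", 0.08), ("doc3", 0.05), ("doc7", 0.05)],
         "drama": [("doc1", 0.08), ("doc4", 0.07), ("doc3", 0.06), ("doc2", 0.05)]}
print(fagin(index, ["Obama", "drama"], k=3))         # ['doc1', 'doc2', 'doc3']
```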
>Fagin & Map Reduce
>Task
elvis: doc4/0.20, doc2/0.15, doc3/0.01, doc6/0.0, doc5/0.0
pelvis: doc3/0.15, doc4/0.10, doc5/0.05, doc2/0.02
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
• For all documents that you have seen in any list so far,
write down its tf-idf for each keyword, and its score
• Output the top k documents with highest score
67
Task: Fagin’s Algorithm
Query: elvis pelvis, k=2
score
Given: inverted index, integer k, keywords
• Scan the lists for the keywords top-down in parallel
until k documents have been seen in all lists
• For all documents that you have seen in any list so far,
write down its tf-idf for each keyword, and its score
• Output the top k documents with highest score
68
Task: Fagin’s Algorithm
elvis: doc4/0.20, doc2/0.15, doc3/0.01, doc6/0.0, doc5/0.0
pelvis: doc3/0.15, doc4/0.10, doc5/0.05, doc2/0.02

        elvis   pelvis  score
doc4    0.20    0.10    0.30
doc3    0.01    0.15    0.16
doc2    0.15    0.02    0.17
doc5    0.0     0.05    0.05

=> doc4, doc2
Query: elvis pelvis, k=2
>Map Reduce
doc1
doc2
docn
69
Computing IDF in parallel
...
>Map Reduce
doc1
doc2
docn
70
Computing IDF in parallel
...
doc1
doc2
docn
(t1, doc1)
(t2, doc1)
...
(t7, doc2)
(t2, doc2)
...
(t9, docn)
(t1, docn)
...
71
Computing IDF in parallel
...
doc1
doc2
docn
(t1, doc1)
(t2, doc1)
...
(t7, doc2)
(t2, doc2)
...
(t9, docn)
(t1, docn)
...
(t1,(doc1,doc7,docn))
72
Computing IDF in parallel
...
(t2, (doc1, doc2))
...
(tm, (...))
group by terms
doc1
doc2
docn
(t1, doc1)
(t2, doc1)
...
(t7, doc2)
(t2, doc2)
...
(t9, docn)
(t1, docn)
...
(t1,(doc1,doc7,docn))
(t1, log(n/3))
73
Computing IDF in parallel
...
group by terms
(t2, (doc1, doc2))
...
(tm, (...))
>Map Reduce
doc1
doc2
docn
(t1, doc1)
(t2, doc1)
...
(t7, doc2)
(t2, doc2)
...
(t9, docn)
(t1, docn)
...
(t1,(doc1,doc7,docn))
(t1, log(n/3))
(tm, log(n/...))
(t2, log(n/2))
74
Computing IDF in parallel
...
group by terms
(t2, (doc1, doc2))
...
(tm, (...))
doc1
doc2
(t1, doc1)
(t2, doc1)
...
(t7, doc2)
(t2, doc2)
...
(t1,(doc1,doc7,docn))
(t1, log(n/3))
(t2, log(n/2))
Map the input to (key, value) pairs.
Group by key: (key, (value1, ...)).
Reduce the value list to an aggregate value: (key, aggregate).
75
The Map Reduce Framework
group by terms
(t2, (doc1, doc2))
>Map Reduce
doc1
doc2
(t1, doc1)
(t2, doc1)
...
(t7, doc2)
(t2, doc2)
...
(t1,(doc1,doc7,docn))
(t1, log(n/3))
(t2, log(n/2))
How to distribute the tasks to machines?
How to group the data by keys?
How to distribute the work?
76
Challenges in MapReduce
(t2, (doc1, doc2))
>Map Reduce
• MapReduce is a framework for distributed computing.
• Published by Google in 2004: model and proprietary implementation.
• Apache Hadoop: open-source implementation.
• Express your task in terms of:
  Map: operations to run in parallel
  Reduce: operations to aggregate the results of the Map steps
• The MapReduce implementation takes care of:
  • Scheduling the various operations.
  • Dispatching the jobs to the various workers.
  • Monitoring the jobs and restarting failed jobs.
  • Moving the computation results around.
Def: MapReduce
77
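The IDF pipeline of the previous slides, written as a single-machine Python sketch of the map, group-by-key, and reduce steps (a real MapReduce job would distribute each step over many workers; the tiny document collection is a placeholder):

```python
import math
from collections import defaultdict

docs = {"doc1": ["trump", "slump"],
        "doc2": ["trump", "trump", "slump"],
        "doc3": ["dog", "slump"]}
n = len(docs)

# Map: emit a (term, doc) pair for every distinct term of every document.
pairs = [(term, doc) for doc, terms in docs.items() for term in set(terms)]

# Group by key (the term): (term, [doc1, doc7, ...])
groups = defaultdict(list)
for term, doc in pairs:
    groups[term].append(doc)

# Reduce: aggregate each document list into the IDF value of the term.
idf = {term: math.log(n / len(doclist)) for term, doclist in groups.items()}
print(idf)   # e.g. {'trump': log(3/2), 'slump': 0.0, 'dog': log(3/1)}
```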
Overview
78
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
>intro
• Developed by Larry Page and Sergey Brin, 1996.
  (Initially a research project at Stanford.)
• Patent (assigned to Stanford University, but Google has exclusive license rights).
• Main idea: Use the links on the Web to estimate the importance of pages.
=> Precompute this independently from the query.
=> (Also: distribute it massively on cheap machines.)
• Related idea: hubs and authorities.
PageRank is an algorithm that determines the importance of a Web page based on the links between Web pages, independently of the query.
PageRank
79
The Web graph is a graph where the nodes are Web pages and there is an edge from x to y iff page x contains a link to page y.
Web graph
80
(Figure: three pages, http://a.com/index.html, http://b.com/index.html, and http://c.com/index.html, linking to each other with <a href="..."> links.)
>graph
A directed graph is a set V of “nodes” (also: vertices) and a relation of “edges” E ⊆ V × V.
Graph
81
(Example graph with nodes a, b, c.)
>graph
The adjacency matrix of a graph (V, E) is a matrix M with M[i][j] = 1 if (i, j) ∈ E, and M[i][j] = 0 otherwise.
    a  b  c
a   0  1  1
b   1  0  1
c   0  0  0
Adjacency matrix
82
(Example graph: a → b, a → c, b → a, b → c.)
Intuitively, a page is more important if it has many incoming
links from important pages.
http://en.wikipedia.org/wiki/File:PageRanks-Example.svg
PageRank Intuition
83
Another way to see it: The page rank of a page is the probability that a random surfer lands on the page.
http://en.wikipedia.org/wiki/File:PageRanks-Example.svg
PageRank Intuition
84
Problem: If an area has only incoming links, the surfer will never
leave this area. We have a
spider trap
.
http://en.wikipedia.org/wiki/File:PageRanks-Example.svg
Spider Traps
85
http://en.wikipedia.org/wiki/File:PageRanks-Example.svg
Spider Traps
86
Problem: If an area has only incoming links, the surfer will never
leave this area. We have a
spider trap
.
We just add a “random jump” from every page to every other page.
http://en.wikipedia.org/wiki/File:PageRanks-Example.svg
The Google Solution to Spider Traps
87
>formulae
->total-2018
PageRank Formula
88-91

  PR(p) = (1 - d)/N + d · Σ_{q → p} PR(q) / out(q)

  N: total # nodes
  out(q): # outgoing links of page q
  d: dampening factor, usually 0.85
  (1 - d)/N: random jump

PR(p) is the probability that a random surfer lands on page p, if she performs a jump to a random node with probability 1 - d at every step.
Problem: This formula is recursive!
Simple algorithm:
• Initialize PR(p) := 1/N for every page p
• Apply the PageRank formula to every node
  (always on the previous values for the other nodes)
• Replace all node values with their new values
• Repeat until convergence
This algorithm always converges if 0 ≤ d < 1.
PageRank Algorithm
92
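A minimal Python sketch of this iterative algorithm (assuming every page has at least one outgoing link; dangling pages would need extra treatment):

```python
def pagerank(graph, d=0.85, iterations=50):
    """graph maps every node to the list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}              # initialize with 1/N
    for _ in range(iterations):                   # repeat until (approximate) convergence
        new_pr = {}
        for p in nodes:
            incoming = sum(pr[q] / len(graph[q]) for q in nodes if p in graph[q])
            new_pr[p] = (1 - d) / n + d * incoming
        pr = new_pr                               # replace all values with the new ones
    return pr

# Toy Web graph: a -> b, a -> c, b -> c, c -> a
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```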
>eigen
>task
Task: PageRank Algorithm
93
Start with the initialization above and run the algorithm on a small example graph with nodes a, b, c.
The page rank vector is the eigenvector of the (modified) adjacency matrix.
PageRank as Eigenvector
94
Idea (simplified): define the matrix M with M[i][j] = 1/out(j) if page j links to page i (otherwise M[i][j] = 0).
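A small numpy sketch of the eigenvector view: power iteration on the modified (random-jump) matrix converges to the PageRank vector. The toy graph is an assumption for illustration: a links to b and c, b links to c, c links to a.

```python
import numpy as np

# Link matrix for the toy graph: M[i][j] = 1/out(j) if page j links to page i
# (columns sum to 1; row/column order is a, b, c).
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
d, n = 0.85, 3
G = (1 - d) / n + d * M          # add the random jump to every entry

v = np.ones(n) / n               # start from the uniform distribution
for _ in range(50):              # power iteration -> principal eigenvector
    v = G @ v
print(v)                         # PageRank vector for (a, b, c)
```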
Overview
95
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
->total-2018
• Two kinds of results:
  Organic: automatically generated from the Web.
  Paid: auctioned to publishers.
• Organic results are the impartial output of algorithms and cannot be bought (however, who chooses the algorithms?).
• Paid results are just advertising (online advertising: 36 billion dollars of revenue in the US in 2012).
• Targeting based on the keywords, geographical location, the type of device...
• Cost per click (CPC): advertisers are billed based on clicks; the search engine tries to optimize revenue.
• Cost per action (CPA): advertisers are billed based on conversions (registration, sale, etc.).
• Risks of abuse (click farming...).
Organic and paid results
96
->end
Tracing the user
http://google.com/dashboard
>more
(Real example in my family, deduced automatically by Google)
97
The big internet companies trace the user through Web searches,
cookies, ads, and fingerprinting in order to serve ads.
->security
What it means if they know
You get unsolicited advertisements from companies that you never told about your life, sometimes even before the event happens.
(real examples in my family)
98
>more
What it means if they know
99
The service providers may also know if you are
• planning a divorce
• having a medical problem
• having an uncommon sexual preference
• under-age and pregnant
This can
• influence the ads you see, even if Google explicitly disallows it
• make court actions against the service providers difficult (blackmail).
Dubious service providers may sell your information to data brokers, feeding background checks for
• credit scores
• insurance fees and insurance claims
• advertisements
• hiring decisions
See the story of the pregnant daughter.
->end
SEO (search engine optimization) comprises techniques to improve the ranking of a page in search engines.
• White hat techniques, or recommended practices.
• Black hat techniques, discouraged by search engines.
• Details of search engine algorithms are proprietary.
=> Search engines are certainly not just PageRank + TF-IDF!
• Search engines try to protect themselves against abuse.
SEO
100
->end
• Provide content!
• Make sure that all pages can be reached by following static links.
• Ensure that your site contains appropriate keywords.
• Use text rather than images or Flash.
• Use alt text for images (also for accessibility).
• Have a clean URL structure (no messy GET parameters).
• Avoid being hacked.
• Remove user-generated spam.
• Use meta tags that are read by the search engines.
General SEO advice
101
->end
• When changing the URL of a website, put in place a 301 redirect to transfer PageRank.
• Links with rel="nofollow" are not accounted for in PageRank (use them to link without endorsing).
• Buy links on high-PageRank websites, or exchange links.
• Insert links in forum posts, wikis, etc.
  (Some bots submit spammy content in all forms they find.)
• Create backlinks with misleading text: Google bombing.
PageRank tweaks (white to grey)
102
->end
• Put keywords in the URL: http://example.com/articles/42-a-long-title.
• Keyword-stuff the page.
• Abuse CSS, JavaScript, redirects, etc., to make crawlers see different content.
• Use the User-Agent, IP ranges, etc., to serve different content to crawlers.
• Scrape existing content, optionally rewriting it (automatically or manually), or translating it.
Indexing tweaks (from grey to black)
103
->end
Overview
104
Introduction
Crawling
Indexing
Searching
PageRank
Business
Innovation
• Pages that have no links to them.
• For instance, result pages from a search.
• 2001 estimate: the deep Web is hundreds of times larger than the reachable Web.
• Web form probing:
=> Need to figure out form constraints.
=> Need to come up with keywords.
=> Idea: feed back words from the website into the form.
• Risks? (Semantics of POST.)
Deep Web
105
Bergman, Michael K (August 2001). “The Deep Web: Surfacing Hidden Value”. The Journal of Electronic Publishing, 7 (1)
->end
• Usual search engines have a centralized index (and a proprietary back-end).
• P2P search: distribute the crawling and indexing between (untrusted) peers.
• Propagate queries to the various peers to answer them.
• Also used on darknets: RetroShare.
• Current implementations: Seeks, YaCy.
=> Doesn’t work very well yet.
P2P search
106
->end
• Run queries through multiple search engines.
• Aggregate the results.
• Also known as federated search.
• Historically: Copernic Desktop Search, Ixquick...
• Harder nowadays: there aren't many independent search engines...
Meta search engine
107
->end
• If a user clicks on a search result, it means that the result is relevant.
=> Track clicks on the result pages.
• Other possible signals (pure speculation):
  • New search from the same user?
  • Time spent on a page (Google Analytics?)
  • Feedback from social media (Like button, Google +1).
User feedback
108
->end
Search results can be ranked differently depending on the user:
• Language (crucial).
• Position (especially on mobile).
• Past searches (Google Web History, no account required).
• Social graph (searches by friends, feedback from friends).
• Similar users (clustering).
Search result personalisation
109
->end
Avoiding Filter bubbles
110
Your service provider decides what you get to see.
It wants to keep you happy.
=> It may show you only what you like.
   (think about Russia, Islamism, Trump)
=> “intellectual isolation”, “echo chamber”, “indoctrination with our own ideas”
=> reduced plurality, reinforced opinions, polarization
   “You enter as a vegetarian, you leave as a vegan”
=> “threat to democracy” (Barack Obama)
[Wikipedia: Filter bubble]
A number of search engines try to avoid filter bubbles.
• Identify typos in queries:
  • Edit distance approaches (spellchecking).
  • Statistical approaches: two queries with small edit distance in quick succession by the same user.
• Query autocompletion.
• Natural language queries.
• Voice recognition: Google Voice Search, Siri...
• Refine a query, use the current query as context.
• Structured data visualization: WolframAlpha.
Query language
111
->end
• There are other things than text on the Web.
• Historical approach: index multimedia content using text clues: surrounding text, file names, alt text, etc.
• Better: index the content of an image (or video, or speech).
• Voxalead: search in videos, uses speech-to-text.
• Search by image: Google Images, Google Goggles.
• Search by sound:
  • query by humming: Musipedia, SoundHound.
  • query by fingerprint: Shazam. Try it out!
Multimedia indexing
112
• There are different search engines to find Web content.
• The search engine crawls Web pages and builds an index.
• PageRank ranks pages independently of the query.
• There is a lot of business value both in providing Web search and in optimizing the ranking.
Summary
113
->total-2018