Deep Learning:
Embeddings
CC-BY
Fabian M. Suchanek
Overview

• Word2vec
• Other word embeddings
• Knowledge Base embeddings
One‐hot encodings

Given an ordered set of n words (a vocabulary), the one‐hot encoding
of the i-th word is a vector of length n that has only zeroes
and a single 1 at position i.

Vocabulary: Ω = {king, queen, woman, princess, ...}

The word “king”:   〈1, 0, 0, 0, ...〉
The word “woman”:  〈0, 0, 1, 0, ...〉

Watch out: “One hot” is not to be confused with “hot one”
when you do a Web search!
Advantage: We can now feed words to a neural network
Disadvantage: Similar words are not similar in the vector space
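
A minimal sketch of one‐hot encoding in Python (the tiny vocabulary is only for illustration):

    # One-hot encoding: a vector of length |vocabulary| with a single 1
    # at the position of the word.
    vocabulary = ["king", "queen", "woman", "princess"]

    def one_hot(word, vocabulary):
        vector = [0] * len(vocabulary)
        vector[vocabulary.index(word)] = 1
        return vector

    print(one_hot("king", vocabulary))   # [1, 0, 0, 0]
    print(one_hot("woman", vocabulary))  # [0, 0, 1, 0]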
Word embeddings

A word embedding for a vocabulary Ω is a mapping from Ω to ℝ^m
(with m < |Ω|), usually so that two vectors are similar iff their
words are similar.

“Dimensions”:             Royalty   Femininity   Masculinity   Age   ...
The word “king”:       〈    1         0            1           0.6   ...〉
The word “woman”:      〈   0.1        1            0           0.6   ...〉
The word “princess”:   〈   0.9       0.8           0           0.2   ...〉

How can we find such an embedding?

[The morning paper: The amazing power of word embeddings]
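
To make “similar vectors” concrete, here is a hedged sketch that compares the
illustrative vectors above by cosine similarity (the values are made up, not learned):

    import numpy as np

    # Hand-crafted vectors along the dimensions Royalty, Femininity, Masculinity, Age.
    king     = np.array([1.0, 0.0, 1.0, 0.6])
    woman    = np.array([0.1, 1.0, 0.0, 0.6])
    princess = np.array([0.9, 0.8, 0.0, 0.2])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(princess, woman))   # fairly similar: both are feminine
    print(cosine(princess, king))    # fairly similar: both are royal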
Def: Word2vec

Word2vec is a group of methods to produce word embeddings. One of
them is the Continuous Bag Of Words (CBOW) method, in which a FFNN is
trained to predict the middle word of a sliding window over a corpus.

Vocabulary size: v ; Hidden layer size: n < v ; Window size: 5

[Figure: a sliding window over the corpus “Elvis Presley plays guitar well”.
The one‐hot encodings of the context words are each fed through the same
n×v weight matrix W into the hidden layer; a v×n weight matrix then maps
the hidden layer to the output, which is trained to predict the one‐hot
encoding of the middle word.]
Word2vec

The first weight matrix W yields the word embedding. Rationale: the
activation of the hidden layer allowed the reconstruction of the middle word.

    embedding(word) = W × onehot(word)    (= a single column of the n×v matrix W)

[Figure: the same network as before; for the input “Presley” = 〈0,0,1,...〉,
multiplying W by this one‐hot vector selects the corresponding column of W,
e.g., 〈0.0, 1.0, 0.4, 0.3〉.]
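
A hedged numpy sketch of the CBOW forward pass, with toy sizes and random
weights (a real model learns W and W2 by backpropagation over the corpus):

    import numpy as np

    v, n = 10, 4                    # vocabulary size v, hidden layer size n < v
    W  = np.random.rand(n, v)       # shared input weight matrix (n×v)
    W2 = np.random.rand(v, n)       # hidden-to-output weight matrix (v×n)

    def one_hot(i, size=v):
        x = np.zeros(size); x[i] = 1.0
        return x

    # Average the hidden activations of the context words, predict the middle word.
    context = [0, 1, 3, 4]          # indices of the 4 context words in a window of 5
    hidden  = np.mean([W @ one_hot(i) for i in context], axis=0)
    scores  = W2 @ hidden
    probs   = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

    # After training, the embedding of word i is simply column i of W:
    embedding_of_word_3 = W @ one_hot(3)              # equals W[:, 3]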
Word2vec applications

Word2vec word embeddings have some impressive properties
(illustrated here in a putative 2-dimensional PCA projection):

[Figure: parallel “same difference” arrows between pairs such as man→woman
and uncle→aunt, king→queen, and king→kings and queen→queens.]

    king - man + woman = queen

[The morning paper: The amazing power of word embeddings]
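
A hedged sketch of this analogy arithmetic (the 3-dimensional vectors are
invented for illustration; in practice they come from a trained word2vec model):

    import numpy as np

    emb = {"king":  np.array([0.9, 0.1, 0.8]),
           "man":   np.array([0.1, 0.1, 0.8]),
           "woman": np.array([0.1, 0.1, 0.1]),
           "queen": np.array([0.9, 0.1, 0.1])}

    target = emb["king"] - emb["man"] + emb["woman"]

    def closest(target, emb, exclude=("king", "man", "woman")):
        # return the word whose vector is most similar (by cosine) to the target
        def cos(a, b): return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], target))

    print(closest(target, emb))   # queen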
Word2vec applications

Word2vec word embeddings have some impressive properties
(illustrated here in a putative 2-dimensional PCA projection):

[Figure: the “same difference” arrow appears between Paris→France and Berlin→Germany.]

Other examples of learned relationships:
    big - bigger
    Einstein - scientist
    Macron - France
    copper - Cu
    Berlusconi - Silvio
    Microsoft - Windows
    Microsoft - Ballmer
    Japan - sushi

[The morning paper: The amazing power of word embeddings]
Word2vec applications

Word2vec word embeddings have some impressive properties
(illustrated here in a putative 2-dimensional PCA projection):

[Figure: the “capital” arrow leads from Vietnam to Hanoi.]

    Vietnam + capital = Hanoi

Problem: word2vec cannot deal with unknown words.

[The morning paper: The amazing power of word embeddings]
Overview

• Word2vec
• Other word embeddings
• Knowledge Base embeddings
Sub‐word Tokenization

The central problem of word2vec is that it cannot deal with words
that were not seen during training.

Sub‐word tokenization is the process of tokenizing a text not into words,
but into parts of words. In the ideal case, the resulting tokens mirror
the morphemes, i.e., the parts of the word that carry meaning.
A word can then be embedded by embedding its constituents,
and averaging the resulting vectors.

Advantages:
• out‐of‐vocabulary words can be handled
• morphemes are captured

    “I tokenize subwords” -> [“I”, “token-”, “-ize”, “sub-”, “-word-”, “-s”]
N‐gram embeddings

An n‐gram in a string is a substring of length n.
n‐grams can be used for sub‐word tokenization and embedding.

Advantages:
• deals with out‐of‐vocabulary words
• can capture morphemes if they are of length n
Disadvantages:
• the vocabulary size is huge

n‐gram embeddings are used in FastText.

    “epistemology” -> {“epi”, “pis”, “ist”, “ste”, ...}
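
A minimal sketch of FastText-style character n‐gram tokenization (the choice
n=3 is illustrative):

    def ngrams(word, n=3):
        # return the character n-grams of a word, e.g., for sub-word tokenization
        return [word[i:i+n] for i in range(len(word) - n + 1)]

    print(ngrams("epistemology"))
    # ['epi', 'pis', 'ist', 'ste', 'tem', 'emo', 'mol', 'olo', 'log', 'ogy']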
Def: Byte Pair Encoding

Byte Pair Encoding (BPE) is the following algorithm for compressing a string:
• Find the most frequent character pair
• Replace it by a new symbol, memorise this mapping
• Iterate until the most frequent pair appears only once

    “eat a banana, ana!”      most frequent pair: “an”
    “an” -> Z
    “eat a bZZa, Za!”
    “Za” -> Y
    “eat a bZY, Y!”

Word Piece Encoding is a variant of BPE.
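
A hedged sketch of the BPE compression loop described above (the pool of
fresh symbols is illustrative):

    from collections import Counter

    def bpe(text, symbols="ZYXWVU"):
        merges = []
        for symbol in symbols:
            pairs = Counter(text[i:i+2] for i in range(len(text) - 1))
            pair, count = pairs.most_common(1)[0]
            if count <= 1:                      # the most frequent pair appears only once
                break
            merges.append((pair, symbol))       # memorise the mapping
            text = text.replace(pair, symbol)   # replace it by the new symbol
        return text, merges

    print(bpe("eat a banana, ana!"))
    # ('eat a bZY, Y!', [('an', 'Z'), ('Za', 'Y')])
    # (ties such as "an" vs. "na" are broken arbitrarily)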
Def: BPE for Embeddings

Byte Pair Encodings can be used for embeddings as follows:
• run BPE on the training corpus (usually not across word boundaries)
• to embed a new word, sub‐word tokenize it into the BPE symbols
• embed each token, and average the resulting vectors

Advantages:
• can deal with out‐of‐vocabulary words
• good vocabulary size
• naturally handles frequent morphemes
Disadvantages:
• expensive to tokenize

    “eat a banana, ana!” -> “eat a bZY, Y!”    with “an” -> Z, “Za” -> Y
    “ananas” -> [“ana”, “nas”]
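
A hedged sketch of embedding an out‐of‐vocabulary word: apply the learned
merges and average the subword vectors (the merge table and the subword
embeddings are made up):

    import numpy as np

    merges = [("an", "Z"), ("Za", "Y")]              # learned BPE merges, in order
    subword_emb = {"Z": np.array([0.2, 0.5]),        # embeddings of the BPE symbols
                   "Y": np.array([0.1, 0.9]),
                   "a": np.array([0.7, 0.3]),
                   "s": np.array([0.4, 0.4])}

    def tokenize(word, merges):
        for pair, symbol in merges:                  # apply the merges in order
            word = word.replace(pair, symbol)
        return list(word)                            # the remaining symbols are the tokens

    def embed(word):
        return np.mean([subword_emb[t] for t in tokenize(word, merges)], axis=0)

    print(tokenize("ananas", merges))   # ['Z', 'Y', 's'], i.e., the subwords "an", "ana", "s"
    print(embed("ananas"))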
Character‐wise embeddings

To treat out‐of‐vocabulary words, and to take into account character
features, one can embed the characters instead of (or in addition to)
the word embeddings: by a CNN.

[Figure: the characters of “Russia” are one‐hot encoded, embedded, and
fed through a CNN.]
Character‐wise embeddings

To treat out‐of‐vocabulary words, and to take into account character
features, one can embed the characters instead of (or in addition to)
the word embeddings: by a CNN or an RNN.

[Figure: the characters of “Russia” are one‐hot encoded, embedded, and fed
through LSTM nodes; the outputs are concatenated into a single vector for
the word.]
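
A hedged PyTorch sketch of a character-level embedder with an LSTM (the
character set, the dimensions, and the bidirectionality are illustrative choices):

    import torch
    import torch.nn as nn

    chars = "abcdefghijklmnopqrstuvwxyz"
    char_to_id = {c: i for i, c in enumerate(chars)}

    char_emb = nn.Embedding(len(chars), 16)               # character embeddings
    lstm = nn.LSTM(input_size=16, hidden_size=32,
                   batch_first=True, bidirectional=True)

    def embed_word(word):
        ids = torch.tensor([[char_to_id[c] for c in word.lower()]])  # (1, len)
        vectors = char_emb(ids)                                      # (1, len, 16)
        _, (h, _) = lstm(vectors)                                    # h: (2, 1, 32)
        return torch.cat([h[0, 0], h[1, 0]])                         # concatenate both directions

    print(embed_word("Russia").shape)   # torch.Size([64])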
Other Embedding Methods

Besides word2vec, WordPiece, and BPE, there exist several other
methods to embed words:
- FastText: developed by Facebook, very fast, can cope with
  out‐of‐vocabulary words
- GloVe: developed by Stanford
- ELMo: developed by Allen AI, produces context‐sensitive representations
  with different vectors for different meanings of the same word
- approaches that use background knowledge, e.g., from a knowledge base
Overview

• Word2vec
• Other word embeddings
• Knowledge Base embeddings
Knowledge base

A knowledge base (KB) is a labeled multi‐graph, where the nodes are
entities and the edges are relationships (see lecture on KBs).

[Figure: a small KB with the edges Elvis --spouse--> Priscilla,
Elvis --child--> Lisa, and Priscilla --child--> Lisa.]

Goal: map entities and relations to vectors v(·), so that we have:

    Elvis --spouse--> Priscilla    =>    v(Elvis) + v(spouse) = v(Priscilla)
KB Embedding

A KB embedding is a function v(·) from entities and relations to vectors,
such that v(s)+v(r) is close to v(o) for a KB fact 〈s, r, o〉, where “close” can be
• close in terms of vector distance: v(s)+v(r) ≈ v(o)
• close in terms of angular distance: the angle between v(s)+v(r) and v(o)
  is ≈ 0, i.e., cos(v(s)+v(r), v(o)) ≈ 1

A TransE network is a neural network that computes KB embeddings
such that ||v(s)+v(r)-v(o)|| is minimized, as we will now see.

    Elvis --spouse--> Priscilla    =>    v(Elvis) + v(spouse) = v(Priscilla)
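
A hedged sketch of the TransE closeness score ||v(s)+v(r)-v(o)|| with
invented 2-dimensional embeddings (real embeddings are learned, as the
following slides show):

    import numpy as np

    entity_emb   = {"Elvis":     np.array([0.1, 0.2]),
                    "Priscilla": np.array([0.6, 0.9])}
    relation_emb = {"spouse":    np.array([0.5, 0.7])}

    def transe_score(s, r, o):
        return np.linalg.norm(entity_emb[s] + relation_emb[r] - entity_emb[o])

    print(transe_score("Elvis", "spouse", "Priscilla"))   # 0.0 for this true fact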
TransE: Intuition

[Figure: the one‐hot encodings of the subject, predicate, and object of a
fact from the KB, Elvis = 〈0,0,...,1,...,0〉, spouse = 〈0,...,1,...,0〉,
Priscilla = 〈0,0,...,1,...,0〉, pass through a dimensionality reduction that
yields the embeddings v(Elvis), v(spouse), v(Priscilla). An addition layer
and a subtraction layer then compute v(Elvis)+v(spouse)-v(Priscilla), the
last layer computes the vector length ||·||, and the network is trained so
that ||v(Elvis)+v(spouse)-v(Priscilla)|| = 0.]
TransE: From one‐hot to embeddings

[Figure: the one‐hot encodings Elvis = 〈0,0,...,1,...,0〉,
spouse = 〈0,...,1,...,0〉, Priscilla = 〈0,0,...,1,...,0〉 are mapped to
v(Elvis), v(spouse), v(Priscilla). The same weight matrix is used
for subject and object!]

The first layer has 3 fully connected parts, whose sizes are given by
- the number of entities
- the number of relations
- the dimension of the embedding
... and it has no activation function.
TransE: Adding subject and relation

The second layer adds the embedding of the subject and of the relation,
with the help of a fixed weight matrix and no activation function.

[Figure: each component of v(Elvis) and the corresponding component of
v(spouse) are connected with weight 1 to one output unit; with no
activation function, this layer outputs v(Elvis) + v(spouse).]
TransE: Subtracting the object

The third layer computes the vector difference in a similar way: the fixed
weights are +1 for the components of v(Elvis)+v(spouse) and -1 for the
components of v(Priscilla), so the layer outputs
v(Elvis)+v(spouse)-v(Priscilla).

Now we want to compute the length of the difference vector.
TransE: Computing the vector length

[Figure: after the addition and subtraction layers seen before, two more
layers compute the vector length ||·||: in the first, the weight matrix
passes on the values and the activation function squares them; in the
second, the weight matrix adds the values and the activation function
computes the square root.]
TransE: Training

[Figure: the complete network: one‐hot encodings -> embeddings ->
addition -> subtraction -> vector length ||·||.]

Train the network with facts from the KB as input, and 0 as output
=> the first layer of the network computes embeddings v(·)
such that v(subj) + v(rel) ≈ v(obj).
(We also need negative facts, see later.)
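
In practice, TransE-style models are usually trained with a margin loss over
positive and negative facts rather than with a literal 0 target; a hedged
PyTorch sketch (sizes and hyperparameters are illustrative):

    import torch
    import torch.nn as nn

    n_entities, n_relations, dim = 100, 10, 50
    ent = nn.Embedding(n_entities, dim)
    rel = nn.Embedding(n_relations, dim)
    optimizer = torch.optim.Adam(list(ent.parameters()) + list(rel.parameters()), lr=0.01)

    def score(s, r, o):
        # ||v(s) + v(r) - v(o)||, small for true facts
        return torch.norm(ent(s) + rel(r) - ent(o), dim=-1)

    def training_step(pos, neg, margin=1.0):
        # pos and neg are (s, r, o) index tensors; push true facts below false ones
        loss = torch.relu(margin + score(*pos) - score(*neg)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # One illustrative step with a KB fact and a corrupted (negative) fact:
    pos = (torch.tensor([0]), torch.tensor([1]), torch.tensor([2]))
    neg = (torch.tensor([0]), torch.tensor([1]), torch.tensor([7]))
    print(training_step(pos, neg))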
TransE: Embeddings

We cannot train the network directly with embeddings as output,
because we do not know the ideal embeddings. Rather, we train
the network on a task with known data, and compute the embeddings
as an intermediate step.
TransE: Embeddings

The network computes its final output ||v(s)+v(r)-v(o)|| ≈ 0 based solely
on the embeddings => the embeddings are meaningful.
Negative sampling

We want the network to return 0 for true facts: not just for the
true facts that are in the KB, but also for those true facts that are not
in the KB. We want the network to return ≠ 0 for “false facts”.
=> we need “false facts” for training (because otherwise
the network will just always return 0).

Negative sampling means producing facts that are sure to be
false in the real world, e.g., by taking a fact 〈s, r, o〉 from the KB
and generating a fact 〈s, r, o'〉 that is not in the KB.

Difficulty:
• the KB is incomplete, i.e., not all facts that are not in the KB are false
• if we use facts that are true in the real world as false facts, the
  network will not generalize to facts that are not in the KB
• the above strategy resembles the PCA (Partial Completeness Assumption),
  but does not work for subjects
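
A minimal sketch of negative sampling by corrupting the object of a KB fact
(the toy KB is illustrative):

    import random

    kb = {("Elvis", "spouse", "Priscilla"),
          ("Elvis", "child", "Lisa"),
          ("Priscilla", "child", "Lisa")}
    entities = {e for (s, _, o) in kb for e in (s, o)}

    def negative_sample(fact, kb, entities):
        s, r, o = fact
        # keep only corruptions <s, r, o'> that are not in the KB
        candidates = [e for e in entities if e != o and (s, r, e) not in kb]
        return (s, r, random.choice(candidates))

    print(negative_sample(("Elvis", "spouse", "Priscilla"), kb, entities))
    # e.g. ('Elvis', 'spouse', 'Lisa') -- guaranteed to be absent from the KB,
    # but not guaranteed to be false in the real world (the difficulty above)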
Link prediction

Link prediction means predicting facts that are not in the KB but true
in the real world (or predicting facts that are in a portion of the KB that
the network has not seen during training).

Link prediction can be done by taking 〈s, r, o〉 from the KB, trying out all
facts 〈s, r, o'〉, and keeping those facts for which the network computes
values close to 0.

Link prediction is similar to fact prediction by rules, just that
• the rules are not explicit
• evidence from several “rules” is automatically combined
• the two research communities are largely disjoint
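
A hedged sketch of link prediction with trained TransE-style embeddings (the
vectors are invented): score every candidate object for 〈s, r, ?〉 and keep
those whose score is close to 0:

    import numpy as np

    entity_emb   = {"Elvis":     np.array([0.1, 0.2]),
                    "Priscilla": np.array([0.6, 0.9]),
                    "Lisa":      np.array([0.9, 0.1])}
    relation_emb = {"spouse":    np.array([0.5, 0.7])}

    def score(s, r, o):
        return np.linalg.norm(entity_emb[s] + relation_emb[r] - entity_emb[o])

    def predict_objects(s, r, threshold=0.3):
        candidates = [(score(s, r, o), o) for o in entity_emb if o != s]
        return [o for sc, o in sorted(candidates) if sc < threshold]

    print(predict_objects("Elvis", "spouse"))   # ['Priscilla']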
Evaluating Link Prediction

KB embeddings are usually evaluated under the Closed World Assumption:
A predicted link that does not appear in the KB is considered wrong.

[Figure: the original KB contains Elvis --spouse--> Priscilla and the child
edges to Lisa; the test KB lacks one child edge. The model predicts the
missing child edge, which is counted as correct, and a child edge to
Navarone, which is counted as incorrect (even though it is true in reality).]
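
A minimal sketch of this evaluation: a predicted link counts as correct only
if it appears in the original KB (the facts and predictions are illustrative):

    original_kb = {("Elvis", "spouse", "Priscilla"),
                   ("Elvis", "child", "Lisa"),
                   ("Priscilla", "child", "Lisa")}

    predictions = [("Priscilla", "child", "Lisa"),      # in the KB -> counted as correct
                   ("Priscilla", "child", "Navarone")]  # true in reality, still counted as incorrect

    precision = sum(p in original_kb for p in predictions) / len(predictions)
    print(precision)   # 0.5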
Evaluation Issues: Benchmarks

Some benchmarks for the evaluation of link prediction have problems:
• Data leakage: prediction is trivial from the test data
  - Inverse Relations: isChild(x, y) ⇒ isParent(y, x)
  - Duplicate Relations: hasNationality(x, y) ⇒ hasCitizenship(x, y)
  - Cartesian Product: ∀ x, y ∈ dogs: sameSpeciesAs(x, y)
• Sparsity Issues: link prediction performs better for frequent entities
  and relations. Thus, scores may be misleading if there are many such entities.
Evaluation Issues: Protocols

The evaluation protocols for Link Prediction also have shortcomings:
• Link Prediction is evaluated under the closed world assumption.
• The network might give the same score to positive and negative triples.
• Older models can outperform newer models with fine-tuning.
• Representations fail to capture fine-grained entity semantics, i.e.,
  embeddings are unable to capture the semantics of entities beyond
  the top-level classes (person, organization, place, etc.).
• The evaluation protocol typically predicts either subject or object,
  not both. In fact, it may give a high rank to many false triples of
  the form r(?, ?).
Advanced networks for embeddings

TransE does not work well if a relation has several objects for a subject
=> they all end up in the same spot in the vector space.

Idea: Map each embedding v(x) to a new vector v_r(x) that is specific
to the relation r, and impose ||v_r(s) + v(r) - v_r(o)|| ≈ 0.
This mapping is linear, and can hence be achieved by an additional
layer. Networks that implement such mappings are TransH, TransR, TransD.

Alternatively, we can also maximize cos(v(s)+v(r), v(o)).
Networks that implement this measure are RESCAL, DistMult, HolE,
ComplEx, ANALOGY.

Deeper models add more “pre‐processing” layers that extract features
from the embeddings.
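
As one concrete instance, TransH projects each entity embedding onto a
relation-specific hyperplane before applying the translation; a hedged
sketch with invented vectors (the hyperplane normals are learned together
with the embeddings):

    import numpy as np

    def project(v, w_r):
        # project v onto the hyperplane orthogonal to the unit normal w_r
        return v - (v @ w_r) * w_r

    def transh_score(v_s, v_r, v_o, w_r):
        return np.linalg.norm(project(v_s, w_r) + v_r - project(v_o, w_r))

    w_r = np.array([1.0, 0.0])                # unit normal of the relation's hyperplane
    print(transh_score(np.array([0.3, 0.2]),  # v(s)
                       np.array([0.0, 0.5]),  # v(r)
                       np.array([0.9, 0.7]),  # v(o)
                       w_r))                  # 0.0: s and o differ only along w_r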
Summary: Embeddings

An embedding is a mapping of objects to vectors, so that
objects that are similar are mapped to vectors that are similar.

[Figure: the parallel “same difference” arrows between man→woman and uncle→aunt.]