Deep Learning:
Embeddings
CC-BY
Fabian M. Suchanek
Overview
2
->Deep-learning
• Word2vec
• Other word embeddings
• Knowledge Base embeddings
One‐hot encodings
3
Given an ordered set of n words (a vocabulary), the one‐hot encoding of the i-th word
is a vector of length n that contains only zeroes and a single 1 at position i.
Vocabulary: Ω = { king, queen, woman, princess, ...}

The word “king”:  〈1, 0, 0, 0, ...〉
The word “woman”: 〈0, 0, 1, 0, ...〉

Watch out: “one hot” is not to be confused with “hot one” when you do a Web search!
Advantage: We can now feed words to a neural network
Disadvantage: Similar words are not similar in the vector space
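A minimal sketch of such an encoding in Python (assuming NumPy and the toy vocabulary above):

import numpy as np

vocabulary = ["king", "queen", "woman", "princess"]    # toy vocabulary Ω

def one_hot(word):
    # a vector of length |Ω| with a single 1 at the word's position
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(word)] = 1.0
    return vec

print(one_hot("king"))     # [1. 0. 0. 0.]
print(one_hot("woman"))    # [0. 0. 1. 0.]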
Word embeddings
4
A word embedding for a vocabulary Ω is a mapping from Ω to ℝᵐ (with m < |Ω|)
— usually so that two vectors are similar iff their words are similar.
“Dimensions”:           Royalty   Femininity   Masculinity   Age   ...
The word “king”:           1          0             1        0.6   ...
The word “woman”:         0.1         1             0        0.6   ...
The word “princess”:      0.9        0.8            0        0.2   ...

How can we find such an embedding?
[The morning paper: The amazing power of word embeddings]
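A minimal sketch of how the similarity of two embedded words can be measured, using made-up vectors along the intuitive dimensions above and cosine similarity (an illustration, not how real embeddings are obtained):

import numpy as np

# hypothetical embeddings along the dimensions Royalty, Femininity, Masculinity, Age
embedding = {
    "king":     np.array([1.0, 0.0, 1.0, 0.6]),
    "woman":    np.array([0.1, 1.0, 0.0, 0.6]),
    "princess": np.array([0.9, 0.8, 0.0, 0.2]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# "princess" is more similar to "woman" than "king" is
print(cosine(embedding["princess"], embedding["woman"]))
print(cosine(embedding["king"], embedding["woman"]))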
>details
Def: Word2vec
Word2vec is a group of methods to produce word embeddings. One of them is the
Continuous Bag Of Words (CBOW) method, in which an FFNN (feed-forward neural network)
is trained to predict the middle word of a sliding window over a corpus.
[Figure: a sliding window over the sentence “Elvis Presley plays guitar well”]
>details
Def: Word2vec
Vocabulary size: v
Hidden layer size: n < v
Window size: 5
[Figure: the sliding window over “Elvis Presley plays guitar well”; the four context words are fed into the network]
>details
The context words are fed in as one‐hot encodings (size v),
using the same n×v weight matrix W for all inputs.
The hidden layer has size n.
The network is trained to predict the one‐hot encoding of the middle word
(a vector of size v) via a v×n output weight matrix.
[Figure: the CBOW network with one‐hot inputs, hidden layer, and output layer]
Word2vec
7
The first weight matrix yields the word embedding:

    embedding(word) = W × onehot(word)

Since W is an n×v matrix and onehot(word) contains a single 1,
the embedding of a word is a single column of W
(e.g., “Presley” = 〈0,0,1,...〉 selects the third column of W).
The same n×v weight matrix W is used for all inputs.
[Figure: W as an n×v matrix, with the column selected by onehot(“Presley”) highlighted]
Rationale: the activation of the hidden layer allows for the reconstruction of the middle word.
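A minimal CBOW sketch in PyTorch (an illustrative re-implementation, not the original word2vec code); the embedding layer plays the role of W, so looking up a word id is the same as multiplying W with the word's one-hot vector:

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):             # vocab_size = v, embed_dim = n
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # plays the role of W
        self.out = nn.Linear(embed_dim, vocab_size)         # predicts the middle word

    def forward(self, context_ids):                         # context_ids: (batch, window_size - 1)
        hidden = self.embed(context_ids).mean(dim=1)        # average the context embeddings
        return self.out(hidden)                             # scores over the vocabulary

model = CBOW(vocab_size=10000, embed_dim=100)
loss_fn = nn.CrossEntropyLoss()                             # train against the middle word's id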
Word2vec applications
Word2vec word embeddings have some impressive properties
(illustrated here in a putative 2-dimensional PCA projection):
[Figure: the word pairs “man”/“woman”, “uncle”/“aunt”, “king”/“queen”, and “kings”/“queens”;
corresponding pairs are separated by roughly the same difference vector]
[The morning paper: The amazing power of word embeddings]

    king - man + woman = queen
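A minimal sketch of such analogy arithmetic by nearest-neighbour search over a dictionary of pre-trained vectors (the variable vectors is hypothetical; libraries such as gensim offer a similar most_similar function):

import numpy as np

def analogy(a, b, c, vectors):
    # return the word whose vector is closest to vectors[b] - vectors[a] + vectors[c]
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):                 # exclude the query words themselves
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# analogy("man", "king", "woman", vectors) should return "queen" with good vectors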
Word2vec applications
Word2vec word embeddings have some impressive properties
(illustrated here in a putative 2-dimensional PCA projection):
[Figure: the pairs “Paris”/“France” and “Berlin”/“Germany” are separated by the same difference vector]
[The morning paper: The amazing power of word embeddings]
Other examples of learned relationships:
- big-bigger
- Einstein-scientist
- Macron-France
- copper-Cu
- Berlusconi-Silvio
- Microsoft-Windows
- Microsoft-Ballmer
- Japan-sushi
9
Word2vec applications
Word2vec word embeddings have some impressive properties
(illustrated here in a putative 2-dimensional PCA projection):
[Figure: “Vietnam”, “capital”, and “Hanoi” in the projection]
[The morning paper: The amazing power of word embeddings]

    Vietnam + capital = Hanoi
Problem: word2vec cannot deal with unknown words.
10
->end
->glove etc.
Overview
11
• Word2vec
• Other word embeddings
• Knowledge Base embeddings
->Deep-learning
->end
->glove etc.
Sub‐word Tokenization
12
The central problem of word2vec is that it cannot deal with words that were not seen in training.
Sub‐word tokenization
is the process of tokenizing a text not into words, but into parts of words.
In the ideal case, the resulting tokens mirror the morphemes, i.e., the parts of a word that carry meaning.
A word can then be embedded by embedding its constituents, and averaging the resulting vectors.
Advantages:
• out‐of‐vocabulary words can be handled
• morphemes are captured
“I tokenize subwords” -> [“I”, “token-”, “-ize”, “sub-”, “-word-”, “-s”]
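A minimal sketch of the averaging step (subword_tokenize and subword_vectors are hypothetical placeholders for a real tokenizer and its token embeddings):

import numpy as np

def embed_word(word, subword_tokenize, subword_vectors):
    # embed a word as the average of the vectors of its sub-word tokens
    tokens = subword_tokenize(word)        # e.g. "tokenize" -> ["token-", "-ize"]
    return np.mean([subword_vectors[t] for t in tokens], axis=0)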
->end
N‐gram embeddings
13
An
n ‐gram
in a string is a substring of length n .
n ‐grams can be used for sub‐word tokenization and embedding.
Advantages:
• deals with out‐of‐vocabulary words
• can capture morphemes if they are of length n
Disadvantages:
• the vocabulary size is huge
n ‐gram embeddings are used in FastText.
“epistemology” -> {“epi”, “pis”, “ist”, “ste”, ...}
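A minimal sketch of character n-gram extraction (FastText additionally pads each word with boundary markers such as “<” and “>”, omitted here):

def char_ngrams(word, n=3):
    # all contiguous character n-grams of the word
    return {word[i:i + n] for i in range(len(word) - n + 1)}

print(char_ngrams("epistemology"))    # contains 'epi', 'pis', 'ist', 'ste', ...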
->end
Def: Byte Pair Encoding
14
Byte Pair Encoding
(BPE) is the following algorithm for compressing a string:
• Find the most frequent character pair
• Replace it by a new symbol, memorise this mapping
• Iterate until the most frequent pair appears only once
“eat a banana, ana!”     most frequent pair: “an”, so “an” -> Z
“eat a bZZa, Za!”        most frequent pair: “Za”, so “Za” -> Y
“eat a bZY, Y!”
WordPiece encoding is a variant of BPE.
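A minimal sketch of the compression loop on a plain string (real BPE implementations work on byte or token sequences and usually stop after a fixed number of merges):

from collections import Counter

def bpe_compress(text, new_symbols="ZYXWVU"):
    # repeatedly replace the most frequent character pair by a new symbol
    merges = {}
    for symbol in new_symbols:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        pair, count = pairs.most_common(1)[0]
        if count <= 1:                     # most frequent pair appears only once: stop
            break
        text = text.replace(pair, symbol)
        merges[pair] = symbol              # memorise the mapping
    return text, merges

print(bpe_compress("eat a banana, ana!"))  # ('eat a bZY, Y!', {'an': 'Z', 'Za': 'Y'})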
->end
Def: BPE for Embeddings
15
Byte Pair Encodings can be used for embeddings as follows:
• run BPE on the training corpus (usually not across word boundaries)
• to embed a new word, sub‐word tokenize it into the BPE symbols
• embed each token, and average the resulting vectors
Advantages:
• can deal with out‐of‐vocabulary words
• good vocabulary size
• naturally handles frequent morphemes
Disadvantages:
• expensive to tokenize
“eat a banana, ana!” -> “eat a bZY, Y”
with “an” -> Z, “Za” -> Y
“ananas” -> [“ana”, “nas”]
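A minimal sketch of the last two steps, with a greedy longest-match tokenizer (a simplification; bpe_symbols and symbol_vectors are hypothetical: the learned BPE symbols and their trained embeddings):

import numpy as np

def bpe_tokenize(word, bpe_symbols):
    # greedily split the word into the longest known BPE symbols
    tokens, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):    # try the longest match first
            if length == 1 or word[i:i + length] in bpe_symbols:
                tokens.append(word[i:i + length])
                i += length
                break
    return tokens

def embed(word, bpe_symbols, symbol_vectors):
    tokens = bpe_tokenize(word, bpe_symbols)          # e.g. "ananas" -> ["ana", "nas"]
    return np.mean([symbol_vectors[t] for t in tokens], axis=0)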
->NERC by Deep-learning
>char
->end
Character‐wise embeddings
16
To treat out‐of‐vocabulary words, and to take into account character features, one can embed
the characters instead of (or in addition to) the word embeddings: by a CNN.
[Figure: the characters of the word “embeds” are one‐hot encoded, embedded,
and combined by a CNN into a single vector]
->Architectures
->NERC by Deep-learning
->end
Character‐wise embeddings
17
To treat out‐of‐vocabulary words, and to take into account character features, one can embed
the characters instead of (or in addition to) the word embeddings: by a CNN or an RNN.
[Figure: the characters of the word “embeds” are one‐hot encoded, embedded,
fed through LSTM nodes, and the outputs are concatenated into a single vector]
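A minimal sketch of a character-level CNN embedder in PyTorch (an illustration of the idea, not the exact architecture in the figure):

import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    def __init__(self, num_chars, char_dim=16, word_dim=64, kernel_size=3):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)       # embed each character
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size, padding=1)

    def forward(self, char_ids):              # char_ids: (batch, word_length)
        x = self.char_embed(char_ids)         # (batch, word_length, char_dim)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, length)
        x = torch.relu(self.conv(x))          # (batch, word_dim, word_length)
        return x.max(dim=2).values            # max-pool over characters: one vector per word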
->Architectures
->NERC by Deep-learning
->end
Other Embedding Methods
18
Besides word2vec, WordPiece, and BPE, there exist several other methods to embed words:
- FastText: developed by Facebook, very fast, can cope with out‐of‐vocabulary words
- GloVe: developed by Stanford
- ELMo: developed by Allen AI, produces context‐sensitive representations
  with different vectors for different meanings of the same word
- approaches that use background knowledge, e.g., from a knowledge base
->Architectures
->NERC by Deep-learning
->end
19
->Deep-learning
->NERC by Deep-learning
Overview
• Word2vec
• Other word embeddings
• Knowledge Base embeddings
Knowledge base
A
knowledge base
(KB) is a labeled multi‐graph, where the nodes are
entities and the edges are relationships (see
lecture on KBs
).
20
[Figure: a KB with the facts 〈Elvis, spouse, Priscilla〉, 〈Elvis, child, Lisa〉, and 〈Priscilla, child, Lisa〉]

Goal: map entities and relations to vectors v(·), so that we have, e.g.:

    v(Elvis) + v(spouse) = v(Priscilla)

>details
21
KB Embedding
A KB embedding is a function v(·) from entities and relations to vectors, such that
v(s)+v(r) is close to v(o) for a KB fact 〈s,r,o〉, where “close” can be
• close in terms of vector distance: v(s)+v(r) ≈ v(o)
• close in terms of angular distance: cos(v(s)+v(r), v(o)) ≈ 1

A TransE network is a neural network that computes KB embeddings
such that ||v(s)+v(r)-v(o)|| is minimized — as we will now see.
[Figure: v(Elvis) + v(spouse) = v(Priscilla)]
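A minimal TransE-style sketch in PyTorch (illustrative; s, r, o are tensors of entity/relation ids):

import torch
import torch.nn as nn

class TransE(nn.Module):
    def __init__(self, num_entities, num_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)     # v(·) for entities
        self.rel = nn.Embedding(num_relations, dim)    # v(·) for relations

    def score(self, s, r, o):
        # ||v(s) + v(r) - v(o)||: close to 0 for true facts
        return torch.norm(self.ent(s) + self.rel(r) - self.ent(o), dim=-1)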
>details
22
TransE: Intuition
The one‐hot encodings of subject, predicate, and object of a fact from the KB
    Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉
are reduced in dimensionality to v(Elvis), v(spouse), v(Priscilla),
and we train so that ||v(Elvis) + v(spouse) - v(Priscilla)|| = 0.
23
TransE: Intuition
[Figure: the network takes the one‐hot encodings of Elvis, spouse, and Priscilla,
reduces their dimensionality to v(Elvis), v(spouse), v(Priscilla),
adds v(Elvis) and v(spouse), subtracts v(Priscilla),
and computes the vector length ||·||, which is trained to be 0:
||v(Elvis) + v(spouse) - v(Priscilla)|| = 0]
>details
24
TransE: From one‐hot to embeddings
One‐hot encoding of subject, predicate, object of a fact from the KB:
    Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

The first layer has 3 fully connected parts, whose sizes are determined by
- the number of entities
- the number of relations
- v, the dimension of the embedding
... and it has no activation function.
Same weight matrix for subject and object!
This layer computes v(Elvis), v(spouse), v(Priscilla).
>details
25
TransE: Adding subject and relation
Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

The second layer adds the embedding of the subject and of the relation,
with the help of a fixed weight matrix and no activation function:
all weights are 1, so this layer performs addition and outputs v(Elvis) + v(spouse).
>details
26
TransE: Subtracting the object
Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

The third layer computes the vector difference in a similar way:
weights of 1 pass on v(Elvis) + v(spouse), and weights of -1 subtract v(Priscilla),
yielding v(Elvis) + v(spouse) - v(Priscilla).
Now we want to compute the length of this difference vector.
>details
27
TransE: Computing the vector length
Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

After the addition and subtraction, as seen before, two more layers compute the vector length ||·||:
• a layer whose weight matrix passes on the values, and whose activation function squares them
• a layer whose weight matrix adds the values, and whose activation function computes the square root
>details
28
TransE: Training
Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

Train the network with facts from the KB as input, and 0 as output
=> the first layer of the network computes embeddings v(·)
such that v(subj) + v(rel) ≈ v(obj).
(We also need negative facts, see later.)
29
TransE: Embeddings
Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

We cannot train the network directly with embeddings as output, because we do not know
the ideal embeddings. Rather, we train the network on a task with known data,
and compute the embeddings as an intermediate step.
>details
30
TransE: Embeddings
Elvis = 〈0,0,...,1,...,0〉    spouse = 〈0,...,1,...,0〉    Priscilla = 〈0,0,...,1,...,0〉

The network computes the final output ||v(s)+v(r)-v(o)|| ≈ 0 based solely on the embeddings
=> the embeddings are meaningful.
>details
Negative sampling
We want the network to return 0 for true facts — and not just for the true facts that are in the KB,
but also for those true facts that are not in the KB. We want the network to return ≠ 0 for “false facts”.
=> we need “false facts” for training (because otherwise the network will just always return 0).
Negative sampling
means producing facts that are sure to be
false in the real world, e.g., by taking a fact 〈s, r, o〉 from the KB
and generating a fact 〈s, r, o'〉 that is not in the KB.
Difficulty:
• the KB is incomplete, i.e., not all facts that are not in the KB are false
• if we use facts that are true in the real world as false facts,
  the network will not generalize to facts that are not in the KB
• the above strategy resembles the PCA (Partial Completeness Assumption), but does not work for subjects
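A minimal sketch of object-corruption negative sampling (illustrative; kb is assumed to be a set of (s, r, o) id triples):

import random

def negative_sample(fact, kb, entity_ids):
    # corrupt the object of a KB fact to obtain a (presumably) false fact
    s, r, o = fact
    while True:
        o_new = random.choice(entity_ids)
        if (s, r, o_new) not in kb:        # not in the KB, and hopefully false in reality
            return (s, r, o_new)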
>details
32
Link prediction
Link prediction
means predicting facts that are not in the KB but true in the real world
(or predicting facts that are in a portion of the KB that the network has not seen during training).
Link prediction can be done by taking 〈s,r,o〉 from the KB, trying out all facts 〈s,r,o'〉 ,
and keeping those facts for which the network computes values close to 0.
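A minimal sketch of this procedure with the TransE sketch from above (illustrative; small scores mean likely facts):

import torch

def predict_objects(model, s, r, num_entities, k=10):
    # score 〈s, r, o'〉 for every entity o' and return the k best-scoring objects
    s_ids = torch.full((num_entities,), s, dtype=torch.long)   # repeat the subject id
    r_ids = torch.full((num_entities,), r, dtype=torch.long)   # repeat the relation id
    o_ids = torch.arange(num_entities)                         # every candidate object
    scores = model.score(s_ids, r_ids, o_ids)                  # ||v(s)+v(r)-v(o')|| for each o'
    return torch.topk(-scores, k).indices                      # the k smallest scores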
Link prediction is similar to fact prediction by rules, just that
• the rules are not explicit
• evidence from several “rules” is automatically combined
• the two research communities are largely disjoint
>details
33
Evaluating Link Prediction
KB embeddings are usually evaluated under the Closed World Assumption:
A predicted link that does not appear in the KB is considered wrong.
Original KB:
[Figure: 〈Elvis, spouse, Priscilla〉, 〈Elvis, child, Lisa〉, 〈Priscilla, child, Lisa〉]
Test KB:
[Figure: the same KB with one of the child facts removed]
34
Evaluating Link Prediction
KB embeddings are usually evaluated under the Closed World Assumption:
A predicted link that does not appear in the KB is considered wrong.
Original KB:
[Figure: 〈Elvis, spouse, Priscilla〉, 〈Elvis, child, Lisa〉, 〈Priscilla, child, Lisa〉]
Test KB:
[Figure: the same KB with one of the child facts removed]
Predictions:
[Figure: the Test KB, plus the predicted missing child fact and an additional predicted child fact for Navarone]
35
Evaluating Link Prediction
KB embeddings are usually evaluated under the Closed World Assumption:
A predicted link that does not appear in the KB is considered wrong.
Original KB:
[Figure: 〈Elvis, spouse, Priscilla〉, 〈Elvis, child, Lisa〉, 〈Priscilla, child, Lisa〉]
Test KB:
[Figure: the same KB with one of the child facts removed]
Predictions:
[Figure: the predicted missing child fact is counted as correct;
the predicted child fact for Navarone is counted as incorrect (even though it is true in reality)]
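A minimal sketch of this counting rule (illustrative; predictions and the reference KB are sets of (s, r, o) triples):

def cwa_precision(predictions, reference_kb):
    # under the Closed World Assumption, a prediction is correct iff it appears in the reference KB
    correct = sum(1 for fact in predictions if fact in reference_kb)
    return correct / len(predictions)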
>details
36
Evaluation Issues: Benchmarks
Some benchmarks for the evaluation of link prediction have problems:
• Data leakage: prediction is trivial from the test data
  - Inverse Relations: isChild(x, y) ⇒ isParent(y, x)
  - Duplicate Relations: hasNationality(x, y) ⇒ hasCitizenship(x, y)
  - Cartesian Product: ∀ x, y ∈ dogs: sameSpeciesAs(x, y)
• Sparsity Issues: link prediction performs better for frequent entities and relations.
  Thus, scores may be misleading if there are many such entities.
37
Evaluation Issues: Protocols
The evaluation protocols for Link Prediction also have shortcomings:
• Link Prediction is evaluated under the closed world assumption.
• The network might give the same score to positive and negative triples.
• With fine-tuning, older models can outperform newer models.
• Representations fail to capture fine-grained entity semantics, i.e.,
  embeddings are unable to capture the semantics of entities beyond
  the top-level classes (person, organization, place, etc.).
• The evaluation protocol typically predicts either the subject or the object, not both.
  In fact, it may give a high rank to many false triples of the form r(?, ?).
>advanced
38
Advanced networks for embeddings
TransE does not work well if a relation has several objects for a subject
=> they all end up in the same spot in the vector space.

Idea: Map each embedding v(x) to a new vector v_r(x) that is specific to the relation r,
and impose v_r(s) + v(r) ≈ v_r(o).
This mapping is linear, and can hence be achieved by an additional layer.
Networks that implement such mappings are TransH, TransR, and TransD.

Alternatively, we can also minimize the angular distance between v(s)+v(r) and v(o),
i.e., maximize cos(v(s)+v(r), v(o)).
Networks that implement this measure are RESCAL, DistMult, HolE, ComplEx, and ANALOGY.

Deeper models add more “pre‐processing” layers that extract features from the embeddings.
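A minimal sketch of such a relation-specific mapping in the style of TransH, which projects entity embeddings onto a relation-specific hyperplane with unit normal w_r (illustrative NumPy code):

import numpy as np

def project(v_x, w_r):
    # project the entity embedding onto the hyperplane with unit normal w_r
    return v_x - np.dot(w_r, v_x) * w_r

def transh_score(v_s, v_r, v_o, w_r):
    # ||v_r(s) + v(r) - v_r(o)|| with relation-specific projections
    return np.linalg.norm(project(v_s, w_r) + v_r - project(v_o, w_r))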
Summary: Embeddings
39
An embedding is a mapping of objects to vectors, so that
objects that are similar are mapped to vectors that are similar.
->Architectures
[Figure: the pairs “man”/“woman” and “uncle”/“aunt” are separated by the same difference vector]
->Deep-learning
->Transformers