Deep Learning:
Transformers
CC-BY
Fabian M. Suchanek
Sequence2Sequence Architectures
Classical neural network architectures for translating one string into another (seq2seq, RNNs, LSTMs, CNNs, ...) try to cram the entire input string into a single vector before producing the translation:
Disadvantages:
• forgets long‐term dependencies
• suffers from the vanishing gradient problem
[MLexplained:Paper Dissected: “Attention is All You Need” Explained]
Transformers
A transformer is a neural network architecture that translates the input string at once into an output string. The central tool is attention, i.e., the ability to link distant parts of the input and output strings:
[MLexplained:Paper Dissected: “Attention is All You Need” Explained]
Self-Attention
In its simplest variant, self-attention is a function that takes as input an embedded word from a sentence, and that produces as output the sum of the embeddings of the other words, weighted by their similarity to the input word.
attention(x, X)  =  Σ_{x′ ∈ X}  sim(x, x′) · x′

Here x is the embedded input word, X is the input sentence (a set of vectors), the sum runs over all words x′ of the sentence, and sim(x, x′) measures how similar x′ is to x.
Self-Attention
In its simplest variant, self-attention is a function that takes as input an embedded word from a sentence, and that produces as output the sum of the embeddings of the other words, weighted by their similarity to the input word, mapped to a value space.
attention(x, X)  =  Σ_{x′ ∈ X}  sim(x, x′) · (x′·V)

As before, x is the input word and X the input sentence (a set of vectors); each word x′ is additionally mapped to the value space by the matrix V before the weighted sum.
Def: Self-Attention
In its simplest variant, self-attention is a function that takes as input an embedded word from a sentence, and that produces as output the sum of the embeddings of the other words, mapped to a key space, weighted by their similarity to the input word mapped to a query space, mapped to a value space.
attention(x, X)  =  Σ_{x′ ∈ X}  sim(x·Q, x′·K) · (x′·V)

The matrix Q maps the input word to the query space, K maps each word to the key space, and V maps each word to the value space.
Self-Attention
In its simplest variant, self-attention is a function that takes as input an embedded word from a sentence, and that produces as output the sum of the embeddings of the other words, mapped to a key space, softmax-weighted by their similarity to the input word mapped to a query space, mapped to a value space and normalized.
attention(x, X)  =  Σ_{x′ ∈ X}  softmax_{x′}( (x·Q)·(x′·K)ᵀ / √d ) · (x′·V)

where d is the dimension of the vectors.
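The following NumPy sketch computes this form of self-attention for one word. The sentence X and the matrices Q, K, V are random illustrative values, not parameters of any trained model.

```python
# A minimal NumPy sketch of scaled dot-product self-attention for a single word;
# the matrices Q, K, V and the sentence X are random stand-ins (assumptions),
# not parameters of a trained model.
import numpy as np

def self_attention(x, X, Q, K, V):
    """Self-attention for one embedded word x over the sentence X (one row per word)."""
    query = x @ Q                        # map the input word to the query space
    keys = X @ K                         # map every word to the key space
    values = X @ V                       # map every word to the value space
    d = Q.shape[1]
    scores = keys @ query / np.sqrt(d)   # similarity of every word to the input word
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the words
    return weights @ values              # weighted sum of the value vectors

rng = np.random.default_rng(0)
m, d = 4, 3                              # embedding dimension m, attention dimension d
X = rng.normal(size=(5, m))              # a sentence of 5 embedded words (one per row)
Q, K, V = (rng.normal(size=(m, d)) for _ in range(3))
print(self_attention(X[0], X, Q, K, V))  # d-dimensional output for the first word
```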
Self-Attention Layer
A self-attention layer of a neural network learns weight matrices Q, K, V. Given as input a list of input vectors x₁, ..., xₙ, stacked as a matrix X, it computes as output the attention head

attention(X·Q, X·K, X·V)  =  softmax( (X·Q)·(X·K)ᵀ / √d ) · (X·V)

Example with X = [⟨0,0,1,0⟩; ⟨0,1,0,0⟩]: the attention head contains one representation for each input word.
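A NumPy sketch of such a layer in matrix form follows; the weight matrices are random placeholders, and the input X is the slide's toy example.

```python
# A NumPy sketch of a self-attention layer in matrix form, assuming X stacks
# the n input vectors as rows; Q, K, V are random placeholder weight matrices.
import numpy as np

def softmax(M):
    E = np.exp(M - M.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention_head(X, Q, K, V):
    """Returns the n×d attention head: one d-dimensional row per input word."""
    d = Q.shape[1]
    scores = (X @ Q) @ (X @ K).T / np.sqrt(d)   # n×n matrix of similarities
    return softmax(scores) @ (X @ V)            # n×d attention head

X = np.array([[0, 0, 1, 0],                     # the slide's two example inputs
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(1)
m, d = 4, 3
Q, K, V = (rng.normal(size=(m, d)) for _ in range(3))
print(attention_head(X, Q, K, V).shape)         # (2, 3), i.e. n×d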
Multi‐Headed Attention Layer
A multi-headed attention layer of a neural network consists of a self-attention layers of dimension d, each with n m-dimensional inputs, applied in parallel. The resulting matrices are concatenated to an n×(a·d) matrix, which is multiplied by an (a·d)×m matrix O to output an n×m matrix.
Example with a=3, n=2, m=4: the input X = [⟨0,0,1,0⟩; ⟨0,1,0,0⟩] (n inputs of dimension m) passes through a self-attention layers, which produce a attention heads of size n×d; their concatenation, multiplied by O, gives the n×m result.
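The sketch below spells this out in NumPy; all weights are random placeholders, and a=3, n=2, m=4 follow the slide's example.

```python
# A NumPy sketch of a multi-headed attention layer: a heads run in parallel,
# their n×d outputs are concatenated to n×(a·d) and projected to n×m by O.
import numpy as np

def softmax(M):
    E = np.exp(M - M.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention_head(X, Q, K, V):
    d = Q.shape[1]
    return softmax((X @ Q) @ (X @ K).T / np.sqrt(d)) @ (X @ V)   # n×d

def multi_head(X, heads, O):
    """heads: a list of (Q, K, V) weight triples; O: an (a·d)×m projection matrix."""
    concat = np.concatenate([attention_head(X, Q, K, V) for Q, K, V in heads], axis=1)
    return concat @ O                                            # n×m output

rng = np.random.default_rng(2)
a, n, m, d = 3, 2, 4, 3
X = rng.normal(size=(n, m))
heads = [tuple(rng.normal(size=(m, d)) for _ in range(3)) for _ in range(a)]
O = rng.normal(size=(a * d, m))
print(multi_head(X, heads, O).shape)                             # (2, 4), i.e. n×m
```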
Encoder in Transformers
An encoder in a Transformer is a multi-headed attention layer, followed by a fully connected feed-forward network applied to each row of the output. It takes as input n m-dimensional vectors, and produces as output again n m-dimensional vectors.
Example with a=3, n=2, m=4: the input X = [⟨0,0,1,0⟩; ⟨0,1,0,0⟩] (n inputs of dimension m) passes through a self-attention layers (a attention heads of size n×d) and then through one FFNN for each row; the output has the same dimensions as the input.
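The following sketch puts the pieces together. The feed-forward hidden size and all weights are illustrative assumptions, and the residual connections and layer normalization of real encoders are left out for brevity.

```python
# A NumPy sketch of one encoder: a multi-headed attention layer whose n×m output
# is passed row by row through the same feed-forward network, so n m-dimensional
# inputs become n m-dimensional outputs. Residual connections and layer
# normalization, used in real Transformers, are omitted here.
import numpy as np

def softmax(M):
    E = np.exp(M - M.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention_head(X, Q, K, V):
    d = Q.shape[1]
    return softmax((X @ Q) @ (X @ K).T / np.sqrt(d)) @ (X @ V)

def encoder(X, heads, O, W1, b1, W2, b2):
    A = np.concatenate([attention_head(X, Q, K, V) for Q, K, V in heads], axis=1) @ O
    H = np.maximum(A @ W1 + b1, 0)        # ReLU feed-forward network, applied row by row
    return H @ W2 + b2                    # n×m: same dimensions as the input

rng = np.random.default_rng(3)
a, n, m, d, h = 3, 2, 4, 3, 8             # h: hidden size of the feed-forward network
X = rng.normal(size=(n, m))
heads = [tuple(rng.normal(size=(m, d)) for _ in range(3)) for _ in range(a)]
O = rng.normal(size=(a * d, m))
W1, b1 = rng.normal(size=(m, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, m)), np.zeros(m)
print(encoder(X, heads, O, W1, b1, W2, b2).shape)   # (2, 4), i.e. n×m
```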
Transformer
[Jay Alammar: The Illustrated Transformer]
A transformer is a neural network that consists of stacked encoders. The last encoder feeds each decoder in a stack of decoders. The output is the output of the last decoder. A decoder is a neural network similar to an encoder. The input to a transformer is a vector that combines embeddings with position codes.
Key advantages:
• parallelizable => faster training
• processes the sentence simultaneously => can capture long-range dependencies
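The slide only says that embeddings are combined with position codes. The sketch below assumes one concrete choice, the sinusoidal positional encodings of the original Transformer paper; BERT instead learns its position embeddings.

```python
# A sketch of how word embeddings and position codes can be combined, assuming
# the sinusoidal positional encodings of the original Transformer paper; other
# position codes (e.g. learned ones, as in BERT) are possible as well.
import numpy as np

def positional_encoding(n, m):
    """n×m matrix: row i encodes position i with sines and cosines."""
    pos = np.arange(n)[:, None]                        # positions 0..n-1
    i = np.arange(m)[None, :]                          # embedding dimensions 0..m-1
    angles = pos / np.power(10000, (2 * (i // 2)) / m)
    enc = np.zeros((n, m))
    enc[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cosine
    return enc

rng = np.random.default_rng(4)
n, m = 6, 8                                            # 6 words, embedding size 8
word_embeddings = rng.normal(size=(n, m))              # random stand-in embeddings
transformer_input = word_embeddings + positional_encoding(n, m)
print(transformer_input.shape)                         # (6, 8)
```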
Def: BERT
BERT (Bidirectional Encoder Representations from Transformers) is a neural network that consists of stacked encoders:

input sentence → Encoder → Encoder → Encoder → output sentence
BERT is a huge model, with 110M parameters in its smallest variant and 345M parameters in its largest. BERT was developed by Google.
BERT Input
The input to BERT is a sequence of tokens, with special tokens for the sentence start and end. For each token, we add the following vectors:
• an embedding of the token (usually: WordPiece)
• an embedding of the position of the token in the sentence
• an embedding of the number of the sentence in which the token occurs
Example: “Be alert! BERT will hurt!”

input tokens:       [CLS] [be] [alert] [SEP] [bert] [will] [hurt] [SEP]
positions:            1    2      3      4     5      6      7      8
sentence numbers:     1    1      1      1     2      2      2      2

Each token (e.g., “be”) is mapped to its token embedding, each position to a position embedding, and each sentence number to a sentence-number embedding; the sum of the three is the final input vector for that token.
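A toy NumPy sketch of this sum follows; the embedding tables are random stand-ins for BERT's learned parameters, and the dimension is kept small for readability.

```python
# A sketch of how BERT's input vectors are formed: the token embedding, the
# position embedding, and the sentence-number (segment) embedding are summed.
# The embedding tables here are random stand-ins, not BERT's real parameters.
import numpy as np

tokens = ["[CLS]", "be", "alert", "[SEP]", "bert", "will", "hurt", "[SEP]"]
positions = [1, 2, 3, 4, 5, 6, 7, 8]
sentences = [1, 1, 1, 1, 2, 2, 2, 2]

rng = np.random.default_rng(5)
m = 8                                                   # toy embedding dimension
vocab = {tok: i for i, tok in enumerate(set(tokens))}
token_emb = rng.normal(size=(len(vocab), m))            # token embedding table
pos_emb = rng.normal(size=(len(tokens) + 1, m))         # position embedding table
sent_emb = rng.normal(size=(3, m))                      # sentence-number table

final_vectors = np.stack([token_emb[vocab[t]] + pos_emb[p] + sent_emb[s]
                          for t, p, s in zip(tokens, positions, sentences)])
print(final_vectors.shape)                              # (8, 8): one vector per token
```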
Def: BERT Training
BERT is trained on large text corpora to predict randomly masked words:

“BERT, lift your [MASK]” → Encoder → Encoder → Encoder → prediction for the blank in “BERT, lift your ____”
BERT is also trained on predicting, for two input sentences, whether
one follows the other in a text or not.
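A simplified sketch of the masking step in this pre-training objective follows; real BERT masks about 15% of the tokens and applies a few refinements that are omitted here.

```python
# A simplified sketch of BERT's masking step: some tokens are replaced by [MASK]
# and the model must predict the original words. (Real BERT masks about 15% of
# the tokens and applies further refinements that are omitted here.)
import random

def mask_tokens(tokens, rate=0.15, seed=42):
    random.seed(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in ("[CLS]", "[SEP]") and random.random() < rate:
            targets[i] = tok              # the word the model must predict
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = ["[CLS]", "be", "alert", "[SEP]", "bert", "will", "hurt", "[SEP]"]
print(mask_tokens(tokens, rate=0.5))      # high rate so the toy example masks something
```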
Def: BERT Fine Tuning
The trained BERT model can be used for other tasks by (1) adding layers and (2) starting the training from the pre-trained parameters (a practice called fine-tuning).
Example: “BERT makes dirt” → Encoder → Encoder → Encoder → classifier for the output of each token → PERSON OTHER OTHER
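A sketch of such a fine-tuning setup with the Hugging Face transformers library; the model name, the two-label scheme, the toy labels, and the single training step are all illustrative assumptions.

```python
# A sketch of fine-tuning BERT for token classification (as in the slide's
# PERSON/OTHER example) with the Hugging Face transformers library; the model
# name, labels, and training details are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=2)                # added layer: e.g. PERSON vs. OTHER

inputs = tokenizer("BERT makes dirt", return_tensors="pt")
labels = torch.zeros_like(inputs["input_ids"])      # toy labels: everything OTHER (=0)

# one gradient step, starting from the pre-trained parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

predictions = model(**inputs).logits.argmax(dim=-1)  # one predicted label per token
print(predictions)
```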
Other Transformers
[Yang et al]
Summary: Transformers
Transformers are a powerful neural network architecture:
• They process the input not sequentially, but simultaneously
• BERT is a special transformer network, which is pre-trained and can be fine-tuned to a variety of tasks
• BERT achieves state-of-the-art performance in a wide variety of tasks
Backup
Key-Value mapping
A key-value mapping is based on
• a key matrix K
• a value matrix V
Given a query vector q, it computes q·Kᵀ·V.
Vocabulary: ⟨“man”, “woman”, “child”⟩, d=3
Query vector: q = ⟨0, 1, 0⟩, a d-dimensional encoding of “woman”
Key matrix: a list of n=2 keys, each of dimension d=3

    K = [ ⟨0.7, 0.0, 0.5⟩   “boy”
          ⟨0.0, 0.7, 0.5⟩   “girl” ]

Value matrix: maps keys to outputs over ⟨“child”, “female”, “male”⟩

    V = [ ⟨1, 0, 1⟩   “boy”
          ⟨1, 1, 0⟩   “girl” ]

q·Kᵀ = ⟨0.0, 0.7⟩: q is most similar to key “girl”
q·Kᵀ·V = ⟨0.7, 0.7, 0⟩: the output for “woman” over ⟨child, female, male⟩
Key-Value mapping for matrix query
Given
• a key matrix K
• a value matrix V
• a list of queries q₁, ..., qₖ, stacked as a matrix Q
the key-value mapping for all queries is given by Q·Kᵀ·V.
Vocabulary: ⟨“man”, “woman”, “child”⟩
Query vectors: “man” and “woman”, stacked as a matrix Q
Key matrix and value matrix: as before

    Q = [ ⟨1, 0, 0⟩   “man”
          ⟨0, 1, 0⟩   “woman” ]

Q·Kᵀ = [ ⟨0.7, 0⟩   “man” is more like key “boy”
         ⟨0, 0.7⟩   “woman” is more like key “girl” ]

Q·Kᵀ·V = [ ⟨0.7, 0, 0.7⟩   output for “man” over ⟨child, female, male⟩
           ⟨0.7, 0.7, 0⟩   output for “woman” over ⟨child, female, male⟩ ]
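The same example in NumPy, just to verify the arithmetic:

```python
# The slide's key-value example in NumPy: queries stacked as Q, keys K, values V;
# the mapping for all queries is Q @ K.T @ V.
import numpy as np

K = np.array([[0.7, 0.0, 0.5],    # key "boy"   (over <man, woman, child>)
              [0.0, 0.7, 0.5]])   # key "girl"
V = np.array([[1.0, 0.0, 1.0],    # value for "boy"  (over <child, female, male>)
              [1.0, 1.0, 0.0]])   # value for "girl"
Q = np.array([[1.0, 0.0, 0.0],    # query "man"
              [0.0, 1.0, 0.0]])   # query "woman"

print(Q @ K.T)       # [[0.7 0. ] [0.  0.7]]: "man" ~ "boy", "woman" ~ "girl"
print(Q @ K.T @ V)   # [[0.7 0.  0.7] [0.7 0.7 0. ]]: outputs over <child, female, male>
```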