Deep Learning:
Special Architectures
CC-BY
Fabian M. Suchanek
Overview
->Deep-learning
Special architectures
Convolutional neural networks
Graph Convolutional Networks
Recurrent neural networks
Encoder‐decoders
Def: Filter
A filter is a matrix with an odd number of columns and an odd number of rows.
Filtering a matrix M with an m×n filter F is the process of computing a matrix M' with

  M'[i, j] = Σ ( F ⊗ M[ i−(m−1)/2 .. i+(m−1)/2 , j−(n−1)/2 .. j+(n−1)/2 ] )
[Figure: the filter 〈〈-1,0,1〉〉 applied to an example 4×8 binary matrix M, yielding M'; one entry of M' is computed as 0·(−1) + 1·0 + 1·1 = 1.]
(Here, ⊗ is the Hadamard product, which multiplies each matrix entry point-wise with the corresponding entry of the other matrix, yielding a matrix of the same size. M[...] extracts the sub-matrix around position (i, j), padding with the closest values from M if necessary. The sum runs over all entries of the resulting matrix.)
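To make the definition concrete, here is a minimal NumPy sketch (function and variable names are my own) that applies the example filter to the example matrix:

```python
import numpy as np

def filter_matrix(M, F):
    """Filter matrix M with a filter F of odd dimensions,
    padding borders with the closest values from M."""
    m, n = F.shape
    P = np.pad(M, ((m // 2, m // 2), (n // 2, n // 2)), mode="edge")
    Mp = np.empty_like(M, dtype=float)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            # Hadamard product of F with the sub-matrix around (i, j), then sum
            Mp[i, j] = np.sum(F * P[i:i + m, j:j + n])
    return Mp

M = np.array([[0,0,0,0,1,1,1,1],
              [0,0,1,1,1,1,0,0],
              [0,0,1,1,1,1,0,0],
              [0,0,0,0,1,1,1,1]])
F = np.array([[-1, 0, 1]])    # the example filter from the slide
print(filter_matrix(M, F))    # non-zero entries mark vertical boundaries
```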
Filters
Filters are commonly used in image processing to pre‐process an image:
• detect boundaries (as below)
• blur an image
• increase the contrast
• etc.
They can be generalized to work on matrices of any dimension (1, 2, 3, ...).
Our example filter detects vertical boundaries.
Filters used to be designed manually. Neural networks can learn them!
[Figure: the filter 〈〈-1,0,1〉〉 applied to the example matrix; the non-zero entries of M' trace the vertical boundaries.]
Neural Networks and Filters
Filter F = 〈〈-1,0,1〉〉
Neural Networks can apply a filter to a one-dimensional matrix as
illustrated here, if its first hidden layer has one perceptron per input:
• each perceptron is connected to 3 neighboring inputs (modulo the borders)
• the weights are given by F, and are the same for all perceptrons
• the bias is 0, the activation function is act(x) = x
[Figure: input 0 0 0 0 1 1 1 1; each hidden perceptron reads 3 neighboring inputs.]
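Here is a minimal NumPy sketch of this shared-weight layer (names are my own; the borders are handled by padding with the closest value, as in the filter definition):

```python
import numpy as np

def conv1d_layer(x, F):
    """One perceptron per input, each connected to len(F) neighbors,
    with shared weights F, bias 0, and identity activation."""
    k = len(F) // 2
    xp = np.pad(x, k, mode="edge")   # handle the borders by padding
    return np.array([np.dot(F, xp[i:i + len(F)]) for i in range(len(x))])

x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
F = np.array([-1, 0, 1])
print(conv1d_layer(x, F))    # spikes at the 0 -> 1 boundary
```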
Neural Networks and Filters
Neural Networks can apply a filter to a higher-dimensional matrix analogously to the one-dimensional case.
Filter F = 〈〈-1,0,1〉〉; the weights are given by F.
The input is a vector as always, but arranged in a matrix.
A filter plane (also: feature map) has one perceptron per input, each connected to neighboring inputs.
[Figure: a filter plane over the 4×8 example matrix.]
Neural Networks and Filters
With a 3×3 filter such as

  F = 〈〈-1,0,1〉, 〈-2,0,1〉, 〈-1,0,1〉〉

each perceptron is connected to 9 inputs, with weights given by F.
Intuition: each perceptron is connected to its “perception field”.
Def: Convolutional Layer
A convolutional layer of an FNN is a layer that consists of several filter planes. Each plane has a weight vector, which is used for all of its perceptrons.
Each filter plane will learn its own weights (= its filter).
More planes will allow more filters to be learned.
Filters are great to spot invariants, i.e., patterns that occur in several places of the input.
Convolutional layers can also take their input from a hidden layer.
Convolutional Layer
A convolutional layer is thus a layer with a special architecture and shared weights across all perceptrons in a filter plane. This reduces the complexity of the network, because there are fewer free parameters than in a fully connected FNN.
Different filter planes will learn to detect different features of the input.
The number of filter planes per layer is a hyperparameter to be chosen.
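As a sketch (my own minimal NumPy rendering, not the lecture's code), a convolutional layer with several 3×3 filter planes looks like this:

```python
import numpy as np

def conv_layer(X, filters, act=lambda v: np.maximum(v, 0)):
    """Convolutional layer: one output plane per 3x3 filter, with the
    weights shared across all perceptrons of a plane."""
    P = np.pad(X, 1, mode="edge")            # pad the borders
    out = np.empty((len(filters),) + X.shape)
    for p, F in enumerate(filters):
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                out[p, i, j] = np.sum(F * P[i:i+3, j:j+3])
    return act(out)

X = np.random.rand(4, 8)                # input arranged as a matrix
filters = np.random.randn(5, 3, 3)      # 5 filter planes = the hyperparameter
print(conv_layer(X, filters).shape)     # (5, 4, 8): one plane per filter
```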
Def: Pooling Layer
A pooling layer (also: subsampling layer) of width d is a neural network layer that has one perceptron for each non-overlapping d×d square of perceptrons of the preceding layer. This perceptron computes the maximum or the average of these d×d perceptrons.
Pooling makes the following layer much smaller than the preceding one.
The functions of the pooling perceptrons are fixed, and are not learned.
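A minimal NumPy sketch of max pooling (names are my own):

```python
import numpy as np

def max_pool(X, d=2):
    """Max pooling of width d: one output per non-overlapping dxd square."""
    r, c = X.shape[0] // d, X.shape[1] // d
    # reshape into dxd blocks and take the maximum of each block
    return X[:r*d, :c*d].reshape(r, d, c, d).max(axis=(1, 3))

X = np.arange(16).reshape(4, 4)
print(max_pool(X, 2))    # [[ 5  7] [13 15]]
```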
Def: Convolutional Neural Network
A convolutional neural network (CNN) is an FNN with convolutional layers, usually followed by pooling layers and fully connected layers.
[Figure: a typical CNN architecture. Image: Aphex34]
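To see how the pieces fit together, here is a sketch of one CNN forward pass, assuming the conv_layer and max_pool sketches above are in scope (the 10-class dense layer is my own invented example):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.random((8, 8))                     # input image
filters = rng.standard_normal((4, 3, 3))   # 4 filter planes (to be learned)

planes = conv_layer(X, filters)            # convolutional layer: (4, 8, 8)
pooled = np.stack([max_pool(p) for p in planes])   # pooling layer: (4, 4, 4)
flat = pooled.reshape(-1)                  # flatten for the dense part

W = rng.standard_normal((10, flat.size))   # fully connected layer, 10 classes
print(W @ flat)                            # class scores
```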
CNN Example
GoogLeNet is a convolutional neural network for image recognition.
[Figure: the GoogLeNet architecture. Leonardo Araujo dos Santos: Artificial Intelligence]
>RNN
Overview
Special architectures
Convolutional neural networks
Graph Convolutional Networks
Recurrent neural networks
Encoder‐decoders
->Deep-learning
Graphs with Recursive Properties
In some graphs, the properties of one node depend on the properties
of the neighboring nodes
• a Web page is popular if popular pages link to it (PageRank)
• a disambiguation is correct if its neighboring disambiguations are
• a variable is true if certain variables in certain logical formulae are true
• a user’s interest is influenced by the interests of their friends
How can we compute the properties of a node, if they depend
recursively on the properties of the other nodes?
Basic Idea
Every node i in the graph has a feature vector fᵢ.
[Figure: a graph of three nodes with feature vectors 〈2,1〉, 〈1,3〉, and 〈3,0〉.]
We update every vector fᵢ by replacing it with the average of its neighboring vectors and itself:

  fᵢ ← ( fᵢ + Σ_{j∈N(i)} fⱼ ) / ( |N(i)| + 1 ),    where N(i) are the neighbors of node i.
[Figure: after one update, the feature vectors become 〈2.5,0.5〉, 〈2,1.5〉, and 〈2,1.3〉.]
Basic Idea in Matrix Form
The update can be written as F' = D⁻¹ A F, where
• F is the n×d matrix of stacked feature vectors (one row per node)
• A is the n×n adjacency matrix, where the diagonal is set to 1
• D is the diagonal matrix that contains for each node the number of incoming links in A; D⁻¹ replaces each entry d of D by 1/d.

For the example graph (node B linked to both A and C):

      | 2 1 |          | 1 1 0 |            | 5 1 |
  F = | 3 0 |      A = | 1 1 1 |      A F = | 6 4 |    (“sum of neighbors”)
      | 1 3 |          | 0 1 1 |            | 4 3 |
Multiplying by D⁻¹ turns this sum of neighbors into an average:

            | ½ 0 0 |   | 5 1 |   | 2.5  0.5 |
  D⁻¹ A F = | 0 ⅓ 0 | · | 6 4 | = | 2   1.33 | = F'
            | 0 0 ½ |   | 4 3 |   | 2    1.5 |
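This update is easy to check in NumPy (a minimal sketch reproducing the numbers above):

```python
import numpy as np

F = np.array([[2., 1.], [3., 0.], [1., 3.]])     # features of nodes A, B, C
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])  # adjacency, diagonal set to 1
D_inv = np.diag(1 / A.sum(axis=1))               # 1/d for each node

F_new = D_inv @ A @ F
print(F_new)    # [[2.5, 0.5], [2, 1.33], [2, 1.5]]
```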
Basic Idea as CNN
The update can be computed by a convolutional neural network of two layers, where each graph node becomes d feature nodes, and is connected to its neighbors on the lower level. The values of the upper layer are given by

  σ( D⁻¹ A F ),

where σ is the activation function.
[Figure: lower layer and upper layer; each upper node is connected to its own and its neighbors' feature nodes below.]
Graph Convolutional Network
Given a graph G = (V, E), where each node v ∈ V has a feature vector fᵥ, a Graph Convolutional Network (GCN) is a neural network with d×|V| nodes on each layer, whose activations on layer l are given by

  H^(l+1) = σ( D^(-1/2) A D^(-1/2) H^(l) W^(l) ),

where A and D are as before, H^(0) = F, and W^(l) is the learned weight matrix of layer l. (The factors D^(-1/2) ... D^(-1/2) are an additional normalization, replacing the D⁻¹ from before.)
[Figure: the feature vectors 〈2,1〉, 〈1,3〉, 〈3,0〉 are transformed layer by layer, e.g., into 〈3.2,0.1〉, 〈0.4,1.8〉, 〈0.22,0.7〉, ...]
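A minimal NumPy sketch of one GCN layer (names are my own; tanh is a stand-in activation):

```python
import numpy as np

def gcn_layer(H, A, W, act=np.tanh):
    """One GCN layer: H' = act(D^(-1/2) A D^(-1/2) H W)."""
    D_half = np.diag(A.sum(axis=1) ** -0.5)    # normalization D^(-1/2)
    return act(D_half @ A @ D_half @ H @ W)

A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)  # diagonal set to 1
H = np.array([[2., 1.], [3., 0.], [1., 3.]])                  # initial features
W = np.random.randn(2, 2)                                     # learned weights
print(gcn_layer(H, A, W))    # a new 2-dimensional feature vector per node
```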
Overview
->Deep-learning
Special architectures
Convolutional neural networks
Graph Convolutional Networks
Recurrent neural networks
Encoder‐decoders
Dealing with input sequences
If the input to a network is a text (= a sequence of words = a sequence of embedding vectors), we have to compress the input vectors into a single vector.
The easiest way to do this is to average the input vectors:

  Input:       Elvis            loves            Priscilla
  Embeddings:  〈0.3, 0.2, ...〉  〈0.2, 0.3, ...〉  〈0.7, 0.1, ...〉
  Average:     〈0.4, 0.2, ...〉

This may not be the best way, though...
Def: Recurrent Neural Networks
A recurrent neural network (RNN) is a neural network that works on a sequence of input vectors x₁, ..., xₙ, yielding a sequence of outputs h₁, ..., hₙ. In an RNN, a perceptron can receive as input the activation that another perceptron computed for the previous input vector.
[Figure: a very simple RNN with only one layer.]
Recurrent Neural Networks
More concisely, with weight matrices V, W:

  hₜ = act( V xₜ + W hₜ₋₁ )
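A minimal NumPy sketch of this recurrence (my own names; act, V, W as above):

```python
import numpy as np

def rnn(xs, V, W, act=np.tanh):
    """Run a simple one-layer RNN over a sequence of input vectors.
    Each output depends on the current input and the previous output."""
    h = np.zeros(W.shape[0])
    hs = []
    for x in xs:
        h = act(V @ x + W @ h)    # h_t = act(V x_t + W h_(t-1))
        hs.append(h)
    return hs

d, b = 4, 3                                    # input and output dimensions
xs = [np.random.randn(d) for _ in range(5)]    # e.g., 5 word embeddings
V, W = np.random.randn(b, d), np.random.randn(b, b)
for h in rnn(xs, V, W):
    print(h)                                   # one output vector per input
```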
RNN example
RNNs can be used for Machine Translation:
[Figure from Richard Socher: CS224n: Natural Language Processing with Deep Learning]
>LSTM
LSTM Networks
A long short-term memory network (LSTM) is an RNN of the following form:
[Figure: the LSTM cell (Colah's blog: Understanding LSTM Networks)]
• xₜ is the input at time t (an a-dimensional vector)
• Cₜ is the cell state at time t
• hₜ is the output at time t
• 〈 ; 〉 is vector concatenation
• σ and tanh are activation functions, applied point-wise to a vector
• W_f, W_i, W_C, W_o are learned (a+b)×b weight matrices
• ⊕ is point-wise vector addition
• ⊗ is point-wise vector multiplication
LSTM States
Cₜ is the cell state: a b-dimensional vector that gets mainly passed on from one step to the next. It “remembers” information from previous inputs (“long term memory”).
hₜ is the output vector: a b-dimensional vector that is the current output for the current input xₜ.
[Figure: the LSTM cell, with the cell state and the output highlighted.]
LSTM Forget Gate
The forget gate takes the input xₜ and the previous output hₜ₋₁ as a concatenated vector 〈hₜ₋₁; xₜ〉 and computes a b-dimensional binary vector

  fₜ = σ( 〈hₜ₋₁; xₜ〉 W_f ).

This vector is multiplied with Cₜ₋₁, thus setting some components of Cₜ₋₁ to zero (“forgetting” parts of the long-term memory).
[Figure: the LSTM cell, with the forget gate highlighted.]
LSTM Input Gate
The input gate takes the input xₜ and the previous output hₜ₋₁ as a concatenated vector and does two things:
1) with W_C, it computes an update vector C̃ₜ = tanh( 〈hₜ₋₁; xₜ〉 W_C ) that will be added to the cell state
2) with W_i, it computes a binary vector iₜ = σ( 〈hₜ₋₁; xₜ〉 W_i ) that decides which components of C̃ₜ to keep.
The resulting vector iₜ ⊗ C̃ₜ is then added to fₜ ⊗ Cₜ₋₁, yielding the new cell state Cₜ.
[Figure: the LSTM cell, with the input gate highlighted.]
LSTM Output Gate
The output gate also does two things:
1) It takes hₜ₋₁ and xₜ, and computes an output vector oₜ = σ( 〈hₜ₋₁; xₜ〉 W_o )
2) It modifies tanh(Cₜ) by multiplying it with oₜ, yielding the output hₜ = oₜ ⊗ tanh(Cₜ).
[Figure: the LSTM cell, with the output gate highlighted.]
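Collecting the three gates, here is a minimal NumPy sketch of one LSTM step (my own function names; weight shapes (a+b)×b as defined above, and sigmoid written out explicitly):

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def lstm_step(x, h_prev, C_prev, Wf, Wi, WC, Wo):
    """One LSTM step. x: a-dim input; h_prev, C_prev: b-dim vectors;
    all weight matrices have shape (a+b, b)."""
    z = np.concatenate([h_prev, x])   # <h_(t-1); x_t>, the concatenated vector
    f = sigmoid(z @ Wf)               # forget gate: what to erase from C
    i = sigmoid(z @ Wi)               # input gate: which updates to keep
    C_tilde = np.tanh(z @ WC)         # candidate update vector
    C = f * C_prev + i * C_tilde      # new cell state (point-wise ops)
    o = sigmoid(z @ Wo)               # output gate
    h = o * np.tanh(C)                # new output
    return h, C

a, b = 4, 3
Wf, Wi, WC, Wo = (np.random.randn(a + b, b) for _ in range(4))
h, C = np.zeros(b), np.zeros(b)
for x in [np.random.randn(a) for _ in range(5)]:   # a sequence of 5 inputs
    h, C = lstm_step(x, h, C, Wf, Wi, WC, Wo)
print(h, C)
```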
>encDec
Overview
->Deep-learning
Special architectures
Convolutional neural networks
Graph Convolutional Networks
Recurrent neural networks
Encoder‐decoders
Encoder‐Decoder
An encoder‐decoder network is a neural network that encodes a variable‐length sequence into a finite vector, and then decodes that vector into a variable‐length output sequence.
[Figure: an encoder (e.g., LSTMs) reads the input sequence up to a special end‐of‐sequence symbol (EOS). The finite cell state vector, which encodes all of the input sequence, is passed to the decoder (which can also be LSTMs); the decoder then produces the output sequence until it emits EOS.]
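A minimal sketch of this idea, using simple RNN cells instead of LSTMs for brevity (all names and the stopping test are my own simplifications):

```python
import numpy as np

def rnn_step(x, h, V, W):
    return np.tanh(V @ x + W @ h)

d, b = 4, 4
EOS = np.zeros(d)                          # special end-of-sequence vector
Venc, Wenc = np.random.randn(b, d), np.random.randn(b, b)
Vdec, Wdec = np.random.randn(b, d), np.random.randn(b, b)
U = np.random.randn(d, b)                  # maps decoder state to an output

# Encoder: compress the variable-length input into one state vector
state = np.zeros(b)
for x in [np.random.randn(d) for _ in range(6)] + [EOS]:
    state = rnn_step(x, state, Venc, Wenc)

# Decoder: unroll the state into an output sequence, feeding each output
# back in as the next input, until an EOS-like output (here: small norm)
y = EOS
for _ in range(10):                        # cap the output length
    state = rnn_step(y, state, Vdec, Wdec)
    y = U @ state
    print(y)
    if np.linalg.norm(y) < 0.1:            # stand-in for emitting EOS
        break
```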
Summary: Deep Learning Architectures
Important neural network architectures are
• convolutional neural networks (CNNs) to apply “filters” to an input
• graph‐convolutional networks (GCNs) to compute node features on a graph
• recurrent neural networks (RNNs) to process sequences
• among the RNNs, long short‐term memory (LSTM) networks capture long-distance relationships
->Deep-learning
->Transformers