Deep Learning:
Neural Networks
CC-BY
Fabian M. Suchanek
Overview
Neural Networks
Training a neural network
The naughty details
Def: Neural Networks
A Neural Network is a set of perceptrons, each of which can take the output of another perceptron as input.
[Figure: an "XOR" network built from perceptrons: the two inputs feed into an "OR" perceptron and an "AND" perceptron (biases b=0 and b=1), whose outputs are combined by an "AND NOT" perceptron (b=0), so that
neuralNetwork(x) = perceptron_"AND NOT"(perceptron_"OR"(x), perceptron_"AND"(x)) computes "XOR"]
(We will later generalize this definition of neural networks.)
Neural Networks
[Walk-through example: the input (1, 0) is fed into the XOR network above, and the input values are propagated "forward" through the network. Each perceptron computes the weighted sum of its inputs, and applies the bias and the activation function: the "OR" perceptron outputs 1, the "AND" perceptron outputs 0, and the "AND NOT" perceptron combines the two into the final output 1 = XOR(1, 0).]
Def: Activation functions
• Step function: step(x) = x>0 ? 1 : 0
  + simple to compute
  − not differentiable
• Rectified Linear Unit (ReLU): relu(x) = max(0, x)
  + easy to compute
  − not differentiable (at 0)
• Sigmoid function: sigmoid(x) = 1/(1+e^(−x))
  + differentiable
  − not centered
• Tanh: tanh(x) = (e^x − e^(−x))/(e^x + e^(−x))
  + differentiable
  + centered
Recall: perceptron(x) = act(weighted sum of the inputs, plus the bias)
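A small NumPy sketch of these activation functions:

import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)           # 1 if x>0, else 0

def relu(x):
    return np.maximum(0.0, x)                  # max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # differentiable, outputs in (0, 1)

def tanh(x):
    return np.tanh(x)                          # differentiable, centered, outputs in (-1, 1)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for act in (step, relu, sigmoid, tanh):
    print(act.__name__, np.round(act(x), 3))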
Neural Networks
Neural Networks can also
• take any real values as input (not just 0 and 1)
• have different activation functions in different perceptrons
• produce any real value as output (e.g., with act(x)=x )
• have several outputs (i.e., a vector of real numbers)
• have several hidden layers
• have any type of connections between perceptrons (not just in layers)
Neural Networks and Features
Classical Machine Learning: 90% feature engineering, 10% learning.
Neural Networks: the raw, unpreprocessed input goes in; the neurons automatically find and specialize on features ("Grandma neuron" fires for grandma); the output is ready for consumption.
The Crux with Neural Networks
Neural Networks have the crucial advantage that they can eliminate much of the feature engineering. However,
• they require a huge amount of training data due to the curse of dimensionality
• they require a decision on the architecture
• the optimization is non-convex, and may find a local optimum
  (a large array of tricks has been developed to avoid this)
• the deeper the network, the smaller the gradient, and the less signal there is
  to change the weights ("vanishing gradient problem", less severe with ReLU)
Def: Feed Forward Neural Networks
A feed forward neural network (FNN) is a neural network without cycles, where the output perceptrons are those without successors.
Fully Connected FNNs
A fully connected FNN (FFNN) is an FNN where the perceptrons are partitioned into layers, so that each layer is fully connected to the following layer (and nowhere else), and all inputs are connected to all perceptrons of the first layer. Every FNN can be seen as an FFNN.
A fully connected layer is also called a dense layer.
[Figure: an FFNN with an input "layer", hidden layers, and an output layer]
Fully Connected FNNs
The weights between Layer k and Layer k+1 in an FFNN can be seen as a matrix W, where W_ij is the weight to perceptron i in Layer k+1 from perceptron j in Layer k.
[Figure: Layer 1 with perceptrons 1–4 (n=4), Layer 2 with perceptrons 1–3 (m=3)]
W is an m×n matrix: its m rows correspond to the higher layer ("to where"), its n columns to the lower layer ("from where").
Fully Connected FNNs
[Figure: Layer 1 with perceptrons 1–4, Layer 2 with perceptrons 1–3, connected by the m×n weight matrix W (m rows: higher layer, "to where"; n columns: lower layer, "from where")]
Assume there are no biases, and all perceptrons use the same activation function act. Then for the vector a of activations of the perceptrons in Layer k, the perceptrons in the next layer compute the vector act(W × a), where × is the matrix product.
Fully Connected FNNs
In an FFNN, each layer i=1,...,n has a weight matrix W_i. Assuming that there are no biases, and that all perceptrons have the same activation function act, the FNN computes
  output(x) = act(W_n × act(W_{n−1} × ... act(W_1 × x) ... ))
[Figure: Layer 0 (the inputs), Layer 1 with perceptrons 1–4, Layer 2 with perceptrons 1–3, Layer 3 (=n, the outputs)]
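A minimal NumPy sketch of this computation (no biases, the same activation function everywhere); the layer sizes and the random weights are assumptions for illustration:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def ffnn(weight_matrices, x, act=relu):
    a = x
    for W in weight_matrices:                  # W_1, ..., W_n
        a = act(W @ a)                         # act(W × a), with × the matrix product
    return a                                   # act(W_n × act(W_{n-1} × ... act(W_1 × x)...))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))                   # 3×4: from 4 inputs to 3 perceptrons
W2 = rng.normal(size=(2, 3))                   # 2×3: from 3 perceptrons to 2 outputs
print(ffnn([W1, W2], np.array([1.0, 0.0, 1.0, 1.0])))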
Def: Softmax
Softmax is a function that takes as input a vector x ∈ R^n and computes a vector y ∈ R^n with y_i = e^(x_i) / ∑_j e^(x_j). The function (1) amplifies large values and (2) makes sure that ∑_i y_i = 1.
[Figure: the FFNN from before (Layers 0–3), followed by a Softmax as Layer 4]
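A NumPy sketch of softmax (subtracting the maximum first is a standard trick for numerical stability and does not change the result):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))                  # e^(x_i), shifted for numerical stability
    return e / e.sum()                         # y_i = e^(x_i) / sum_j e^(x_j)

y = softmax(np.array([20.0, 0.0, -10.0, 6.0]))
print(np.round(y, 4), y.sum())                 # large values are amplified; the outputs sum to 1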
Def: Thresholding
Thresholding a value x ∈ R by a threshold θ ∈ R means returning x if x≥θ and 0 otherwise.
[Figure: the FFNN from before (Layers 0–3), followed by a Softmax as Layer 4 and by Thresholding with θ=0.8 as Layer 5]
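And a one-line sketch of thresholding, applied element-wise to the output of the softmax layer:

import numpy as np

def threshold(x, theta=0.8):
    return np.where(x >= theta, x, 0.0)        # keep x if x >= theta, else 0

print(threshold(np.array([0.93, 0.04, 0.02, 0.01])))   # only the confident value survives θ=0.8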
Approximation completeness
An FFNN computes a function f: R^n → R^m. With one hidden layer and enough perceptrons, it can approximate any continuous function with a compact (=closed and bounded) domain.
The hidden layer projects the input into a higher-dimensional space (like a kernel). As the dimension increases, every set becomes linearly separable (Cover's Theorem).
The hidden layer may have to be quite large
=> it is often preferable to add more layers instead.
Def: Auto‐Encoder
An auto-encoder is an FFNN with n inputs, one hidden layer with m<n perceptrons, and n outputs.
It is trained to reproduce the input, thus forcing it to reduce the n dimensions to m dimensions.
Training set: {
〈〈0,0,0,0〉, 〈0,0,0,0〉〉,
〈〈0,0,0,1〉, 〈0,0,0,1〉〉,
〈〈0,0,1,0〉, 〈0,0,1,0〉〉,
〈〈0,0,1,1〉, 〈0,0,1,1〉〉,
...
}
The auto-encoder is trained to reproduce the input faithfully.
=> the first weight matrix is a dimensionality reducer
=> it performs something comparable to PCA
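A minimal Keras sketch of such an auto-encoder (the sizes n=4 and m=2, the activations, the optimizer, and the loss are assumptions for illustration):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n, m = 4, 2
model = Sequential()
model.add(Dense(m, activation='sigmoid', input_shape=(n,)))   # hidden layer: n -> m dimensions
model.add(Dense(n, activation='sigmoid'))                     # output layer: m -> n dimensions

# train the network to reproduce its own input (input = desired output)
X = np.array([[int(b) for b in format(i, '04b')] for i in range(2 ** n)], dtype=float)
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, epochs=500, verbose=0)
print(np.round(model.predict(X[:4]), 2))       # should be close to the first four inputs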
Overview
Neural Networks
Training a neural network
The naughty details
Training a Neural Network
Given a training set of pairs 〈x, y〉 with x ∈ R^n and y ∈ R^m, we want to find an FFNN that, given any x, computes y.
1. Decide the hyperparameters: the architecture (number of perceptrons and connections),
   differentiable activation functions, and the learning rate α ∈ R. We assume no biases.
2. Initialize all weights with small random values.
3. Use backpropagation.
Backpropagation
Backpropagation is an algorithm for training an FFNN that is based on
• a weight matrix W, where w_ij is the weight from perceptron i to perceptron j
• the network function f_W, which computes the output of the FFNN for a given input x,
  assuming that the weights are given by W
• a loss function that, given an actual output o and a desired output y,
  computes an error, e.g., the squared error loss(o, y) = ½ ‖o − y‖²
• the error function of the network, E = loss(f_W(x), y) for a training example 〈x, y〉.
[Figure: an example error function E, for a fixed x and a fixed y, plotted as a function of the weights]
Minimizing the error
Backpropagation aims to minimize the error function E for a given training example through gradient descent. For this, it computes for each weight w_ij the derivative ∂E/∂w_ij.
We want to find how much each weight is responsible for the error, and adjust these weights.
Computing the derivative of E
∂E/∂w_ij = (∂E/∂o_j) · (∂o_j/∂net_j) · (∂net_j/∂w_ij)    (using the chain rule)
where
• net_j = ∑_i w_ij o_i is the net input of perceptron j
• o_j = act(net_j) is the output of perceptron j
Computing the derivative of E
∂E/∂w_ij = (∂E/∂o_j) · (∂o_j/∂net_j) · (∂net_j/∂w_ij)    (using the chain rule)
We have ∂E/∂o_j = o_j − y_j for output perceptrons j (with the squared-error loss).
For the perceptrons in lower layers, ∂E/∂o_j can be computed from the gradients in the higher layer. Hence they are "back"-propagated.
Backpropagation
Backpropagation is an algorithm that, given an FFNN, a training set, and a learning rate α ∈ R, adjusts the weights so that the FFNN approximates the training set:
• Until the error is sufficiently small
  • For each training example 〈x, y〉
    • Compute the output o of the network
    • Compute the gradient of the error function for each weight
    • Subtract α times the gradient from each weight
Backpropagation does gradient descent on the error function.
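A small NumPy sketch of backpropagation for a 2-layer FFNN with sigmoid activations, no biases, and the squared-error loss; the layer sizes, the XOR training data, and α=0.5 are assumptions for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))            # small random initial weights
W2 = rng.normal(scale=0.5, size=(1, 3))
alpha = 0.5                                        # learning rate

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)    # XOR as the training set

for epoch in range(10000):
    for x, y in zip(X, Y):
        a1 = sigmoid(W1 @ x)                       # forward pass through the hidden layer
        o  = sigmoid(W2 @ a1)                      # output of the network
        delta2 = (o - y) * o * (1 - o)             # dE/dnet at the output (E = 1/2 ||o - y||^2)
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # gradients "back"-propagated to the hidden layer
        W2 -= alpha * np.outer(delta2, a1)         # subtract alpha times the gradient
        W1 -= alpha * np.outer(delta1, x)

print(np.round(sigmoid(W2 @ sigmoid(W1 @ X.T)), 2))   # should approach 0, 1, 1, 0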
Overview
Neural Networks
Training a neural network
The naughty details
The naughty details
Designing and training a neural network is an art that has a huge impact on the performance. [Ruffinelli et al. @ ICLR 2020]
Overfitting
Overfitting means that a more complex model (more neurons) will model the training data with higher accuracy, but might not generalize well on the testing data.
[Figure: "Training" and "Testing" scatter plots of + and − examples illustrating overfitting]
Network Design
• Which type of architecture? RNN, CNN, FFNN, ... -> start with a simple FFNN
• How many layers? -> start with 1 or 2 layers
  It is more effective to increase the layers than the neurons per layer
• How many neurons per layer? -> equally many (around 30),
  or few at the top and many at the bottom ("pyramid rule")
more layers and/or more neurons ≈ more complex network ≈ higher training accuracy ≈ danger of overfitting
Activation Functions
The choice of the activation function of the final layer depends on the task:
• binary classification:
  - with one neuron: sigmoid
  - with two neurons: softmax
• classification with more than two classes: softmax
• classification with more than one correct class per instance: sigmoid
• regression: no activation
[Figure: the vector (20, 0, −10, 6) under Softmax (≈ 〈1, 0, 0, 0〉) and under element-wise Sigmoid (≈ 〈1, 0.5, 0, 1〉)]
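In Keras, these choices for the final layer could look roughly as follows (a sketch; k stands for the assumed number of classes or labels):

from tensorflow.keras.layers import Dense

k = 5
binary_output     = Dense(1, activation='sigmoid')   # binary classification with one neuron
two_neuron_output = Dense(2, activation='softmax')   # binary classification with two neurons
multiclass_output = Dense(k, activation='softmax')   # more than two classes
multilabel_output = Dense(k, activation='sigmoid')   # more than one correct class per instance
regression_output = Dense(1)                         # regression: no activation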
Thresholding
The threshold θ determines how lightheartedly the final layer of the network chooses a class rather than just saying "no class".
high θ ≈ fewer predictions ≈ higher accuracy
The threshold can also be tuned on the training data.
[Figure: the activations of the last-layer neurons, compared against the threshold θ]
Batch size
Due to memory limitations, the training data is usually split into batches.
larger batches ≈ higher training accuracy ≈ overfitting
32 examples is a good maximum batch size for images, but there are diverse studies on the effect of batch size on training time and accuracy.
[Figure: the training data is split into batches; the neural network learns/updates its weights after each batch; one pass over all batches is one epoch]
Learning Rate
The learning rate determines how much the weights are changed with each example.
small learning rate = high-resolution sampling of the error curve ≈ danger of local optima
large learning rate = coarse sampling of the error curve ≈ danger of missing an optimum
[Figure: the network error as a function of the configuration of weights, with a global optimum and a local optimum; a small learning rate takes small steps along the curve, a large learning rate takes large steps]
Input
A network for an NLP task can be given different types of input:
• all words
• all words without stopwords
• the words plus their POS tags
• a grammatical structure
• different pre‐trained embeddings (Word2Vec, GloVe, FastText,...)
Start simple, and think about what would (or would not) help you as a human to solve the problem.
In practice
Designing and training neural networks can be done by libraries such as Tensorflow or PyTorch.
# simple feed-forward network (Keras Sequential API)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential()
model.add(Embedding(len(vocab), embdim, input_length=...))   # word embedding layer
model.add(GlobalAveragePooling1D())                          # average all embeddings
model.add(Dense(32, activation='relu'))                      # one hidden layer
model.add(Dense(number_of_labels, activation='sigmoid'))     # final layer, one neuron per label
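To actually train such a model, one would then compile and fit it, roughly like this (the optimizer, the loss, and the variable names X_train/Y_train are assumptions):

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=32, epochs=10, validation_split=0.1)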
Summary: Neural Networks
Neural networks are combinations of perceptrons
->Architectures
->Embeddings
->Deep-learning
->Transformers