Deep Learning:
Neural Networks
CC-BY
Fabian M. Suchanek
Overview
Neural Networks
Training a neural network
The naughty details
Def: Neural Networks
A Neural Network is a set of perceptrons, each of which can take the
output of another perceptron as input.

[Figure: an XOR network. The two inputs feed an "OR" perceptron and an "AND" perceptron in the hidden layer (each with its bias b), whose outputs feed an "AND NOT" output perceptron that computes "XOR".]

neuralNetwork(x) = perceptron(perceptron(x), perceptron(x))
(We will later generalize this definition of neural networks.)
Neural Networks

A Neural Network is a set of perceptrons, each of which can take the
output of another perceptron as input.

[Figure: the XOR network on the input (1, 0). The input values are propagated "forward" through the network.]

Each perceptron computes the weighted sum of its inputs...
...and applies the bias and the activation function.
On the input (1, 0), the "OR" perceptron outputs 1 and the "AND" perceptron outputs 0,
so the "AND NOT" output perceptron outputs 1, which is the value of XOR(1, 0).

neuralNetwork(x) = perceptron(perceptron(x), perceptron(x))
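To make the walkthrough concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) of the XOR network above. It assumes a step activation that fires when the weighted sum plus the bias is positive; the weights and biases are one possible choice and may differ from the convention used in the figure.

import numpy as np

def perceptron(weights, bias, inputs):
    # step perceptron: fires (returns 1) if the weighted sum plus bias is positive
    return 1 if np.dot(weights, inputs) + bias > 0 else 0

def xor_network(x1, x2):
    # two-layer network: OR and AND in the hidden layer, "AND NOT" on top
    h_or  = perceptron([1, 1],  0, [x1, x2])      # "OR"
    h_and = perceptron([1, 1], -1, [x1, x2])      # "AND"
    return perceptron([1, -1], 0, [h_or, h_and])  # h_or AND NOT h_and

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"XOR({a},{b}) = {xor_network(a, b)}")  # 0, 1, 1, 0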
Neural Networks
A Neural Network is a set of perceptrons, each of which can take the
output of another perceptron as input. Neural Networks can also
• take any real values as input (not just 0 and 1)
• have different activation functions in different perceptrons
• produce any real value as output (e.g., with act(x)=x )
• have several outputs (i.e., a vector of real numbers)
• have several hidden layers
• have any type of connections between perceptrons (not just in layers)
Neural Networks and Features
Classical Machine Learning: 90% feature engineering, 10% learning.

Neural Networks: raw, unpreprocessed input; the neurons automatically find
and specialize on features (a "Grandma neuron" fires for grandma); the
output is ready for consumption.
The Crux with Neural Networks
Neural Networks have the crucial advantage that they can eliminate
much of the feature engineering. However,
• they require a huge amount of training data, due to the curse of dimensionality
• they require a decision on the architecture
• the optimization is non-convex and may find a local optimum
  (a large array of tricks has been developed to avoid this)
• the deeper the network, the smaller the gradient, and the less signal there is
  to change the weights (the "vanishing gradient problem", less pronounced with ReLU)
Def: Feed Forward Neural Networks
A feed forward neural network (FNN) is a neural network without cycles,
where the output perceptrons are those without successors.
Fully Connected FNNs
A fully connected FNN (FFNN) is an FNN where the perceptrons are
partitioned into layers, so that each layer is fully connected to the
following layer (and nowhere else), and all inputs are connected to all
perceptrons of the first layer. Every FNN can be seen as an FFNN.
A fully connected layer is also called a dense layer.

[Figure: an FFNN with an input "layer", hidden layers, and an output layer]
Fully Connected FNNs
The weights between Layer k and Layer k+1 in an FFNN can be seen as
an m×n matrix W, where W_ij is the weight to perceptron i in Layer k+1
from perceptron j in Layer k. The m rows correspond to the higher layer
("to where"), and the n columns to the lower layer ("from where").

[Figure: Layer 1 with n=4 perceptrons, Layer 2 with m=3 perceptrons, and the 3×4 weight matrix W between them]
Fully Connected FNNs
Assume there are no biases, and all perceptrons use the same activation
function act. Then, for the vector a_k of activations of the perceptrons
in Layer k, the perceptrons in Layer k+1 compute the vector

  a_{k+1} = act(W × a_k)

(where act is applied component-wise and × is the matrix product).

[Figure: Layer 1 with 4 perceptrons fully connected to Layer 2 with 3 perceptrons]
Fully Connected FNNs
In an FFNN, each layer i=1,...,n has a weight matrix W_i. Assuming that
there are no biases, and that all perceptrons have the same activation
function act, the FNN computes

  f(x) = act(W_n × ... act(W_2 × act(W_1 × x)) ...)

[Figure: an FFNN with Layer 0 (the input), Layer 1, Layer 2, and Layer 3 (= n)]
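A minimal NumPy sketch (my own illustration) of this layer-by-layer computation, assuming no biases, a shared ReLU activation, and randomly chosen weight matrices:

import numpy as np

def act(x):
    return np.maximum(x, 0)            # ReLU, applied component-wise

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))       # 3×4: from Layer 0 (4 inputs) to Layer 1 (3 perceptrons)
W2 = rng.standard_normal((2, 3))       # 2×3: from Layer 1 to Layer 2 (2 outputs)

x = np.array([1.0, 0.0, 2.0, 1.0])     # input vector
output = act(W2 @ act(W1 @ x))         # f(x) = act(W2 × act(W1 × x))
print(output)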
Def: Softmax
Softmax is a function R^n → R^n that takes as input a vector x and
computes a vector y with y_i = e^{x_i} / Σ_j e^{x_j}. The function
(1) amplifies large values and (2) makes sure that Σ_i y_i = 1.

[Figure: the FFNN from before, with a Softmax applied after Layer 3 to produce Layer 4]
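A small NumPy sketch of softmax (the subtraction of the maximum is only for numerical stability and does not change the result):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()          # values sum to 1, large values are amplified

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.66, 0.24, 0.10]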
Def: Thresholding
Thresholding a value x ∈ R by a threshold θ ∈ R means
returning x if x ≥ θ and 0 otherwise.

[Figure: the FFNN from before, with Softmax as Layer 4 and thresholding with θ=0.8 as Layer 5]
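A small NumPy sketch of thresholding, applied to a softmax output with θ=0.8 as in the figure:

import numpy as np

def threshold(x, theta=0.8):
    return np.where(x >= theta, x, 0.0)   # keep x if x >= theta, else 0

probs = np.array([0.05, 0.85, 0.10])      # e.g. a softmax output
print(threshold(probs))                   # [0.  0.85 0.] -> only class 1 is predicted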
Approximation completeness
An FFNN computes a function from input vectors to output vectors. With one
hidden layer, enough perceptrons, and full connections, it can approximate
any continuous function with a compact (=closed and bounded) domain.

For boolean functions (from {0,1}^n to {0,1}), this is easy to see: for each
positive input, create a hidden perceptron (with suitable weights and bias)
that fires exactly on that input. The output perceptron has all weights 1
and bias 0.

[Figure: one hidden perceptron per positive input, each connected with weight 1 to the output perceptron]

The case for real inputs and boolean output is similar. The hidden layer
projects the input into a higher-dimensional space (like a kernel). As the
dimension increases, every set becomes linearly separable (Cover's Theorem).
The hidden layer may have to be quite large
=> it is often preferable to add more layers instead
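The following NumPy sketch (my own illustration, with one possible choice of weights and biases) carries out this construction for an arbitrary boolean function given by its positive inputs:

import numpy as np

def step(s):
    return (s > 0).astype(int)   # fires if the weighted sum plus bias is positive

def boolean_network(positive_inputs):
    # one hidden perceptron per positive input, firing exactly on "its" input
    X = np.array(positive_inputs)
    W_hidden = 2 * X - 1                    # +1 where the input bit is 1, -1 where it is 0
    b_hidden = -X.sum(axis=1) + 0.5         # fires only if the input matches exactly
    def f(x):
        hidden = step(W_hidden @ np.array(x) + b_hidden)
        return int(hidden.sum() > 0)        # output perceptron: all weights 1, bias 0
    return f

xor = boolean_network([(0, 1), (1, 0)])     # the positive inputs of XOR
print([xor((a, b)) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]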
>details
>details&training
Def: Auto‐Encoder
An auto-encoder is an FFNN with n inputs, one hidden layer with m<n
perceptrons, and n outputs. It is trained to reproduce the input, thus
forcing it to reduce the n dimensions to m dimensions.

Training set: {
〈〈0,0,0,0〉, 〈0,0,0,0〉〉,
〈〈0,0,0,1〉, 〈...〉〉,
〈〈0,0,1,0〉, 〈...〉〉,
〈〈0,0,1,1〉, 〈...〉〉,
〈〈1,1,0,0〉, 〈...〉〉,
〈〈1,1,0,1〉, 〈...〉〉,
〈〈1,1,1,0〉, 〈...〉〉,
〈〈1,1,1,1〉, 〈...〉〉
}
Auto-Encoders

On this training set,
=> the auto-encoder learns that the first two input values are always equal,
   and it can reproduce the input faithfully
=> the first weight matrix is a dimensionality reducer
=> it performs something comparable to PCA
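A minimal Keras sketch of such an auto-encoder (my own illustration; the layer sizes match the example above, but the activations, optimizer, and number of epochs are assumptions):

import itertools
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# all inputs whose first two values are equal, as in the training set above
X = np.array([(a, a, c, d) for a, c, d in itertools.product([0, 1], repeat=3)], dtype=float)

model = Sequential([
    Dense(3, activation='sigmoid', input_shape=(4,)),  # hidden layer: 3 < 4 dimensions
    Dense(4, activation='sigmoid'),                    # output layer: reconstruct the input
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, X, epochs=2000, verbose=0)                # targets = inputs
print(model.predict(X).round(2))                       # should be close to X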
Overview
Neural Networks
Training a neural network
The naughty details
Training a Neural Network
Given a training set of pairs 〈x, y〉 with x ∈ R^n and y ∈ R^m, we
want to find an FFNN that, given any x, computes y.

1. Decide the hyperparameters: the architecture (number of
   perceptrons and connections), differentiable activation functions, and
   the learning rate α ∈ R. We assume no biases.
2. Initialize all weights with small random values.
3. Use backpropagation.
Backpropagation
Backpropagation is an algorithm for training an FFNN that is based on
• a weight matrix W, where w_ij is the weight from perceptron i to perceptron j
• the network function f_W, which computes the output of the FFNN
  for a given input x, assuming that the weights are given by W
• a loss function that, given an actual output o and a desired output t,
  computes an error, e.g., the squared error 1/2 · ||o − t||²
• the error function of the network, E(W) = loss(f_W(x), t).

[Figure: an example error function E for a fixed x and a fixed t, plotted over the weights]
Minimizing the error
Backpropagation aims to minimize the error function E for a given training
example 〈x, t〉 through gradient descent. For this, it computes for each
weight w_ij the derivative ∂E/∂w_ij.

We want to find how much each weight is responsible for the errors,
and adjust these weights.

[Figure: the error E plotted over a weight, with a gradient descent step]
Computing the derivative of E
∂E/∂w_ij = ∂E/∂o_j · ∂o_j/∂net_j · ∂net_j/∂w_ij   (using the chain rule)

where
• net_j = Σ_i w_ij · o_i is the net input of perceptron j
• o_j = act(net_j) is the output of perceptron j
Computing the derivative of E
∂E/∂w_ij = δ_j · o_i,  with  δ_j = ∂E/∂net_j   (using the chain rule)

We have δ_j = ∂E/∂o_j · act'(net_j) for output perceptrons j.

For the perceptrons in lower layers, δ_j = (Σ_k δ_k · w_jk) · act'(net_j)
can be computed from the gradients δ_k in the higher layer. Hence they
are "back"-propagated.
Backpropagation
Backpropagation is an algorithm that, given an FFNN, a training set,
and a learning rate α ∈ R, adjusts the weights so that the FFNN
approximates the training set:
• Until the error is sufficiently small:
  • For each training example 〈x, t〉:
    • Compute the output o of the network
    • Compute the gradient of the error function for each weight
    • Subtract α times the gradient from each weight

Backpropagation does gradient descent on the error function.
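A compact NumPy sketch of this algorithm (my own illustration) for a network with one hidden layer, sigmoid activations, and the squared error as loss; to keep the no-bias assumption, a constant 1 is appended to each input, which plays the role of a bias:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

# training set for XOR; the third input is the constant 1
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.standard_normal((3, 3))    # hidden layer: 3 perceptrons, 3 inputs
W2 = rng.standard_normal((1, 3))    # output layer: 1 perceptron
alpha = 0.5                         # learning rate

for epoch in range(20000):
    for x, t in zip(X, T):
        # forward pass
        h = sigmoid(W1 @ x)                           # hidden activations
        o = sigmoid(W2 @ h)                           # network output
        # backward pass: gradients of E = 1/2 * ||o - t||^2
        delta_out = (o - t) * o * (1 - o)             # delta of the output perceptron
        delta_hid = (W2.T @ delta_out) * h * (1 - h)  # deltas back-propagated to the hidden layer
        # subtract alpha times the gradient from each weight
        W2 -= alpha * np.outer(delta_out, h)
        W1 -= alpha * np.outer(delta_hid, x)

print(np.round(sigmoid(W2 @ sigmoid(W1 @ X.T)), 2))   # should be close to [[0, 1, 1, 0]]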
Overview
Neural Networks
Training a neural network
The naughty details
The naughty details
Designing and training a neural network is an art that has a huge impact
on the performance. [Ruffinelli et al. @ ICLR 2020]
Overfitting
Overfitting means that a more complex model (more neurons)
will model the training data with higher accuracy, but might not
generalize well to the testing data.

[Figure: a complex decision boundary separates all + and − points on the training data perfectly, but misclassifies several points on the testing data]
Network Design
• Which type of architecture? RNN, CNN, FFNN, ... -> start with a simple FFNN
• How many layers? -> start with 1 or 2 layers;
  it is more effective to increase the layers than the neurons per layer
• How many neurons per layer? -> equally many (around 30),
  or few at the top and many at the bottom ("pyramid rule")

more layers and/or more neurons ≈ more complex network ≈ higher training accuracy ≈ danger of overfitting
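For illustration, a Keras sketch following these heuristics (my own example; the input size of 10 features and the binary classification output are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# two hidden layers of around 30 neurons each ("start simple")
model = Sequential([
    Dense(30, activation='relu', input_shape=(10,)),
    Dense(30, activation='relu'),
    Dense(1, activation='sigmoid'),   # e.g., binary classification
])
model.summary()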
Activation Functions
The choice of the activation function of the final layer depends on
the task:
• binary classification:
  - with one neuron: sigmoid
  - with two neurons: softmax
• classification with more than two classes: softmax
• classification with more than one correct class per instance: sigmoid
• regression: no activation

[Figure: softmax and sigmoid applied to example activations of the final layer]
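As a sketch of these choices in Keras (the number of classes, 10, is made up):

from tensorflow.keras.layers import Dense

# final layers for different tasks (one of these would end the network):
binary_out     = Dense(1, activation='sigmoid')    # binary classification, one neuron
multiclass_out = Dense(10, activation='softmax')   # classification with more than two classes
multilabel_out = Dense(10, activation='sigmoid')   # several correct classes per instance
regression_out = Dense(1)                          # regression: no activation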
Thresholding
The threshold θ determines how lightheartedly the final layer of the
network chooses a class rather than just saying "no class".

high θ ≈ fewer predictions ≈ higher accuracy

The threshold can also be tuned on the training data.

[Figure: a threshold θ applied to the activations of the last-layer neurons]
Batch size
Due to memory limitations, the training data is usually split into batches.

larger batches ≈ higher train accuracy ≈ overfitting

32 examples is a good maximum batch size for images, but there are
diverse studies on the effect of batch size on training time and accuracy.

[Figure: the batches of training data are fed to the neural network one after the other to learn/update the weights; one pass over all batches is one epoch]
Learning Rate
The learning rate determines how much the weights are changed with
each example.

small learning rate = high-resolution sampling of the error curve
  ≈ danger of local optima
large learning rate = coarse sampling of the error curve
  ≈ danger of missing an optimum

[Figure: the network error plotted over the configuration of weights; a large learning rate takes coarse steps, a small learning rate takes fine steps; the curve has a local optimum and a global optimum]
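Both hyperparameters are set when compiling and fitting a Keras model; a minimal sketch on random demo data (the model, optimizer, and values are assumptions, not prescriptions):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

X = np.random.rand(1000, 10)                        # 1000 examples, 10 features (random demo data)
y = (X.sum(axis=1) > 5).astype(float)               # some binary target

model = Sequential([Dense(30, activation='relu', input_shape=(10,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer=SGD(learning_rate=0.01),    # the learning rate alpha
              loss='binary_crossentropy')
model.fit(X, y, batch_size=32, epochs=10)           # the data is split into batches of 32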
Input
A network for an NLP task can be given different types of input:
• all words
• all words without stopwords
• the words plus their POS tags
• a grammatical structure
• different pre-trained embeddings (Word2Vec, GloVe, FastText, ...)

Start simple, and think about what would (or would not) help you
as a human to solve the problem.
In practice
Designing and training neural networks can be done with libraries
such as TensorFlow or PyTorch. For example, a simple feed-forward
network in Keras (on top of TensorFlow):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential()
model.add(Embedding(len(vocab), embdim, input_length=...))  # word embedding layer
model.add(GlobalAveragePooling1D())                         # average all embeddings
model.add(Dense(32, activation='relu'))                     # one hidden layer
model.add(Dense(number_of_labels, activation='sigmoid'))    # final layer, one neuron per label
Summary: Neural Networks
Neural networks are combinations of perceptrons