Deep Learning:
Perceptrons
CC-BY
Fabian M. Suchanek
Overview

Perceptrons
Training a perceptron
Extending a perceptron
Activation Functions
Structured Perceptrons
Perceptrons

A perceptron is a function that is computed as

    perceptron(x) = act(w · x − b)

where x is the input vector, w is a vector of weights, b is a bias value, and act is an activation function (for now, the step function, which returns 1 for positive arguments and 0 otherwise).
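As a minimal sketch of this definition in Python (the names weights and bias, and the concrete example, are illustrative assumptions, not part of the slides):

import numpy as np

def step(z):
    """Step activation: 1 for positive arguments, 0 otherwise."""
    return 1 if z > 0 else 0

def perceptron(x, weights, bias):
    """Compute act(w . x - b) with the step function as activation."""
    return step(np.dot(weights, x) - bias)

# Illustrative example: fires iff x1 + x2 > 1 (logical AND on {0,1} inputs)
w = np.array([1.0, 1.0])
b = 1.0
print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([0, 1]), w, b))  # 0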
A perceptron is a linear classifier

A perceptron is a function that is computed as

    perceptron(x) = act(w · x − b)

It is thus a binary classifier (in a supervised scenario).

[Figure: training points in the plane; the points with perceptron(x)=1 lie on one side of a separating line, the points with perceptron(x)=0 on the other.]
A perceptron is a linear classifier

    perceptron(x) = 1
⇔  act(w · x − b) = 1
⇔  w · x − b > 0
⇔  x lies on the positive side of the hyperplane w · x = b

... and analogously perceptron(x) = 0 for points on the other side (with a simple binary activation function such as the step function).

[Figure: the same picture for points x'; again a straight line separates the two classes.]
Eliminating the bias

A perceptron is a function that is computed as

    perceptron(x) = act(w · x − b)

The bias b can be eliminated by adding a new dimension to all input vectors (whose value is always 1) and adding the weight −b:

    x' = ⟨x ; 1⟩,   w' = ⟨w ; −b⟩   ⟹   perceptron(x) = act(w' · x')

⟨a ; b⟩ is vector concatenation.
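A small Python sketch of this trick (all names are illustrative):

import numpy as np

def eliminate_bias(x, w, b):
    """Return the augmented vectors x' = <x ; 1> and w' = <w ; -b>."""
    return np.append(x, 1.0), np.append(w, -b)

# Illustrative check that the perceptron's argument is unchanged
x = np.array([2.0, 3.0])
w = np.array([0.5, -1.0])
b = 0.25
x_prime, w_prime = eliminate_bias(x, w, b)
assert np.isclose(np.dot(w, x) - b, np.dot(w_prime, x_prime))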
Def: Training a perceptron

Eliminate the bias b.
Choose a learning rate η.
While there exists a misclassified example x:
    add η·x to the weight vector w if x is a positive example,
    otherwise subtract η·x from w.

[Figure: training points and the separating hyperplane defined by the current weight vector.]
Intuition: The sum of two vectors is closer to each of them. Adding a misclassified positive example x to w therefore makes w more similar to x, so that w · x grows and x moves towards the positive side (and analogously for negative examples).

Theorem (Perceptron Convergence Theorem): If the examples can be linearly separated at all, they will eventually be.

[Figure: the separating hyperplane moves step by step until all examples are classified correctly.]
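A minimal Python sketch of this training procedure, assuming labels 1/0 and inputs with the bias already eliminated (the AND example at the end is an illustration, not from the slides):

import numpy as np

def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    """Perceptron training as described above.
    X: input vectors with the bias already eliminated (a constant 1 appended).
    y: labels, 1 for positive examples and 0 for negative ones."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        misclassified = False
        for x, label in zip(X, y):
            prediction = 1 if np.dot(w, x) > 0 else 0
            if prediction != label:
                misclassified = True
                # add eta*x for positive examples, subtract it for negative ones
                w += learning_rate * x if label == 1 else -learning_rate * x
        if not misclassified:
            break   # all examples are classified correctly
    return w

# Example: learn logical AND (bias dimension appended as a constant 1)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([1 if np.dot(w, x) > 0 else 0 for x in X])  # [0, 0, 0, 1]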
A different view on perceptrons

The perceptron formula perceptron(x) = act(w · x) can be written as

    perceptron(x) = act( η · Σ_i y_i · ⟨x_i, x⟩ )

where x_i is the i-th misclassified example encountered during training, and y_i ∈ {+1, −1} its class. This is because the weight vector produced by the training algorithm is just the sum of the (signed and scaled) misclassified examples.

In the simplest case, with one negative training example a and one positive training example b, the perceptron asks: "Which similarity is bigger: the one to a or the one to b?"

    perceptron(x) = act( ⟨b, x⟩ − ⟨a, x⟩ )

[Figure: a (negative training example), b (positive training example), and a query point x.]
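A tiny illustration of this view in Python (the two training examples are invented):

import numpy as np

a = np.array([0.0, 1.0])    # negative training example (invented)
b = np.array([2.0, 2.0])    # positive training example (invented)
w = b - a                   # weight vector after one update per example

def classify(x):
    # w . x = <b, x> - <a, x>: "which similarity is bigger, to a or to b?"
    return 1 if np.dot(w, x) > 0 else 0

print(classify(np.array([3.0, 1.0])))   # 1: more similar (in dot product) to b
print(classify(np.array([-1.0, 0.5])))  # 0: more similar to a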
Perceptrons & SVMs

In its alternative form, the perceptron thus classifies a point x by comparing its similarity (dot product) to the misclassified training examples.

A support vector machine (SVM) works in this way, but chooses the examples that separate the training data best (the support vectors).

[Figure: the points a and b with a separating hyperplane between them.]
Problems with Perceptrons

The perceptron can learn only linearly separable models. In particular, it cannot learn XOR.

If the training data is not linearly separable, the perceptron will find an arbitrarily bad solution.

[Figure: the four XOR points, which no straight line can separate.]
Mapping to another space

If we map the examples to another space, e.g., with a function φ that adds new dimensions computed from the existing ones, then many datasets become linearly separable, among them XOR.

[Figure: the XOR points mapped to three dimensions; one class lies below the separating plane, the other above it.]
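As an illustration in Python, one possible mapping (an assumption here, not necessarily the one in the original figure) adds the product of the two coordinates as a third dimension; under it, XOR becomes linearly separable:

import numpy as np

# XOR: four points in {0,1}^2, labels 0/1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def phi(x):
    """Map (x1, x2) to (x1, x2, x1*x2) -- an illustrative choice of mapping."""
    return np.array([x[0], x[1], x[0] * x[1]])

# In the mapped space, the plane x1 + x2 - 2*(x1*x2) = 0.5 separates the classes:
w, b = np.array([1.0, 1.0, -2.0]), 0.5
for x, label in zip(X, y):
    score = np.dot(w, phi(x)) - b
    print(x, label, 1 if score > 0 else 0)   # predictions match the labels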
Mapping to another space

If we map the examples to another space with a function φ, then the perceptron function (in alternative form) becomes

    perceptron(x) = act( η · Σ_i y_i · ⟨φ(x_i), φ(x)⟩ )

Small trick: Since φ only ever appears in the form ⟨φ(x), φ(y)⟩, we do not need φ at all! What we need is only the "kernel"

    k(x, y) = ⟨φ(x), φ(y)⟩
Kernels

A kernel for a mapping φ is a function k such that k(x, y) = ⟨φ(x), φ(y)⟩.

A perceptron with a kernel simulates a mapping of the training data to a higher-dimensional space, without incurring the inefficiency of actually mapping the data.

Cover's theorem says: given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space.

Support vector machines (SVMs) also use kernels.
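A compact sketch of a kernel perceptron in Python (the quadratic kernel and all names are illustrative assumptions; labels are +1/−1):

import numpy as np

def quadratic_kernel(x, y):
    """k(x, y) = (x . y + 1)^2, corresponding to an implicit mapping phi
    into a space containing all monomials of degree <= 2."""
    return (np.dot(x, y) + 1.0) ** 2

def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Dual perceptron training: alpha[i] counts how often example i
    was misclassified."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            score = sum(alpha[j] * y[j] * kernel(X[j], x_i) for j in range(len(X)))
            if y_i * score <= 0:          # misclassified
                alpha[i] += 1.0
    return alpha

def predict(x, X, y, alpha, kernel):
    score = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if score > 0 else -1

# XOR, with labels +1 / -1, becomes learnable with this kernel
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
alpha = train_kernel_perceptron(X, y, quadratic_kernel)
print([predict(x, X, y, alpha, quadratic_kernel) for x in X])  # [-1, 1, 1, -1]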
Def: Activation functions

The activation function act turns the weighted sum w · x into the output of the perceptron. Common choices:

• Step function: step(x) = x > 0 ? 1 : 0
  + simple to compute
  - not differentiable

• Rectified Linear Unit (ReLU): relu(x) = max(0, x)
  + easy to compute
  - not differentiable (at 0)

• Sigmoid function: sigmoid(x) = 1 / (1 + e^(-x))
  + differentiable
  - not centered (its values are always positive)

• Tanh: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
  + differentiable
  + centered
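The four activation functions in Python (a straightforward sketch):

import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, relu, sigmoid, tanh):
    print(f.__name__, f(z))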
ReLU

The Rectified Linear Unit (ReLU) is the following activation function:

    relu(x) = max(0, x)

It can be made differentiable by choosing a ∈ [0, 1] and setting the derivative at 0 to a:

    relu'(x) = x < 0 ? 0 : (x > 0 ? 1 : a)

ReLU has shown superior performance in some tasks, and is hence often the activation function of choice.
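A small sketch of this convention in Python (the parameter name a follows the slide; the default value 0.5 is an arbitrary choice):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x, a=0.5):
    """0 for negative inputs, 1 for positive inputs, a at exactly 0."""
    return np.where(x < 0, 0.0, np.where(x > 0, 1.0, a))

print(relu_derivative(np.array([-1.0, 0.0, 2.0])))  # [0.  0.5 1. ]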
Reminder: CRFs

A conditional random field (CRF) with identical potentials is a probability distribution of the form

    P(y | x) = (1/Z(x)) · Π_i φ(y_{i-1}, y_i, x)

... which has the same max as an HMM, if φ(y_{i-1}, y_i, x) = P(y_i | y_{i-1}) · P(x_i | y_i).

Let us consider feature functions f_j(y_{i-1}, y_i, x, i). Each has a weight w_j. We define

    φ(y_{i-1}, y_i, x) = exp( Σ_j w_j · f_j(y_{i-1}, y_i, x, i) )

If the weights form a vector w, and the functions form a vector f:

    φ(y_{i-1}, y_i, x) = exp( w · f(y_{i-1}, y_i, x, i) )

To find argmax_y P(y | x), we maximize:

    Σ_i w · f(y_{i-1}, y_i, x, i)

(See the lectures on conditional random fields, named-entity recognition and classification, and POS tagging.)
The Structured Perceptron

The Structured Perceptron is an algorithm that, like a CRF, is given an input x and feature functions f with weights w, and it computes

    argmax_y  w · f(x, y)

Example (POS tagging):

    Input x:        ⟨I, run, fast⟩
    Output space:   tag sequences such as ⟨N, ADV, N⟩, ⟨N, V, V⟩, ⟨PR, V, ADV⟩, ...
    Features:       a feature vector f(x, y) for each candidate output y
    Weights:        the learned vector w

Among the candidates, ⟨PR, V, ADV⟩ has the feature vector closest to w (the largest dot product w · f(x, y)) => it is the output.
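A toy Python sketch of this selection step (the candidate outputs, feature functions, and weights below are invented for illustration):

import numpy as np

# Hypothetical candidate tag sequences for the input <I, run, fast>
candidates = [("N", "ADV", "N"), ("N", "V", "V"), ("PR", "V", "ADV")]

def features(x, y):
    """Toy feature vector: counts of invented indicator features."""
    return np.array([
        sum(tag == "PR" for tag in y),            # number of pronouns
        sum(tag == "V" for tag in y),             # number of verbs
        1.0 if y[0] == "PR" else 0.0,             # sentence starts with a pronoun
    ])

x = ("I", "run", "fast")
w = np.array([1.0, 0.5, 2.0])                     # invented weights

# The structured perceptron outputs the candidate maximizing w . f(x, y)
best = max(candidates, key=lambda y: np.dot(w, features(x, y)))
print(best)   # ('PR', 'V', 'ADV')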
Features of the Structured Perceptron

In practice, each feature function is a sum over the positions in y:

    f_j(x, y) = Σ_i f_j(y_{i-1}, y_i, x, i)

... which means that the structured perceptron maximizes:

    Σ_i w · f(y_{i-1}, y_i, x, i)

... which is exactly what a CRF computes.

Interesting: Since every factor depends only on its predecessor, this quantity can be maximized by the Viterbi Algorithm, running in O(m · k²), where m is the length of y and k is the number of possible tags.
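A compact Python sketch of Viterbi decoding for this setting (the score function stands in for w · f(y_{i-1}, y_i, x, i); the toy scores at the end are invented):

import numpy as np

def viterbi(words, tags, score):
    """Find the tag sequence y maximizing sum_i score(y[i-1], y[i], words, i).
    Runs in O(m * k^2) for m words and k tags; prev_tag is None at position 0."""
    m, k = len(words), len(tags)
    best = np.full((m, k), -np.inf)      # best[i][t]: best score of a prefix ending in tag t
    back = np.zeros((m, k), dtype=int)   # backpointers for reconstructing the sequence
    for t in range(k):
        best[0][t] = score(None, tags[t], words, 0)
    for i in range(1, m):
        for t in range(k):
            scores = [best[i-1][p] + score(tags[p], tags[t], words, i) for p in range(k)]
            back[i][t] = int(np.argmax(scores))
            best[i][t] = max(scores)
    # follow the backpointers from the best final tag
    seq = [int(np.argmax(best[m-1]))]
    for i in range(m - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return [tags[t] for t in reversed(seq)]

# Toy usage with a hand-written score function (purely illustrative)
def toy_score(prev, tag, words, i):
    emission = {("I", "PR"): 2, ("run", "V"): 2, ("fast", "ADV"): 2}.get((words[i], tag), 0)
    transition = {("PR", "V"): 1, ("V", "ADV"): 1}.get((prev, tag), 0)
    return emission + transition

print(viterbi(["I", "run", "fast"], ["PR", "V", "ADV", "N"], toy_score))  # ['PR', 'V', 'ADV']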