Deep Learning:
Perceptrons
CC-BY
Fabian M. Suchanek
Overview
Perceptrons
Training a perceptron
Extending a perceptron
Activation Functions
Structured Perceptrons
Perceptrons
A perceptron is a function from Rⁿ to {0,1} with a weight vector w ∈ Rⁿ and a bias b ∈ R, that is computed as

    perceptron(x) = 1 if w·x > b
                    0 else
Perceptrons
A perceptron is a function from Rⁿ to R with a weight vector w ∈ Rⁿ and a bias b ∈ R, that is computed as

    perceptron(x) = act(w·x − b)

where act: R → R is the activation function (often act(r)=1 for r>0 and act(r)=0 otherwise).
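A minimal sketch of this definition in Python (the variable names and example weights are illustrative, not from the slides):

    import numpy as np

    def step(r):
        """A simple binary activation: 1 if r > 0, else 0."""
        return 1.0 if r > 0 else 0.0

    def perceptron(x, w, b, act=step):
        """Compute act(w·x - b)."""
        return act(np.dot(w, x) - b)

    # Example: fires only when both inputs are 1 (logical AND on {0,1} inputs)
    w, b = np.array([1.0, 1.0]), 1.5
    print(perceptron(np.array([1, 1]), w, b))  # 1.0
    print(perceptron(np.array([1, 0]), w, b))  # 0.0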
A perceptron is a linear classifier
A perceptron is a function that is computed as

    perceptron(x) = act(w·x − b)

It is thus a binary classifier (in a supervised scenario).

[Figure: data points in the plane, split by the line w·x = b into a region where perceptron(x)=1 and a region where perceptron(x)=0]
A perceptron is a linear classifier
A new point x' is classified according to the side of the line on which it falls:

    x' is on the negative side of the line ⇔ w·x' ≤ b ⇔ w·x' − b ≤ 0 ⇔ act(w·x' − b) = 0 ⇔ perceptron(x') = 0   (with a simple binary activation function)

[Figure: a new point x' falling into the region where perceptron(x')=0]
Eliminating the bias

A perceptron is a function that is computed as

    perceptron(x) = act(w·x − b)

The bias b can be eliminated by adding a new dimension to all input vectors (whose value is always 1) and adding the weight -b:

    perceptron(x) = act(〈w; -b〉 · 〈x; 1〉)

〈a;b〉 is vector concatenation. The result is now a simple dot product, without subtracting a bias.
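A minimal sketch of this trick in Python (illustrative; the variable names are not from the slides):

    import numpy as np

    def perceptron_with_bias(x, w, b):
        return 1.0 if np.dot(w, x) - b > 0 else 0.0

    def perceptron_without_bias(x_ext, w_ext):
        # No bias subtraction: a simple dot product suffices.
        return 1.0 if np.dot(w_ext, x_ext) > 0 else 0.0

    w, b = np.array([1.0, 1.0]), 1.0
    x = np.array([1.0, 0.5])

    x_ext = np.concatenate([x, [1.0]])   # 〈x; 1〉: add a dimension that is always 1
    w_ext = np.concatenate([w, [-b]])    # 〈w; -b〉: add the weight -b

    assert perceptron_with_bias(x, w, b) == perceptron_without_bias(x_ext, w_ext)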
Overview
Perceptrons
Training a perceptron
Extending a perceptron
Activation Functions
Structured Perceptrons
Def: Training a perceptron
Eliminate the bias b.
Choose a learning rate ε.
While there exists a misclassified example x, add εx to w if x is a positive example, otherwise subtract εx from w.

[Figure: training points with the separating line, which moves after each update]
Intuition: The sum of two vectors is closer to each of them.
Theorem: If the examples can be linearly separated at all, the algorithm will eventually separate them.
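A sketch of this training rule in Python (illustrative; assumes the bias has already been eliminated via the extra constant-1 dimension, and encodes the classes as +1/-1):

    import numpy as np

    def train_perceptron(examples, labels, learning_rate=0.1, max_epochs=1000):
        """examples: array of shape (n, d); labels: +1 for positive, -1 for negative examples."""
        w = np.zeros(examples.shape[1])
        for _ in range(max_epochs):
            updated = False
            for x, y in zip(examples, labels):
                predicted = 1 if np.dot(w, x) > 0 else -1
                if predicted != y:
                    # Add the scaled example if it is positive, subtract it if it is negative.
                    w += learning_rate * y * x
                    updated = True
            if not updated:
                return w          # no misclassified example left
        return w

    # Example with linearly separable data (the last component is the constant 1)
    X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -1.0, 1.0], [-2.0, -1.0, 1.0]])
    y = np.array([1, 1, -1, -1])
    w = train_perceptron(X, y)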
Overview
Perceptrons
Training a perceptron
Extending a perceptron
Activation Functions
Structured Perceptrons
A different view on perceptrons
The perceptron formula

    perceptron(x) = act(w·x)

can be written as

    perceptron(x) = act(Σᵢ yᵢ (xᵢ·x))

where the sum ranges over the (scaled, possibly repeated) training examples xᵢ that were used during training, and yᵢ is the class of xᵢ (+1 or -1): a sum of all training example labels, weighted by similarity to x.

[Figure: a new point x among the training points, with a: a negative training example and b: a positive training example]

With only two training examples, a negative example a and a positive example b, this is

    perceptron(x) = act((b·x) − (a·x))

"Which similarity is bigger: to a or to b?"
Perceptrons & SVMs
The perceptron formula perceptron(x) = act(w·x) can be written as

    perceptron(x) = act(Σᵢ yᵢ (xᵢ·x))

where xᵢ is the i-th training example and yᵢ its class.

A support vector machine (SVM) works in this way, but chooses the examples that separate the training data best.
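A sketch of this "similarity" view in Python (the names and data are illustrative):

    import numpy as np

    def dual_perceptron(x, train_x, train_y, act=lambda r: 1 if r > 0 else 0):
        """Classify x by a sum of training labels, weighted by similarity (dot product) to x."""
        score = sum(y_i * np.dot(x_i, x) for x_i, y_i in zip(train_x, train_y))
        return act(score)

    a = np.array([-1.0, -1.0])   # a: negative training example (class -1)
    b = np.array([2.0, 2.0])     # b: positive training example (class +1)
    x = np.array([1.0, 1.5])     # "Which similarity is bigger: to a or to b?"
    print(dual_perceptron(x, [a, b], [-1, +1]))   # -> 1 (x is more similar to b)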
Problems with Perceptrons
The perceptron can learn only linearly separable models.
In particular, it cannot learn XOR.
If the training data is not linearly separable, the perceptron will find an arbitrarily bad solution.

[Figure: a dataset that no single line can separate]
Mapping to another space
If we map the examples to another space, e.g., to a higher-dimensional space by a mapping φ, then many datasets become linearly separable, among them XOR.

[Figure: the mapped examples in the higher-dimensional space, where the points below a plane are separated from the points above it]
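As an illustration (the concrete mapping here is a classical choice, not necessarily the one on the slide): adding the product of the two coordinates as a third dimension makes XOR linearly separable:

    import numpy as np

    def phi(x):
        # Map (x1, x2) to (x1, x2, x1*x2): one extra, non-linear dimension.
        return np.array([x[0], x[1], x[0] * x[1]])

    xor_points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    # In the mapped space, the plane with normal w and bias b separates the two classes.
    w, b = np.array([1.0, 1.0, -2.0]), 0.5
    for x, label in xor_points:
        prediction = 1 if np.dot(w, phi(np.array(x))) - b > 0 else 0
        print(x, label, prediction)   # the prediction matches the XOR label for all four points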
Mapping to another space

If we map the examples to another space by a mapping φ, then the perceptron function (in alternative form) becomes

    perceptron(x) = act(Σᵢ yᵢ (φ(xᵢ)·φ(x)))

Small trick: since φ only ever appears in the form φ(xᵢ)·φ(x), we do not need φ at all!
What we need is only the “kernel” k(xᵢ, x) = φ(xᵢ)·φ(x).
Kernels
A kernel for a function φ is a function k such that k(x, y) = φ(x)·φ(y).

A perceptron with a kernel simulates a mapping of the training data to a higher-dimensional space, without incurring the inefficiency of actually mapping the data.

Cover’s theorem says: given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space.
Support vector machines (SVMs) also use kernels.
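A sketch of kernel-based prediction in Python (illustrative; the polynomial kernel below is one common choice, not necessarily the slides'):

    import numpy as np

    def poly_kernel(x, y, degree=2):
        # k(x,y) = (x·y + 1)^degree is a kernel for a polynomial feature map φ.
        return (np.dot(x, y) + 1) ** degree

    def kernel_perceptron(x, train_x, train_y, kernel=poly_kernel):
        """Classify x using similarities k(x_i, x) instead of explicit feature maps."""
        score = sum(y_i * kernel(x_i, x) for x_i, y_i in zip(train_x, train_y))
        return 1 if score > 0 else 0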
Overview
Perceptrons
Training a perceptron
Extending a perceptron
Activation Functions
Structured Perceptrons
Def: Activation functions
• Step function: step(x) = x>0 ? 1 : 0
  + simple to compute
  - not differentiable
• Rectified Linear Unit (ReLU): relu(x) = max(0, x)
  + easy to compute
  - not differentiable
• Sigmoid function: sigmoid(x) = 1/(1+e^(-x))
  + differentiable
  - not centered
• Tanh: tanh(x) = (e^x − e^(-x))/(e^x + e^(-x))
  + differentiable
  + centered
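A sketch of the four activation functions in Python (standard definitions):

    import numpy as np

    def step(x):
        return np.where(x > 0, 1.0, 0.0)

    def relu(x):
        return np.maximum(0.0, x)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return np.tanh(x)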
ReLU
The Rectified Linear Unit (ReLU) is the following activation function:

    relu(x) = max(0, x)

It can be made differentiable by choosing a ∈ [0,1] and setting the derivative at 0 to a:

    relu'(x) = x>0 ? 1 : (x=0 ? a : 0)

ReLU has shown superior performance in some tasks, and is hence often the activation function of choice.
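A sketch of this subderivative in Python (a is a free parameter in [0,1], not fixed by the slides):

    def relu_derivative(x, a=0.5):
        if x > 0:
            return 1.0
        if x < 0:
            return 0.0
        return a   # chosen value for the non-differentiable point x = 0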
Overview
Perceptrons
Training a perceptron
Extending a perceptron
Activation Functions
Structured Perceptrons
Reminder: CRFs
A conditional random field (CRF) with identical potentials is a probability distribution of the form

    P(y | x) ∝ Πᵢ φ(yᵢ₋₁, yᵢ, x, i)

... which has the same max as a HMM, if φ(yᵢ₋₁, yᵢ, x, i) = P(yᵢ | yᵢ₋₁) · P(xᵢ | yᵢ).

Let us consider feature functions fⱼ(yᵢ₋₁, yᵢ, x, i). Each has a weight wⱼ. We define

    φ(yᵢ₋₁, yᵢ, x, i) = exp(Σⱼ wⱼ fⱼ(yᵢ₋₁, yᵢ, x, i))

If the weights form a vector w, and the functions form a vector f:

    φ(yᵢ₋₁, yᵢ, x, i) = exp(w · f(yᵢ₋₁, yᵢ, x, i))

To find argmax_y P(y | x), we maximize:

    Σᵢ w · f(yᵢ₋₁, yᵢ, x, i)
The Structured Perceptron
The Structured Perceptron is an algorithm that, like a CRF, is given an input x and feature functions f with weights w, and it computes

    argmax_y  w · f(x, y)

Example:
    Input: 〈I, run, fast〉
    Output space: 〈N,ADV,N〉, 〈N,V,V〉, 〈PR,V,ADV〉

[Figure: the feature vectors of the three candidate outputs, compared to the weight vector w]

The feature vector of 〈PR,V,ADV〉 is closest to w => it is the output.
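A sketch of this prediction step in Python, with a brute-force argmax over a small output space (the feature function and weights are illustrative, not the ones from the slide):

    import numpy as np

    def predict(x, output_space, feature_vector, w):
        """Return the output y in output_space that maximizes w·f(x, y)."""
        return max(output_space, key=lambda y: np.dot(w, feature_vector(x, y)))

    # e.g. predict(("I", "run", "fast"), [("N","ADV","N"), ("N","V","V"), ("PR","V","ADV")], f, w)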
Features of the Structured Perceptron
In practice, each feature function is a sum over the positions in y:

    fⱼ(x, y) = Σᵢ fⱼ(yᵢ₋₁, yᵢ, x, i)

...which means that the structured perceptron maximizes:

    Σᵢ w · f(yᵢ₋₁, yᵢ, x, i)

... which is exactly what a CRF computes.

Interesting: Since every factor depends only on its predecessor, this quantity can be maximized by the Viterbi Algorithm, running in O(m·k²), where m is the length of y, and k is the number of different values that each component of y can take.

[Michael Collins: “Lecture 4, COMS E6998-3: The Structured Perceptron”]
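A compact sketch of Viterbi decoding for this maximization (illustrative; score(prev, cur, x, i) plays the role of w·f(yᵢ₋₁, yᵢ, x, i), with prev=None at the first position):

    def viterbi(x, tags, score):
        """Return the tag sequence y maximizing sum_i score(y[i-1], y[i], x, i)."""
        m = len(x)
        best = [{t: score(None, t, x, 0) for t in tags}]   # best score of a sequence ending in tag t
        back = [{}]                                         # backpointers
        for i in range(1, m):
            best.append({})
            back.append({})
            for t in tags:
                p = max(tags, key=lambda p: best[i-1][p] + score(p, t, x, i))
                best[i][t] = best[i-1][p] + score(p, t, x, i)
                back[i][t] = p
        # Backtrack from the best final tag; O(m·k²) overall.
        last = max(tags, key=lambda t: best[m-1][t])
        y = [last]
        for i in range(m - 1, 0, -1):
            last = back[i][last]
            y.append(last)
        return list(reversed(y))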
Features of the Structured Perceptron
The structured perceptron maximizes:

    Σᵢ w · f(yᵢ₋₁, yᵢ, x, i)

If we choose f(yᵢ₋₁, yᵢ, x, i) = 〈log P(yᵢ | yᵢ₋₁), log P(xᵢ | yᵢ)〉 and w = 〈1, 1〉, we can simulate an HMM tagging:

Input 〈I, run, fast〉. Output space of possible y:

[Figure: the candidate tag sequences 〈N,ADV,N〉, 〈N,V,V〉, 〈PR,V,ADV〉, evaluated at the positions i=1, i=2, i=3]

〈PR,V,ADV〉 is always closest to w => it is the output.
Features of the Structured Perceptron
The structured perceptron maximizes:

    Σᵢ w · f(yᵢ₋₁, yᵢ, x, i)
The features can be HMM features, but they can also be other features:
• a capitalized word is more likely to be a proper name
• a number can be tagged as a number right away
• etc.
Training the Structured Perceptron
The structured perceptron differs from a CRF mainly in its learning algorithm to find w:

Given
• a number T of iterations
• training examples (x₁,y₁), ..., (xₙ,yₙ)
• a vector of feature functions f

Initialize w = 0. Iterate for t=1...T, for i=1...n
• z = argmax_y w · f(xᵢ, y)    // Using Viterbi
• if z ≠ yᵢ, then
  o w = w + f(xᵢ, yᵢ)
  o w = w − f(xᵢ, z)
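A sketch of this training loop in Python (illustrative; uses a brute-force argmax over an enumerable output space instead of Viterbi, and assumes a user-supplied feature_vector(x, y) function):

    import numpy as np

    def train_structured_perceptron(examples, output_space, feature_vector, T=10):
        """examples: list of (x, y) pairs; output_space(x): candidate outputs for x."""
        w = np.zeros(len(feature_vector(*examples[0])))
        for _ in range(T):
            for x, y in examples:
                # Predict the best-scoring output under the current weights.
                z = max(output_space(x), key=lambda cand: np.dot(w, feature_vector(x, cand)))
                if z != y:
                    # Move the weights toward the gold output and away from the prediction.
                    w += feature_vector(x, y) - feature_vector(x, z)
        return w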
Using the Structured Perceptron
Theorem: if a training set is separable with margin δ, the structured perceptron makes at most R²/δ² mistakes, where R is a constant such that ‖f(xᵢ, yᵢ) − f(xᵢ, z)‖ ≤ R for all examples xᵢ and all outputs z. Thus, the training converges.

Even if the training set is not separable, the structured perceptron algorithm can find an approximate solution.

Structured Perceptrons have been used for POS Tagging, Parsing, and Machine Translation.
[Michael Collins: “Discriminative Training Methods for Hidden Markov Models”, ACL 2002]
[Graham Neubig: “NLP Programming Tutorial 11 - The Structured Perceptron”]
[Michael Collins: “Lecture 4, COMS E6998-3: The Structured Perceptron”]
Summary: Perceptron
The perceptron is a function from Rⁿ to R, which uses a weight vector w ∈ Rⁿ, a bias b ∈ R, and an activation function act: R → R:

    perceptron(x) = act(w·x − b)