Constrained Decoding
CC-BY
Fabian M. Suchanek
Unconstrained decoding

Decoder language models may produce answers that are too verbose to be of use.

Prompt: What were the professions of Josephine Baker?
LLM: Josephine Baker was an American-born French dancer and singer. She was the first black actress to star in a major motion picture.

Too verbose: we would have to do IE to get the professions!

Prompt: Write a formula for the age of Josephine Baker at her death.
LLM: Josephine Baker was born in 1906 and died in 1975, so her age was 1975-1906=69.

This is not a formula, but text!

[UNM.EDU]
Constrained decoding

Constrained decoding forces the model to reply with tokens from a given distribution.

Prompt: What were the professions of Josephine Baker?
        <Reply only with the professions from the YAGO knowledge base>
LLM: dancer, singer, actress

Prompt: Write a formula for the age of Josephine Baker at her death.
        <Reply only with a sequence of the form [0-9]+-[0-9]+=[0-9]{1,3}>
LLM: 1975-1906=69

How does that work? Constrained decoding is particularly useful for the generation of code!

[WeBuyBlack]
Language models as probability distributions

In full generality, a language model is a probability distribution over sequences of words:

P(Josephine, Baker, was, born, in) = P(Josephine) × P(Baker | Josephine) × ... = 0.00000123

When we prompt a model, we give a prefix, and we ask the model to continue the sentence with the most likely following word:

argmax_w P(w | Josephine, Baker, was, born, in) = America

We can then iteratively ask for the next words:

argmax_w P(w | Josephine, Baker, was, born, in, America) = in
argmax_w P(w | Josephine, Baker, was, born, in, America, in) = 1906

[AnotherMag.com]
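The chain-rule factorization above can be sketched with a toy model. All probabilities below are invented for illustration; a real LLM would compute them with a neural network:

```python
# Toy next-word distributions, keyed by the prefix seen so far.
# All numbers are invented for illustration.
toy_model = {
    (): {"Josephine": 0.2},
    ("Josephine",): {"Baker": 0.2, "Abady": 0.1},
    ("Josephine", "Baker"): {"was": 0.3, "emigrated": 0.2},
    ("Josephine", "Baker", "was"): {"born": 0.3, "a": 0.2},
    ("Josephine", "Baker", "was", "born"): {"in": 1.0},
    ("Josephine", "Baker", "was", "born", "in"): {"America": 1.0},
}

def sequence_probability(words):
    """P(w1,...,wn) = P(w1) * P(w2|w1) * ... (chain rule)."""
    p = 1.0
    for i, w in enumerate(words):
        p *= toy_model[tuple(words[:i])].get(w, 0.0)
    return p

def most_likely_next(prefix):
    """argmax_w P(w | prefix)."""
    dist = toy_model[tuple(prefix)]
    return max(dist, key=dist.get)

print(sequence_probability("Josephine Baker was born in".split()))  # 0.2*0.2*0.3*0.3*1.0
print(most_likely_next("Josephine Baker was born in".split()))      # America
```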
Choosing the next word: Greedy Search (Def)

Greedy search always chooses the most likely word as the next word (as seen on the last slide).

[Figure: a word lattice starting at "Josephine", branching into alternatives such as "Abady" and "Baker", then "was", "emigrated", "danced", ..., with a probability on each edge; greedy search follows the locally best edge at every step.]

Greedy result: Josephine Baker was born in America.

Greedy decoding might not find the optimal path!

[BowerBoysHistory]
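A minimal sketch of greedy decoding over such a lattice. The toy probabilities are chosen to mirror the slide's figure; a real model would supply them:

```python
# Edge probabilities loosely following the lattice on the slide.
toy_model = {
    ("Josephine",): {"Baker": 0.2, "Abady": 0.1},
    ("Josephine", "Baker"): {"was": 0.3, "emigrated": 0.2, "danced": 0.1},
    ("Josephine", "Baker", "was"): {"born": 0.3, "a": 0.2},
    ("Josephine", "Baker", "emigrated"): {"to": 0.5, "from": 0.3},
    ("Josephine", "Baker", "was", "born"): {"in": 1.0},
    ("Josephine", "Baker", "was", "born", "in"): {"America": 1.0},
    ("Josephine", "Baker", "emigrated", "to"): {"France": 1.0},
}

def greedy_decode(prefix):
    """Always extend with the single most likely next word."""
    words = list(prefix)
    while tuple(words) in toy_model:
        dist = toy_model[tuple(words)]
        words.append(max(dist, key=dist.get))  # take the locally best word
    return " ".join(words)

print(greedy_decode(["Josephine"]))  # Josephine Baker was born in America
```

Greedy takes "was" (0.3) over "emigrated" (0.2), yet the emigrated path has the higher overall probability (0.2·0.5·1 = 0.1 vs. 0.3·0.3·1·1 = 0.09) — exactly the suboptimality noted above.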
Choosing the next word: Exhaustive Search

We could theoretically enumerate all possible sentences and rank them:

Josephine Baker was born in America [0.018], Josephine Baker was a singer [0.003], Josephine Baker was an actress [0.0021], Josephine Baker emigrated to France [0.02], ...

Exhaustive search is prohibitively expensive.

[BowerBoysHistory]
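Exhaustive enumeration can be sketched by recursing over a small toy lattice (invented probabilities). With a realistic vocabulary the number of paths explodes combinatorially, which is why this is infeasible in practice:

```python
# A small lattice of possible continuations (invented probabilities).
toy_model = {
    ("Josephine",): {"Baker": 1.0},
    ("Josephine", "Baker"): {"was": 0.3, "emigrated": 0.2},
    ("Josephine", "Baker", "was"): {"born": 0.3, "a": 0.2},
    ("Josephine", "Baker", "emigrated"): {"to": 0.5},
    ("Josephine", "Baker", "was", "born"): {"in": 1.0},
    ("Josephine", "Baker", "was", "born", "in"): {"America": 1.0},
    ("Josephine", "Baker", "emigrated", "to"): {"France": 1.0},
}

def all_sentences(prefix, p=1.0):
    """Enumerate every complete path through the lattice with its probability."""
    key = tuple(prefix)
    if key not in toy_model:          # no continuation: the sentence is complete
        yield " ".join(prefix), p
        return
    for word, q in toy_model[key].items():
        yield from all_sentences(prefix + [word], p * q)

for sentence, p in sorted(all_sentences(["Josephine"]), key=lambda x: -x[1]):
    print(f"{sentence} [{p:.4f}]")
```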
Choosing the next word: Beam Search (Def)

Beam search with beam width n is breadth-first search where the queue of next nodes is restrained to the n most highly scored explored paths.

[Figure: the same word lattice as before; at each step, only the n best-scoring paths survive.]

Here with n=2, the beam evolves as follows:

[ Josephine Abady 0.1, Josephine Baker 0.2 ]
[ Josephine Baker was 0.06, Josephine Baker emigrated 0.04 ]
[ Josephine Baker was born 0.018, Josephine Baker emigrated to 0.02 ]
[ Josephine Baker was born in America 0.018, Josephine Baker emigrated to France 0.02 ]

Result: Josephine Baker emigrated to France.

[History.co.uk]
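The beam evolution above can be reproduced with a short sketch. The toy probabilities are chosen to match the beam scores on the slides; a real LLM would provide the distributions:

```python
# Toy lattice with probabilities consistent with the beam scores above.
toy_model = {
    ("Josephine",): {"Baker": 0.2},
    ("Josephine", "Baker"): {"was": 0.3, "emigrated": 0.2, "danced": 0.1},
    ("Josephine", "Baker", "was"): {"born": 0.3, "a": 0.2},
    ("Josephine", "Baker", "emigrated"): {"to": 0.5, "from": 0.3},
    ("Josephine", "Baker", "was", "born"): {"in": 1.0},
    ("Josephine", "Baker", "was", "born", "in"): {"America": 1.0},
    ("Josephine", "Baker", "emigrated", "to"): {"France": 1.0},
}

def beam_search(prefix, n=2):
    """Keep only the n best-scoring paths at every step."""
    beam = [(1.0, tuple(prefix))]
    while True:
        candidates = []
        for p, words in beam:
            dist = toy_model.get(words)
            if dist is None:                      # finished path: keep as-is
                candidates.append((p, words))
            else:
                for w, q in dist.items():
                    candidates.append((p * q, words + (w,)))
        candidates.sort(key=lambda x: -x[0])
        candidates = candidates[:n]               # prune to beam width n
        if candidates == beam:                    # no surviving path can be extended
            return candidates
        beam = candidates

best_p, best = beam_search(["Josephine"])[0]
print(" ".join(best))  # beam search finds the globally better emigrated path
```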
Properties of Beam Search

• if the beam width is n=1, beam search is greedy search
• if n=∞, beam search is exhaustive search
• if 1<n<∞, beam search often delivers results of higher quality than exhaustive search (!)

This is because beam search optimizes uniform information density, i.e., a uniform distribution of surprise (negative log-probability). Humans tend to prefer (and produce) sentences with this property.

[Meister, Cotterell, Vieira: “If Beam Search is the Answer, What was the Question?”, EMNLP 2020]
[Britannica]
Constraining Beam Search Globally

Beam search can be constrained globally, by imposing that certain words appear in the path. Here: force the appearance of the word “was”, e.g., by adding it to each exploration step.

[Figure: the word lattice with “was” added as a candidate at every exploration step.]
Constraining Beam Search Locally

Beam search can be constrained locally, by a function that takes the generated sequence so far and yields the next permitted words. Here: force the sentence to consist of alphabetically ordered words.

[Figure: the word lattice, pruned so that each permitted next word follows the previous word alphabetically.]
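Such a local constraint is just a function from the sequence so far to the permitted next words. A hypothetical sketch for the alphabetical-order constraint (the small vocabulary is invented):

```python
# Invented toy vocabulary for illustration.
VOCAB = ["Albania", "Baker", "Josephine", "born", "danced", "emigrated", "was"]

def next_tokens(sequence):
    """Permit only words that keep the sequence in alphabetical order."""
    if not sequence:
        return VOCAB
    # Compare case-insensitively against the last generated word.
    return [w for w in VOCAB if w.lower() >= sequence[-1].lower()]

print(next_tokens(["Albania", "born"]))
```

During beam search, candidate extensions outside next_tokens(sequence) are simply excluded (their probability is treated as zero).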
Constraining Beam Search to a vocabulary

Beam search can be constrained by the nextTokens function to a set of sequences of words. “Generate only country names”:

{ Antigua and Barbuda, Arab Republic of Egypt, Barbados, ...,
  Democratic Republic of Sao Tome and Principe, Democratic Republic of Timor-Leste, Democratic Republic of the Congo, ...,
  Kingdom of Bahrain, Kingdom of Belgium, ...,
  Principality of Andorra, Principality of Liechtenstein, ... }

How can we efficiently store sequences of words (that often share a prefix)?
→ Trie
Def: Trie

A trie over a set S is a pair of a boolean value and a map from S to tries.

S = { Republic, of, Germany, France, Guinea, ... }

[Figure: an example trie with nodes marked ⊕ (the sequence ending here is contained) or ⊖ (not contained), with edges such as Republic → of → the → Congo, Marshall → Islands, Albania, Angola.]

Is “Republic of the Congo” in this trie? Is the empty string in this trie?
Def: Trie

We say (w₁, ..., wₙ) ∈ T if
- n=0 and T.bool = True, or
- n>0 and w₁ ∈ T.map and (w₂, ..., wₙ) ∈ T.map[w₁]
Def: Trie

To add a sequence (w₁, ..., wₙ) to a trie T:
- if n=0, set T.bool = True
- if n>0, add w₁ ↦ empty trie to T.map (if w₁ ∉ T.map), then add (w₂, ..., wₙ) to T.map[w₁]

Example: Add “Republic” (marks the node reached via Republic as ⊕); then Add “Republic of France” (creates a France child under Republic → of and marks it ⊕).
Def: Trie

To get the next tokens of (w₁, ..., wₙ) from a trie T:
- if n=0, return T.map.keys
- if n>0, return the next tokens of (w₂, ..., wₙ) from T.map[w₁]

What are the next tokens for “Republic of”?

Can a trie contain the same sequence twice? Does a trie remember the order in which sequences were added?
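The three operations above fit in a few lines. A minimal sketch following the definition on the slides (a pair of a boolean and a map from words to tries):

```python
class Trie:
    """A trie over word sequences: a boolean plus a map from words to tries."""
    def __init__(self):
        self.bool = False     # True iff the sequence ending here is in the set
        self.map = {}         # word -> sub-trie

    def add(self, words):
        if not words:         # n = 0: mark this node as contained
            self.bool = True
        else:                 # n > 0: recurse into (and create) the child
            self.map.setdefault(words[0], Trie()).add(words[1:])

    def contains(self, words):
        if not words:
            return self.bool
        child = self.map.get(words[0])
        return child is not None and child.contains(words[1:])

    def next_tokens(self, words):
        """The words permitted after the (partial) sequence `words`."""
        if not words:
            return set(self.map)
        child = self.map.get(words[0])
        return child.next_tokens(words[1:]) if child else set()

T = Trie()
for name in ["Republic of the Congo", "Republic of France", "Republic"]:
    T.add(name.split())
print(sorted(T.next_tokens("Republic of".split())))  # ['France', 'the']
print(T.contains("Republic".split()))                # True
print(T.contains("Republic of".split()))             # False
```

This also answers the questions above: adding the same sequence twice just sets the same boolean again, and the trie keeps no record of insertion order.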
Restricting Beam Search by a Trie

With which countries did Josephine Baker collaborate during World War II?

“Baker socialized with the Germans at embassies, ministries, and night clubs, and secretly transmitted what she learned to Free France and England. Notes were written in invisible ink on Baker's sheet music and on paper notes in her underwear.”

[Wikipedia] [Wikimedia]
Restricting Beam Search by a Trie

With which countries did Josephine Baker collaborate during World War II?

We can restrict Beam Search with a nextTokens function that uses a trie:
T.addAll(“Albania”, “Algeria”, ...)

Step by step:
- At the start of the answer (“... during World War II?”), the LLM's full vocabulary (About, All, ..., Anger, ..., Zebra) is filtered down to the tokens that can begin a country name (Angola, Free, ...).
- Among the permitted tokens, the LLM chooses based on its probability distribution, producing “Free France”.
- After a complete country name, the nextTokens function allows only END and SEP (Separator); after END, it allows no token at all.
- The LLM chooses SEP, the trie constraint applies again, and the answer continues: “Free France SEP England”.
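The walkthrough above can be sketched end to end. Everything here is hypothetical: the country set is tiny, and llm_pick is a scripted stand-in for the model's probability distribution:

```python
END, SEP = "<END>", "<SEP>"

# Tiny stand-in for the trie of country names (as word tuples).
COUNTRIES = {("Free", "France"), ("England",), ("Angola",)}

def next_tokens(entity):
    """Words that may extend the current partial entity; END/SEP once complete."""
    prefix = tuple(entity)
    allowed = {c[len(prefix)] for c in COUNTRIES
               if len(c) > len(prefix) and c[:len(prefix)] == prefix}
    if prefix in COUNTRIES:
        allowed |= {END, SEP}
    return allowed

# Scripted token choices standing in for the LLM's probability distribution.
SCRIPT = iter(["Free", "France", SEP, "England", END])

def llm_pick(allowed):
    token = next(SCRIPT)     # a real LLM would pick the most probable allowed token
    assert token in allowed  # the constraint filters the model's choices
    return token

answer, entity = [], []
while True:
    token = llm_pick(next_tokens(entity))
    if token == END:
        answer.append(" ".join(entity))
        break
    if token == SEP:
        answer.append(" ".join(entity))
        entity = []
    else:
        entity.append(token)
print(answer)  # ['Free France', 'England']
```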
Summary: Constrained Decoding

• A language model is a probability distribution over sequences of tokens
• That probability distribution can be used to predict the most likely following word in a sequence
• There are several ways to do this, e.g.,
  - Exhaustive Search (generally infeasible)
  - Greedy Search (not optimal)
  - Beam Search (not optimal, but better)
• Beam Search can be constrained by a nextTokens function
• A trie is an implementation of a set of sequences, which can be used as nextTokens

Baker transferred to the Panthéon [Nice Matin] [SoulOfAmerica]

Reference: “Constrained Language Models Yield Few-Shot Semantic Parsers”, EMNLP 2021