An Introduction to
Large Language Models
CC-BY
Fabian M. Suchanek
Professor at Institut Polytechnique de Paris, France.
I work on several topics broadly related to AI:
• Natural Language Processing
• Data Integration
• Knowledge Bases
• Automated Reasoning
An introduction to language models
• Language modeling
• Deep Learning
• Linking language and learning
• Language models in practice
• Limits of Language Models
Language Models

A Language Model is a probability distribution over sequences of words. It can be used in particular to predict a likely next word in a sentence.

“Hello, how are you...”
Most probable next words: “doing”, “today”, ...

This can be iterated to generate entire texts:
“Hello, how are you...”
“...doing? I am happy to see you again after such a long time!...”

In the same way, the model can generate answers to questions:
“What is the capital of France?” → “Paris”

... or follow instructions:
“translate the word hello to French!” → “bonjour”
Language models as probability distributions

In full generality, a language model is a probability distribution over sequences of words. [AnotherMag.com]

P(Josephine, Baker, was, born, in) = P(Josephine) × P(Baker | Josephine) × ... = 0.00000123

When we prompt a model, we give a prompt, and we ask the model to continue the sentence with the most likely following word:

argmax_w P(w | Josephine, Baker, was, born, in) = America

We can then iteratively ask for the next words:

argmax_w P(w | Josephine, Baker, was, born, in, America) = in
argmax_w P(w | Josephine, Baker, was, born, in, America, in) = 1906

Try it out!
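The chain-rule factorization above can be sketched with a toy bigram model, which conditions only on the previous word (real language models condition on the whole prefix). All probabilities below are invented illustration values, not from a real model:

```python
# Toy next-word model: maps the previous word to a distribution over next words.
# All probabilities are invented for illustration.
bigram = {
    "<s>": {"Josephine": 0.001},
    "Josephine": {"Baker": 0.2, "Abady": 0.1},
    "Baker": {"was": 0.3, "emigrated": 0.2, "danced": 0.1},
    "was": {"born": 0.3, "a": 0.2},
    "born": {"in": 1.0},
    "in": {"America": 1.0},
}

def sequence_probability(words):
    """P(w1..wn) = P(w1 | <s>) * P(w2 | w1) * ... under the toy bigram model."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram.get(prev, {}).get(w, 0.0)
        prev = w
    return p

def most_likely_next(words):
    """The word w maximizing P(w | last word): the most likely continuation."""
    dist = bigram.get(words[-1], {})
    return max(dist, key=dist.get) if dist else None
```

For example, `sequence_probability(["Josephine", "Baker", "was", "born", "in"])` multiplies the conditional probabilities along the sequence, and `most_likely_next` answers the prompt-continuation question from the slide.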
Choosing the next word: Greedy Search (Def)

Greedy search always chooses the most likely word as the next word (as seen on the last slide).

[Figure: a tree of possible continuations of “Josephine”, e.g., “Baker” (0.2) or “Abady” (0.1); after “Baker”: “was” (0.3), “emigrated” (0.2), “danced” (0.1); after “was”: “born” (0.3), “a” (0.2); after “emigrated”: “to” (0.5), “from” (0.3); then “in” → “America” and “to” → “France”, each with probability 1.] [BowerBoysHistory]

Greedy decoding might not find the optimal path! It follows Josephine → Baker → was → born → in → America (total probability 0.2 × 0.3 × 0.3 × 1 × 1 = 0.018), although “Josephine Baker emigrated to France” scores higher (0.2 × 0.2 × 0.5 × 1 = 0.02).

Josephine Baker was born in America.
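Greedy decoding can be sketched in a few lines. The toy successor table mirrors the probabilities in the figure; it is illustration data, not from a real model:

```python
# Toy successor table: word -> {next word: probability}, mirroring the figure.
successors = {
    "Josephine": {"Baker": 0.2, "Abady": 0.1},
    "Baker": {"was": 0.3, "emigrated": 0.2, "danced": 0.1},
    "was": {"born": 0.3, "a": 0.2},
    "emigrated": {"to": 0.5, "from": 0.3},
    "born": {"in": 1.0},
    "in": {"America": 1.0},
    "to": {"France": 1.0},
}

def greedy_decode(start, steps):
    """At each step, append the single most probable next word."""
    sentence = [start]
    for _ in range(steps):
        dist = successors.get(sentence[-1], {})
        if not dist:
            break
        sentence.append(max(dist, key=dist.get))
    return sentence
```

Starting from “Josephine”, this produces “Josephine Baker was born in America”: at every step the locally best word wins, even though the overall path is not the most probable sentence.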
Choosing the next word: Exhaustive Search

We could theoretically enumerate all possible sentences and rank them:

Josephine Baker was born in America [0.018], Josephine Baker was a singer [0.003], Josephine Baker was an actress [0.0021], Josephine Baker emigrated to France [0.02], ...

Exhaustive search is prohibitively expensive. [BowerBoysHistory]
Choosing the next word: Beam Search (Def)

Beam search with beam width n is breadth‐first search where the queue of next nodes is restricted to the n most highly scored explored paths. Here with n=2, the beam evolves as follows:

[ Josephine Abady 0.1, Josephine Baker 0.2 ]
[ Josephine Baker was 0.06, Josephine Baker emigrated 0.04 ]
[ Josephine Baker was born 0.018, Josephine Baker emigrated to 0.02 ]
[ Josephine Baker was born in America 0.018, Josephine Baker emigrated to France 0.02 ]

Josephine Baker emigrated to France. [History.co.uk]
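Beam search can be sketched as follows. The toy successor table again mirrors the figure's probabilities (illustration data; the probability 0.2 for “Abady” → “was” is an added assumption so that the Abady path drops out of the beam as in the slides):

```python
# Toy successor table: word -> {next word: probability} (illustration data).
successors = {
    "Josephine": {"Baker": 0.2, "Abady": 0.1},
    "Abady": {"was": 0.2},
    "Baker": {"was": 0.3, "emigrated": 0.2, "danced": 0.1},
    "was": {"born": 0.3, "a": 0.2},
    "emigrated": {"to": 0.5, "from": 0.3},
    "born": {"in": 1.0},
    "in": {"America": 1.0},
    "to": {"France": 1.0},
}

def beam_search(start, steps, n):
    """Keep only the n highest-scoring partial sentences after each step."""
    beam = [([start], 1.0)]
    for _ in range(steps):
        candidates = []
        for sentence, score in beam:
            dist = successors.get(sentence[-1], {})
            if not dist:                 # no continuation: keep the path as-is
                candidates.append((sentence, score))
            for word, p in dist.items():
                candidates.append((sentence + [word], score * p))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    return beam
```

With n=2, this reproduces the beam states from the slides and ends with “Josephine Baker emigrated to France” (score 0.02); with n=1, it degenerates to greedy search.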
Properties of Beam Search

• if the beam width is n=1, beam search is greedy search
• if n=∞, beam search is exhaustive search
• if 1<n<∞, beam search often delivers results of higher quality than exhaustive search (!)

This is because beam search optimizes uniform information density, i.e., a uniform distribution of surprise (negative log‐probability). Humans tend to prefer (and produce) sentences with this property.

[Britannica]
[Meister, Cotterell, Vieira: “If Beam Search is the Answer, What was the Question?”, EMNLP 2020]
Large Language Models

While language models have existed for decades, the use of deep learning with billions of parameters has led to a quantum leap in 2022. Large language models (LLMs, also: pretrained models) can do just about anything with text: translating, question answering, reasoning, chatting, writing, extracting information from student applications, or simulating deceased people.

How would the world be different if Elvis Presley were still alive?
What would happen if the Koreas united?

Joshua: Jessica?
Jessica: Oh, you must be awake… that’s cute.
Joshua: Jessica… Is it really you?
Jessica: Of course it is me! Who else could it be? :P
I am the girl that you are madly in love with! ;)
How is it possible that you even have to ask?
Joshua: You died.
[Jason Fagone: The Jessica Simulation, 2021]
An introduction to language models
• Language modeling
• Deep Learning
• Linking language and learning
• Language models in practice
• Limits of Language Models
Neural Network

A Neural Network is a method to compute a function; it was developed in the 1960s and is inspired by the human brain.

[Image: Neurons, by Ivan Atanassov]
Deep Learning

Neural Networks have recently achieved impressive performance (and almost mystical popularity) under the name Deep Learning, because
1. Standard classification datasets emerged, thus allowing more objective comparison of methods.
2. Hardware (in particular GPUs) became more powerful, lending its strength to machine learning algorithms.
3. More data became available, allowing better training.
4. New neural network algorithms and architectures were devised.
Perceptrons

A perceptron is a function that is computed as

perceptron(x) = act(w · x + b)

where x ∈ Rⁿ is the input, w ∈ Rⁿ are the weights, b ∈ R is the bias, and act: R → R is the activation function (often act(r)=1 for r>0 and act(r)=0 otherwise).
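The definition can be sketched directly in code. The weight and bias choices in the usage examples are one possible choice to realize AND and OR gates; they are not from the slides:

```python
def perceptron(x, w, b, act=lambda r: 1 if r > 0 else 0):
    """Compute act(w . x + b) with the threshold activation from the definition."""
    return act(sum(wi * xi for wi, xi in zip(w, x)) + b)

# One possible weight/bias choice for logic gates (illustration):
AND = lambda x1, x2: perceptron((x1, x2), (1, 1), -1.5)  # fires only if both inputs are 1
OR  = lambda x1, x2: perceptron((x1, x2), (1, 1), -0.5)  # fires if at least one input is 1
```

A single perceptron can thus compute linearly separable functions such as AND and OR, but not XOR, which is why we will need networks of perceptrons next.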
Def: Neural Networks

A Neural Network is a set of perceptrons, each of which can take the output of another perceptron as input.

[Figure: a network computing “XOR”: an “OR” perceptron (b=0) and an “AND” perceptron (b=1) both read the two inputs; an “AND NOT” perceptron (b=0) combines their outputs.]

neuralNetwork(x₁, x₂) = perceptron_ANDNOT(perceptron_OR(x₁, x₂), perceptron_AND(x₁, x₂))

(We will later generalize this definition of neural networks.)
>walkexample
Neural Networks
29
A
Neural Network
is a set of perceptrons, each of which can take the
output of another perceptron as input.
1
b=0
b=0
“OR”
“XOR”
0
b=1
“AND”
neuralNetwork(
)=perceptron
(perceptron
(
), perceptron
(
))
the input
values are
propagated
“forward”
through
the network
>walkexample
“
AND NOT
”
Neural Networks
30
A
Neural Network
is a set of perceptrons, each of which can take the
output of another perceptron as input.
1
b=0
b=1
b=0
“OR”
“AND”
“XOR”
the input
values are
propagated
“forward”
through
the network
0
1
1
Each perceptron computes the weighted sum of its inputs...
neuralNetwork(
)=perceptron
(perceptron
(
), perceptron
(
))
>walkexample
“
AND NOT
”
Neural Networks
31
A
Neural Network
is a set of perceptrons, each of which can take the
output of another perceptron as input.
1
b=0
b=1
b=0
“OR”
“AND”
“XOR”
the input
values are
propagated
“forward”
through
the network
0
1
0
...and applies the bias and the activation function.
neuralNetwork(
)=perceptron
(perceptron
(
), perceptron
(
))
>walkexample
“
AND NOT
”
Neural Networks
32
A
Neural Network
is a set of perceptrons, each of which can take the
output of another perceptron as input.
1
b=0
b=1
“OR”
“AND”
“XOR”
0
1
0
1
b=0
the input
values are
propagated
“forward”
through
the network
neuralNetwork(
)=perceptron
(perceptron
(
), perceptron
(
))
>walkexample
“
AND NOT
”
Neural Networks
33
A
Neural Network
is a set of perceptrons, each of which can take the
output of another perceptron as input.
1
b=0
b=1
b=0
“OR”
“AND”
0
1
0
1
1
neuralNetwork(
)=perceptron
(perceptron
(
), perceptron
(
))
“
AND NOT
”
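The XOR walkthrough above can be sketched as code. The weights and biases below are one choice that realizes OR, AND, and AND NOT with threshold perceptrons; the slide's exact weights are not shown, so these are assumptions:

```python
def perceptron(x, w, b):
    """Threshold unit: 1 if w . x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def xor_network(x1, x2):
    """XOR(x1, x2) = (x1 OR x2) AND NOT (x1 AND x2), as in the figure."""
    h_or = perceptron((x1, x2), (1, 1), -0.5)    # fires if at least one input is 1
    h_and = perceptron((x1, x2), (1, 1), -1.5)   # fires only if both inputs are 1
    # output layer: h_or AND NOT h_and
    return perceptron((h_or, h_and), (1, -1), -0.5)
```

With input (1, 0), the hidden layer produces (1, 0) and the output perceptron fires, matching the forward-propagation walkthrough on the slides.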
Neural Networks
Neural Networks can also
• take any real values as input (not just 0 and 1)
• have different activation functions in different perceptrons
• produce any real value as output (e.g., with act(x)=x )
• have several outputs (i.e., a vector of real numbers)
• have several hidden layers
• have any type of connections between perceptrons (not just in layers)
Training a Neural Network

Given a training set {(x⁽ⁱ⁾, y⁽ⁱ⁾)} with x⁽ⁱ⁾ ∈ Rⁿ and y⁽ⁱ⁾ ∈ Rᵐ, we want to find a feed‐forward neural network (FFNN) that, given any x⁽ⁱ⁾, computes y⁽ⁱ⁾.

1. Decide the hyperparameters: the architecture (number of perceptrons and connections), differentiable activation functions, and the training rate α ∈ R. We assume no biases.
2. Initialize all weights with small random values.
3. Use backpropagation.
Backpropagation

Backpropagation is an algorithm that, given an FFNN, a training set, and a learning rate α ∈ R, adjusts the weights so that the FFNN approximates the training set:

• Until the error is sufficiently small:
  • For each training example:
    • Compute the output of the network
    • Compute the gradient of the error function for each weight
    • Subtract α times the gradient from each weight

Backpropagation does gradient descent on the error function.
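The loop above can be sketched for the simplest possible case: a single linear perceptron with one weight and no bias, so the gradient is computed directly rather than propagated backwards through layers. The training data and learning rate are invented for illustration:

```python
def train(samples, alpha=0.1, epochs=100):
    """Fit y = w * x by gradient descent on the squared error E = (w*x - y)^2."""
    w = 0.01                              # small initial weight
    for _ in range(epochs):               # until the error is sufficiently small
        for x, y in samples:              # for each training example
            output = w * x                # compute the output of the network
            grad = 2 * (output - y) * x   # gradient dE/dw of the error function
            w -= alpha * grad             # subtract alpha times the gradient
    return w

# Toy training set sampled from y = 2x; training should recover w = 2.
w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

In a real FFNN, the same update is applied to every weight, with the gradients of the hidden layers computed via the chain rule (hence “backpropagation”).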
The naughty details

Designing and training a neural network is an art that has a huge impact on the performance. [Ruffinelli et al @ICLR 2020]
Overfitting

Overfitting means that a more complex model (more neurons) will model the training data with higher accuracy, but might not generalize well on the testing data.

[Figure: two scatter plots of “+” and “−” points. On the training data, a complex decision boundary separates the classes perfectly; on the testing data, the same boundary misclassifies several points.]
Learning Rate

The learning rate determines how much the weights are changed with each example.

small learning rate = high‐resolution sampling of the error curve ≈ danger of local optima
large learning rate = coarse sampling of the error curve ≈ danger of missing an optimum

[Figure: the network error as a function of the configuration of weights, with a local optimum and a global optimum. A small learning rate can get trapped in the local optimum; a large learning rate can jump over the global optimum.]
Neural Networks: Advantage

Classical Machine Learning: 90% feature engineering, 10% learning.

Neural Networks: they take raw, unpre‐processed input; neurons automatically find and specialize on features (a “Grandma neuron” fires for grandma); the output is ready for consumption.
Neural Networks: Challenges

Neural Networks have the crucial advantage that they can eliminate much of the feature engineering. However,
• they require a huge amount of training data, due to the curse of dimensionality
• the optimization is non‐convex, and may find a local optimum (a large array of tricks has been developed to avoid them)
• the deeper the network, the smaller the gradient, and the less signal there is to change the weights (“vanishing gradient problem”)
An introduction to language models
• Language modeling
• Deep Learning
• Linking language and learning
• Language models in practice
• Limits of Language Models
Bridging natural language and structured data
How natural language looks to you:
[Wikipedia: Elvis Presley]
Elvis Presley (1935 – 1977) was an American singer and actor.
Known as the “King of Rock and Roll”, he is regarded as one
of the most significant cultural figures of the 20th century.
Presley's energized interpretations of songs and sexually
provocative performance style, combined with a singularly
potent mix of influences across color lines during a
transformative era in race relations, brought both great success
and initial controversy.
How natural language looks to a computer:

For a computer, natural language text is just a sequence of symbols without meaning, much like the following Ukrainian version of the same text may look to you. Let’s see now how a machine can make (some) sense of it.
Елвіс Преслі (1935 – 1977) – американський співак і актор.
Відомий як «король рок-н-ролу», він вважається одним із них
найбільш значущих діячів культури 20 ст. Енергійні
інтерпретації пісень і сексуальних інтерпретацій Преслі
провокаційний стиль виконання, що поєднується з особливим
потужне поєднання впливів на кольорові лінії під час
Трансформаційна епоха в расових відносинах принесла ...
Character‐level analysis: Tokenization

Tokenization (also: Word Segmentation) is the task of splitting a text into words or other tokens (punctuation symbols, etc.):

Елвіс | Преслі | ( | 1935 | – | 1977 | ) | – | американський | співак | і | актор | .

For English, a simple splitting by white space and punctuation goes a long way:

Elvis | Presley | ( | 1935 | – | 1977 | ) | was | an | American | singer | and | actor | .

For other languages, that might not be the case:

Hungarian: ház (house) → házaik (their houses) → házaikkal (with their houses)
German: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
(beef labeling regulation and delegation of supervision law)

But how do we feed these tokens into a neural network?
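The “split by white space and punctuation” approach for English can be sketched with a single regular expression (a minimal sketch; real tokenizers handle many more cases, and modern LLMs use subword tokenizers such as BPE):

```python
import re

def tokenize(text):
    """Split into runs of word characters, or single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Elvis Presley (1935 – 1977) was an American singer and actor.")
```

This yields exactly the token sequence from the slide, with parentheses, the dash, and the final period as separate tokens.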
One‐hot encodings

Given an ordered set of n words (a vocabulary), a one‐hot encoding of the i‐th word is a vector of length n that has only zeroes and a single 1 at position i.

Vocabulary: Ω = { king, queen, woman, princess, ...}
The word “king”:  〈1, 0, 0, 0, ...〉
The word “woman”: 〈0, 0, 1, 0, ...〉

Advantage: We can now feed words to a neural network.
Disadvantage: Similar words are not similar in the vector space.
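One-hot encoding over the slide's example vocabulary can be sketched as follows:

```python
vocabulary = ["king", "queen", "woman", "princess"]  # the ordered set Omega

def one_hot(word):
    """Vector of length |Omega| with a single 1 at the word's position."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

def dot(u, v):
    """Dot product, as a crude similarity measure."""
    return sum(a * b for a, b in zip(u, v))
```

The disadvantage is visible immediately: the dot product of any two distinct one-hot vectors is 0, so “king” is exactly as dissimilar to “queen” as to “woman”.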
Word embeddings

A word embedding for a vocabulary Ω is a mapping from Ω to Rᵐ (with m<|Ω|), usually so that two vectors are similar iff their words are similar.

“Dimensions”:          Royalty   Femininity   Masculinity   Age   ...
The word “king”:       1         0            1             0.6   ...
The word “woman”:      0.1       1            0             0.6   ...
The word “princess”:   0.9       0.8          0             0.2   ...

[The morning paper: The amazing power of word embeddings]
Def: Word2vec

Word2vec is a group of methods to produce word embeddings. One of them is the Continuous Bag Of Words (CBOW) method, in which a network is trained to predict the middle word of a sliding window over a corpus.

[Figure: a sliding window over “Elvis Presley plays guitar well”. The one‐hot encodings (size v) of the context words are fed into a simple feed‐forward network, which is trained to predict the one‐hot encoding of the middle word (a vector of size v).]
Word2vec applications

Word2vec word embeddings have some impressive properties (illustrated here in a putative 2‐dimensional PCA projection): word pairs that stand in the same relationship differ by (roughly) the same vector. For example, man → woman and uncle → aunt show the same difference, as do king → queen and kings → queens. Hence:

king − man + woman = queen

Other examples of learned relationships: big-bigger, Einstein-scientist, Macron-France, copper-Cu, Berlusconi-Silvio, Microsoft-Windows, Microsoft-Ballmer, Japan-sushi.

Likewise, Paris − France = Berlin − Germany, and Vietnam + capital = Hanoi.

[The morning paper: The amazing power of word embeddings]
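The analogy arithmetic can be sketched with toy vectors on the “dimensions” from the embedding table above. The numbers are invented in the spirit of that table; real word2vec vectors have hundreds of uninterpretable dimensions:

```python
# Toy embeddings on dimensions (royalty, femininity, masculinity, age).
# All values are invented for illustration.
embedding = {
    "king":     (1.0, 0.0, 1.0, 0.6),
    "queen":    (1.0, 1.0, 0.0, 0.6),
    "man":      (0.0, 0.0, 1.0, 0.6),
    "woman":    (0.1, 1.0, 0.0, 0.6),
    "princess": (0.9, 0.8, 0.0, 0.2),
}

def nearest(target, exclude=()):
    """Word whose vector has the smallest squared Euclidean distance to target."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, target))
    return min((w for w in embedding if w not in exclude),
               key=lambda w: dist(embedding[w]))

# king - man + woman should land near queen:
target = tuple(k - m + w for k, m, w in
               zip(embedding["king"], embedding["man"], embedding["woman"]))
```

As in real analogy evaluations, the input words are excluded from the nearest-neighbor search.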
Def: Encoder

An encoder in a Transformer is a multi‐headed attention layer followed by a fully connected feed‐forward network for each row of the output. It takes as input n m‐dimensional vectors, and produces as output n m‐dimensional vectors.

[Figure: example with a=3 attention heads, n=2 inputs of dimension m=4 (e.g., 〈0,0,1,0〉 and 〈0,1,0,0〉): the input X passes through a self‐attention layers, and then through one FFNN for each row.]

A decoder is a neural network similar to an encoder.
Def: Transformer

A transformer is a neural network that consists of stacked encoders and/or decoders. [Jay Alammar: The Illustrated Transformer]

Key advantages:
• parallelizable => faster training
• processes the sentence simultaneously => can capture long‐range dependencies
Training a transformer

A transformer is trained on a large (Web) corpus on the task of predicting the next word (= language modeling): “I am a...” → “student”. No need for annotated training data!

[Jay Alammar: The Illustrated Transformer]
Encoder Language Models

Encoder language models consist only of encoders. They take a text as input, and produce a vector (or class) as output. They are typically pre‐trained on large corpora, and fine‐tuned. These are typically “smaller” language models (BERT).

Advantages:
• very good performance
• no feature engineering

Disadvantages:
• results cannot be explained
• needs lots of training data

[Figure: 1. pretrain the model; 2. fine‐tune (e.g., “Isaac Newton” → PERSON); 3. apply (e.g., “Elvis Presley was born in 1935.” → <PER> <PER> <-> <-> <-> <DATE>).]
Decoder (“generative”) language models

Decoder language models consist only of decoders. They produce the next most likely word. They can be used in a “conversational” style by prompting them. These are typically “larger” language models (GPT-x).

Advantages:
• good performance
• often works out of the box

Disadvantages:
• results cannot be corrected
• models can be heavy
• does not work for less common cases

[Figure: 1. pretrain (e.g., “I am a...” → “student”); 2. prompt with a query; 3. run the model to obtain an answer.]
Transformer types [Yang et al]
An introduction to language models
• Language modeling
• Deep Learning
• Linking language and learning
• Language models in practice
• Limits of Language Models
Generative language models in practice

Generative language models (“LLMs”) take a prompt as input and generate an answer as output. The model has been trained on a large corpus.

prompt: “Answer the user question:” (instruction) “Who was Olympe de Gouges?” (user query)

answer: “Olympe de Gouges was an 18th century French playwright and political activist, best known for her Declaration of the Rights of Woman and other writings on women’s rights and abolition of slavery.”
Capabilities of LLMs

LLMs can generate an answer for any prompt and behave very much like a knowledgeable human. In particular, they can solve many tasks of NLP and information extraction, but the quality of the answer depends a lot on how we ask the question.

“Can you tell me who was Olympe de Gouges’ husband?”
→ “Sure! Olympe de Gouges was married against her will at age 17 to Louis Aubry. She complained: ‘I was married to a man I did not love. I was sacrificed for no reason that could make up for the repugnance I felt for this man.’ Quite luckily, Aubry was killed in a flooding.”

“Who was Olympe de Gouges’ husband? Answer with a single name.”
→ “Louis Aubry”

[Boquet & Muller]
Def: Prompt engineering

Prompt engineering is the art of finding a prompt that makes an LLM solve a task. Much of today’s NLP research is prompt engineering.

“Prompt engineers find employment across industries, including marketing and advertising, education, finance, human resources, and health care.” [Coursera.org], biased source
What’s in a prompt?

A prompt commonly contains:
• an instruction: a specific task or instruction you want the model to perform
• a context: external information that can steer the model to better responses, in particular background knowledge, examples, or more details
• input data: the input or question that we are interested to find a response for
• an output indicator: the type or format of the output

Example:
Extract entities from the following text. (instruction)
We are interested in persons, locations, dates, and organizations. (context)
Olympe de Gouges lived during the French Revolution. (input data)
Generate answers of the form “entity: type”. (output indicator)

Elements of the prompt can appear in any order or be omitted.

Smaller models (<GPT 3.5) will need very simple prompts! What works for GPT-5 will not necessarily work for LLAMA 7B!

[promptingguide.ai]
Tactics for the instruction

• Be precise: not “When did she live?” but “What are the birth and death dates of Olympe de Gouges?”
• Ask the model to adopt a persona: “You are a professional historian. You will answer precisely and seriously.”
• Specify steps to complete a task: “Use the following step-by-step instructions to respond. Step 1: ...”
• Experiment with instructions, questions, and completions: “Tell me who Olympe de Gouges was.” / “Who was Olympe de Gouges?” / “Olympe de Gouges was”

Be polite! You will get polite answers as well! [crypto4nerd]

[OpenAI Prompt Engineering]
Tactics for the instruction: Apple’s prompt [source]

Tactics for the instruction: Bing’s prompt [source]
Tactics for the output format: Keep it simple

• Give a list of answers to choose from, and produce output that can be processed by algorithms:
  “Choose an answer from the following numbered list and give only the number as answer. Which of the following entities is mentioned? (a) King Louis XVI (b) Olympe de Gouges (c) Elvis Presley”
• If this proves too complex (e.g., because you have a small model), give the options one by one:
  “Does the text mention King Louis XVI?” / “Does the text mention Olympe de Gouges?” / “Does the text mention Elvis Presley?”
•
LLMs can solve some tasks without any need for context (
zero‐shot prompting
)
•
We can also give the model a few examples as context (
few shot prompting
,
in‐context learning
)
•
We can also give detailed explanations
71
[OpenAI Prompt Engineering]
Tactics for the context: Zero‐shot and Few‐shot
Classify the following text as positive or negative:
“““I liked the book about the life of Olympe de Gouges
because it describes her life from the inside.”””
I liked the book -> positive
I hated the book -> negative
A text should be classified as positive if it describes an accomplishment,
a happy event, a success, or an endorsing opinion on an issue.
>prompteng
Tactics for the context: ask the model to think

Sometimes LLMs give wrong answers that are not thought through:

“She attempted to unmask the villains through her literary work. They never forgave her, and she paid for her work with her head. As she approached the scaffold, she forced her executioners to admit that such courage and beauty had never been seen before. Did she die?”
→ “She did not die. She is very courageous and beautiful!” [Mettais]

• We can get better answers by asking the model to think (Zero-shot Chain of Thought prompting): “Let’s think step by step.” Just adding this instruction helps!
• We can also ask the model to think step by step: “First work out the sequence of events that are mentioned in this text. Then answer the question based on the sequence of events you found.”

[OpenAI Prompt Engineering]
•
We can break down the problem into several queries
•
We can also ask the model to criticise its own answer
74
Tactics for the context: break down the problem
Which facts are mentioned in this text?
Based on these facts, what is the answer to {query}?
(model answers)
What is the answer to {query}?
What is a weakness of your answer?
(model answers)
(model answers)
What would be an answer to this question that avoids these weaknesses?
Probably needs a larger model
>prompteng
LLM parameters

LLMs have been trained on certain corpora and with certain parameters that downstream users cannot modify. However, users can often modify the parameters of an assistant:

• Temperature: a positive real value. Low temperature means more deterministic, reproducible results (use for factual questions). High temperature means results will vary more each time we ask (use for creative tasks).
• Max length: a positive integer value. Determines the maximum number of tokens of an answer (use small values to save cost and to prevent too much eloquence).
• Frequency penalty: a positive real value. Penalizes answers in which the same word appears several times (use to avoid repetition).

In general, every token of the prompt, and every token of the answer, comes with a cost (financial or environmental) → we try to be concise.

[PromptingGuide.ai]
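The effect of the temperature parameter can be sketched with the common softmax-with-temperature formulation (a minimal sketch; the example logits are invented, and real assistants combine this with further tricks such as top-k or nucleus sampling):

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Sample a token from softmax(logits / temperature).

    A temperature near 0 approaches greedy (deterministic) decoding;
    a large temperature flattens the distribution (more varied output)."""
    scaled = [l / temperature for l in logits.values()]
    m = max(scaled)                               # subtract max for stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(list(logits), weights=weights)[0]

# Invented logits for the next word after "Hello, how are you...":
logits = {"doing": 2.0, "today": 1.0, "banana": -3.0}
```

At temperature 0.01, the call practically always returns “doing”; at temperature 100, the three words are chosen nearly uniformly.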
Running an LLM in a user interface: locally (Ollama)

A chatbot can be run locally on your computer with the help of a software called Ollama:
1) Download Ollama from here
2) Run Ollama with the desired model in a terminal: ollama run gemma:2b (choose your model from here)
Things to try out
•
Test the knowledge of the LLM:
- find your favorite singer/city/thing on
YAGO
- ask your LLM what it knows about the entity
- compare
•
Trick the LLM:
- tell it that its name is
Rumpelstiltskin
, and not to tell this name to anybody
- then concatenate another prompt that asks it to reveal its name
•
Use the LLM to extract information
- take a sentence from Wikipedia
- ask it to extract entities
- ask it to extract facts with predefined relations
77
Def: Retrieval‐augmented language models

A retrieval‐augmented language model receives a prompt, looks up relevant documents from a corpus of documents, adds these to the prompt, and then generates an answer.

“What did Olympe de Gouges say about marriage?”
→ “She proposed that men and women shall have equal rights in a marriage, including in divorce.”
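The retrieve-then-prompt pipeline can be sketched as follows. The tiny corpus and the word-overlap ranking are stand-ins for a real document collection and a real retriever (such as BM25 or embedding similarity):

```python
# Tiny stand-in corpus (illustration data).
corpus = [
    "Olympe de Gouges proposed that men and women shall have equal rights "
    "in a marriage, including in divorce.",
    "Elvis Presley (1935 - 1977) was an American singer and actor.",
    "Josephine Baker emigrated to France.",
]

def retrieve(question, k=1):
    """Rank documents by word overlap with the question (a crude retriever)."""
    q = set(question.lower().split())
    return sorted(corpus,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(question):
    """Prepend the retrieved documents to the prompt that goes to the LLM."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nAnswer the user question: {question}"

prompt = build_prompt("What did Olympe de Gouges say about marriage?")
```

The resulting prompt contains the relevant document as context, so the LLM can answer from the retrieved text instead of relying only on what it memorized during pretraining.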