Large Language Models:
Societal Questions
CC-BY
Fabian M. Suchanek
Language Models
“Hello, how are you...”
Most probable next words: “doing”, “today”, ...
A language model is a probability distribution over sequences of words. In particular, it can be used to predict a likely next word in a sentence.
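To make the definition concrete, here is a minimal sketch of a bigram model, which estimates the probability of the next word from counts of word pairs. It is only an illustration: the toy corpus and the function names are assumptions, and today's LLMs use neural networks over much longer contexts rather than pair counts.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the corpus."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_distribution(model, word):
    """Estimate P(next word | word) by relative frequency."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


# Hypothetical toy corpus, for illustration only.
corpus = [
    "hello how are you doing",
    "hello how are you today",
    "how are you doing",
]
model = train_bigram_model(corpus)
distribution = next_word_distribution(model, "you")
```

With this corpus, the most probable next words after “you” are “doing” and “today”, as in the example above.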
Today’s large language models (also: LLMs, generative language models) are so good at generating the next word that they can generate entire texts, converse with humans, and solve tasks.
LLMs are a particular case of generative AI models, which also include models that can generate voice, images, videos, and other types of content.
User: “How can I explain the theory of relativity to a 6-year-old?”
LLM: The theory of relativity tells us how things behave when they move very very fast. In particular...
Potential of LLMs
• summarize video meetings: switch on transcription, ask the LLM to summarize, then curate
• act as friends/advisors/lovers
• talk me out of smoking, convince me to lose weight, ...
• translate text
• liberate doctors from note-taking by summarizing recorded meetings
• generate ads, poems, essays, ...
[Excerpts from TIME special issue, 2023-12-28]
Potential of LLMs
• “duplicate myself” (create a copy of myself that talks like myself)
• “recreate” deceased or ill persons
• diagnose illnesses based on a description of the symptoms
• “recreate” an artist (deceased or not)
• help in education
• write delicate emails
[Excerpts from TIME special issue, 2023-12-28]
[Suchanek et al: “Knowledge bases and Language Models”, RuleML 2023]
• generate code
(There is an abundance of multi-modal training data)
Industry adoption
As of 2024, just 5% of US businesses use generative AI to produce goods or services. Reasons cited include fear of:
• damaging their reputation if they adopt too quickly
• lawsuits related to privacy, bias, and copyright
• compromising customer data
• high costs
• security vulnerabilities
• impossible training, due to scattered or unusable data
• having to update outdated IT infrastructure
• lack of human skills
[The Economist, 2024-07-02]
[The Economist, 2024-11-04]
Parties to an LLM
[Diagram: Content creators create the training corpus (“Mousse au chocolat is...”, “The war in...”, “Elvis Presley was born...”), which is used to train the LLM. The LLM creator collects the training corpus, builds the LLM, and designs the instruction prompt (“Answer this question politely:”). The user asks a query (“How can I make a mousse au chocolat?”), which prompts the LLM. The LLM generates an answer (“Take 10 eggs...”), which informs the user.]
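As a rough sketch of how the instruction prompt and the user query can reach the LLM together, the snippet below simply concatenates them. The function name and the newline separator are assumptions made for illustration. Real systems typically use structured chat formats rather than plain concatenation.

```python
# Instruction prompt designed by the LLM creator (taken from the diagram).
INSTRUCTION_PROMPT = "Answer this question politely:"


def build_model_input(instruction_prompt, user_query):
    """Combine the creator's instruction prompt with the user's query.

    Hypothetical helper: plain concatenation is an assumption, not a
    description of how any particular LLM product works.
    """
    return f"{instruction_prompt}\n{user_query}"


model_input = build_model_input(
    INSTRUCTION_PROMPT, "How can I make a mousse au chocolat?"
)
```

The user only sees their own query and the answer. The instruction prompt is chosen by the LLM creator.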
Challenges on the side of the training data
1. Copyrighted data
2. Personal information in the data
3. Lack of quality of the data
4. Poisoning of the data
5. Bias in the data
6. Lack of compensation for the content creators
7. Cannibalization of the content
[Diagram: Content creators create the training corpus (“Mousse au chocolat is...”, “The war in...”, “Elvis Presley was born...”).]
Training data: Copyrighted content
It can happen that an LLM reproduces an exact copy of the training data (“regurgitation”).
User: What is freedom of speech?
LLM: Freedom of speech is a principle that supports...
Source: “Freedom of speech is a principle that supports the freedom of an individual or a community to articulate their opinions without fear of retaliation...” [Wikipedia: Freedom of speech, CC Attribution-ShareAlike]
Not everything that is on the Internet is free to be reproduced ad libitum!
User: Complete the following text: Freedom of speech is a principle that...
LLM: ...supports the freedom of an individual or a community to articulate their opinions without fear of retaliation...
Violation of the license of this text (CC Attribution-ShareAlike): no attribution!
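One simple way to spot possible regurgitation is to look for the longest run of consecutive words that an LLM output shares with a known source. The sketch below uses Python's standard `difflib` for this. The function name is an assumption, and a long shared run is only an indicator: it says nothing by itself about whether a license was violated.

```python
from difflib import SequenceMatcher


def longest_shared_run(generated, source):
    """Return the longest sequence of consecutive words found in both texts."""
    gen_words = generated.lower().split()
    src_words = source.lower().split()
    matcher = SequenceMatcher(None, gen_words, src_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(src_words))
    return gen_words[match.a : match.a + match.size]


source = (
    "Freedom of speech is a principle that supports the freedom of an "
    "individual or a community to articulate their opinions without "
    "fear of retaliation"
)
verbatim = (
    "supports the freedom of an individual or a community to articulate "
    "their opinions without fear of retaliation"
)
paraphrase = (
    "Freedom of speech is the idea that people should not be afraid "
    "to voice their opinions"
)

verbatim_run = longest_shared_run(verbatim, source)
paraphrase_run = longest_shared_run(paraphrase, source)
```

Here the verbatim completion shares all of its words, in order, with the source, while the paraphrase shares only a short opening phrase.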
[Figures: examples of regurgitation. Sources: Wikipedia (“Freedom of speech”), NY Times Complaint]
What if the content is “just” used for training?
The legality of building LLMs from copyrighted material is an open question.
Did the author of the source give the LLM creator the right to use the source for training?
Does the LLM creator need the consent of the author?
Source: “Freedom of speech is a principle that supports the freedom of an individual or a community to articulate their opinions without fear of retaliation...” [Wikipedia: Freedom of speech, CC Attribution-ShareAlike]
User: What is freedom of speech?
LLM: Freedom of speech is the idea that people should not be afraid to voice their opinions.
Here, the content is not identical to the source.
Training data: Personal data
Even public Web pages can contain personal information (names, addresses, phone numbers, etc.).
User: Where does Cyril work?
LLM: Telecom Paris
[Cyril Chhun]
Such data falls under the GDPR and other laws that protect personality rights.
What if this person changes employer?
-> The person loses control over their personal data.
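To illustrate how easily some personal information can be picked out of Web text, here is a minimal sketch that scans a page for email addresses and phone numbers with regular expressions. The patterns, the function name, and the sample page are assumptions made for illustration. They are deliberately simple and will miss many formats, and names or postal addresses cannot be found this way.

```python
import re

# Simplified patterns, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_personal_data(text):
    """Return email addresses and phone-like numbers found in the text."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "phones": PHONE_PATTERN.findall(text),
    }


# Hypothetical Web page content with made-up contact details.
sample_page = "Contact Jane Doe at jane.doe@example.org or +33 1 23 45 67 89."
findings = find_personal_data(sample_page)
```

A scan of this kind could flag pages for review before they enter a training corpus, but it does not by itself solve the problem of control described above.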
[Nasr et al]
Training data: Data quality
“Many foundation models are trained on unlabeled corpora that are chosen for their convenience
and accessibility, for example public internet data, rather than their quality”