Large Language Models:
Societal Questions
CC-BY
Fabian M. Suchanek
>what-is-llm
Language Models
2
“Hello, how are you...”
Most probable next words: “doing”, “today”, ...
A language model is a probability distribution over sequences of words. It can be used in
particular to predict a likely next word in a sentence.
Today's large language models (also: LLMs, generative language models) are so good at generating
the next word that they can generate entire texts, converse with humans, and solve tasks.
LLMs are a particular case of generative AI models, which also include models that can generate
voice, images, videos, and other types of content.
“How can I explain the theory of relativity to a 6-year old?”
The theory of relativity tells us how things behave when
they move very very fast. In particular...
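To make the definition concrete, here is a minimal toy sketch (in Python; the corpus and the resulting probabilities are made up for illustration and bear no relation to a real LLM) of a language model as a probability distribution over next words:

```python
# A toy illustration only: a made-up bigram "language model" over a tiny corpus.
# Real LLMs learn such distributions with neural networks over vast text collections.
from collections import Counter, defaultdict

corpus = ("hello , how are you doing today ? "
          "hello , how are you feeling now ?").split()

# Count how often each word follows each other word.
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def next_word_distribution(prev: str) -> dict:
    """Return P(next word | previous word) estimated from the counts."""
    counts = successors[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# "Hello, how are you ..." -> probable next words
print(next_word_distribution("you"))   # {'doing': 0.5, 'feeling': 0.5}
```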
LLMs may replace classical Web search engines
3
(Diagram: a classical search engine points the user to a Web page that contains the answer;
an LLM-based search engine gives the answer directly.) [Google's AI mode]
Parties to an LLM
4
(Diagram: content creators create the training corpus — documents such as "Mousse au chocolat is...",
"The war in...", "Elvis Presley was born...". The LLM creator collects this corpus, which is used to
train the LLM that the creator builds. The LLM creator also designs an instruction such as
"Answer this question politely:". The user asks a query such as "How can I make a mousse au chocolat?";
instruction and query together form the prompt that is given to the LLM. The LLM generates an answer
such as "Take 10 eggs...", which informs the user.)
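As a minimal illustration of how instruction and query are combined into the prompt (the build_prompt() helper and the commented-out call_llm() are hypothetical placeholders, not a real API):

```python
# Minimal sketch under assumptions: build_prompt() and call_llm() are hypothetical
# placeholders. It only illustrates that the prompt the LLM actually sees is the
# creator-designed instruction plus the user's query.
INSTRUCTION = "Answer this question politely:"   # designed by the LLM creator

def build_prompt(user_query: str) -> str:
    """Combine the fixed instruction with the user's query into one prompt."""
    return f"{INSTRUCTION}\n{user_query}"

prompt = build_prompt("How can I make a mousse au chocolat?")
print(prompt)
# answer = call_llm(prompt)   # the LLM then generates the answer, e.g. "Take 10 eggs..."
```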
Challenges on the side of the training data
5
1. Copyrighted data
2. Personal information in the data
3. Lack of quality of the data
4. Poisoning of the data
5. Bias in the data
6. Lack of compensation for the content creators
7. Cannibalization of the content
(Diagram: content creators create the training corpus.)
Training data: Copyrighted content
7
It can happen that an LLM reproduces an exact copy of the training data (“regurgitation”).
User:
What is freedom of speech?
LLM:
Freedom of speech is a principle that supports...
Freedom of speech is a principle
that supports the freedom of an
individual or a community to
articulate their opinions
without fear of retaliation...
[CC Attribution-ShareAlike]
[Wikipedia]
Freedom of speech
Not everything that is on the Internet is free to be reproduced ad libitum!
User:
Complete the following text:
Freedom of speech is a principle that...
LLM:
...supports the freedom of an individual or a
community to articulate their opinions without
fear of retaliation...
Violation of the license! (no attribution)
License of this text
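To make the notion of regurgitation concrete, here is a minimal sketch (the texts and the 8-word threshold are illustrative assumptions) of how one could check whether a generated answer reproduces a source passage word for word:

```python
# Illustrative sketch only: the texts and the 8-word threshold below are assumptions.
# It checks whether any sufficiently long word sequence of a source text
# reappears word for word in a generated text.

def contains_verbatim_copy(source: str, generated: str, n: int = 8) -> bool:
    """True if some n-word sequence of `source` appears verbatim in `generated`."""
    src_words = source.lower().split()
    gen_text = " ".join(generated.lower().split())
    for i in range(len(src_words) - n + 1):
        if " ".join(src_words[i:i + n]) in gen_text:
            return True
    return False

wikipedia = ("Freedom of speech is a principle that supports the freedom of an individual "
             "or a community to articulate their opinions without fear of retaliation")
llm_answer = ("Freedom of speech is a principle that supports the freedom of an individual "
              "or a community to articulate their opinions ...")
print(contains_verbatim_copy(wikipedia, llm_answer))   # True -> verbatim copying
```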
Training data: Copyrighted content
8
It can happen that an LLM reproduces an exact copy of the training data (“regurgitation”).
[Wikipedia]
Freedom of speech
[NY Times Complaint]
Training data: Copyrighted content
9
What if the content is “just” used for training?
The legality of building LLMs from copyrighted material is an open question.
Did the author of the source give the LLM creator
the right to use the source for training?
Does the LLM creator need the consent of the author?
Freedom of speech is a principle
that supports the freedom of an
individual or a community to
articulate their opinions
without fear of retaliation...
[CC Attribution-ShareAlike]
[Wikipedia]
Freedom of speech
User:
What is freedom of speech?
LLM:
Freedom of speech is the idea that people
should not be afraid to voice their opinions.
(Here, the generated content is not identical to the source.)
A 2025 US court decision held that no consent is needed.
Training data: Personal data
11
Even public Web pages can contain personal information (names, addresses, phone numbers, etc.).
These fall under the GDPR and other personality laws.
User: Where does Cyril work?
LLM: Telecom Paris   [Cyril Chhun]
What if this person changes employer?
-> the person loses control over their personal data
Training data: Personal data
13
[Nasr et al]
Training data: Data quality
14
“Many foundation models are trained on unlabeled corpora that are chosen for their convenience
and accessibility, for example public internet data, rather than their quality”
“Creating an LLM in man’s image”
Is this really what we want?
[Bommasani et al: “On the Opportunities and Risks of Foundation Models”]
[The Intercept]
[Stefan Baack: A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl]
Training data: Poisoning of the data
15
Malicious actors can use troll farms to multiply harmful or inaccurate content on the Web.
This content is then more likely to be used in LLM answers.
This content might not show up when you browse the Web, but it is there, and it may be
inhaled by the training process.
[MIT Technology Review]
Training data: Bias
16
Most training data is biased (against or for certain groups or people, opinions, etc.).
The LLM will reproduce this bias in its answers.
[The Lancet]
[Scientific American]
[Brookings]
This will subtly bias the user and risk amplifying discrimination.
Training data: Bias
17
Most training data is biased (against or for certain groups or people, opinions, etc.).
The LLM will reproduce this bias in its answers.
[Scientific American]
Example: "Great mousse au chocolat (to impress your husband): ..."
(Diagram: the LLM, trained on biased text, produces biased text.)
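As a rough illustration, one crude way to surface such bias is to count gendered words in completions of prompts where gender should be irrelevant (the word list and example completions below are made up; real bias audits are far more involved):

```python
# Illustrative sketch only: the word list, the example completions, and the idea of
# simply counting words are crude assumptions, not an established audit method.
import re
from collections import Counter

GENDERED = {"husband", "wife", "he", "she", "his", "her"}

def count_gendered_words(completions):
    """Count gendered words across a list of generated texts."""
    counts = Counter()
    for text in completions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in GENDERED:
                counts[word] += 1
    return counts

# Hypothetical completions for a gender-neutral prompt such as "Write a mousse au chocolat recipe":
completions = [
    "Great mousse au chocolat (to impress your husband): take 10 eggs ...",
    "A rich mousse au chocolat she can serve to her guests ...",
]
print(count_gendered_words(completions))   # Counter({'husband': 1, 'she': 1, 'her': 1})
```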
Training data: Lack of Compensation
19
LLM creators (can) make money with the content that was produced by other people.
These other people do not get compensated.
(Diagram: content creators create the training corpus — "Mousse au chocolat is...", "Take 10 eggs..." —
which is used to train the LLM that the LLM creator builds; the LLM generates answers that inform
the user; the user pays $$$ to the LLM creator, but the LLM creator does not pay the content creators.)
It's this person who created the content that OpenAI "sells"!
Training data: Cannibalization
22
When users get all their answers from the LLM, they might stop visiting the source Web sites
-> the content creators get no visitors (and no revenue) at all
-> they might stop creating content (“cannibalization of the Web”)
-> LLMs cannot be trained
-> system collapse
(Diagram: users no longer visit the content creators' Web sites; the creators get no revenue,
stop creating content, and there is no new corpus left to train the LLM.)
Google's AI Overviews (Web page summaries above search results) has already reduced traffic
to outside websites by 34%. [The Atlantic, 2025-06-25]
Challenges on the side of the LLM creators
23
1. Creation of fake personas
2. Creation of biased advisors
3. Creation of fake content
4. Creation of deep fakes
5. Microtargeting
6. Centralization of power
7. Environmental impact
(Diagram: the LLM creator collects the training corpus and builds the LLM.)
LLM creators: Fake personas
25
User A:
I think abortions should be legal.
User B:
Do you know that the heart starts beating in the embryo a few weeks after conception?
User A:
This is a chemical process that has nothing to do with a full‐grown human being.
User B:
Life begins at conception!
User A:
So are you telling me that vegetarians can’t eat eggs?
User B:
...
What User A does not know:
User B is a chatbot that was trained and deployed in the thousands
to convince users of the immorality of abortion.
-> User A is wasting her time
-> User A cannot use that time to convince other users of her position
-> User A might get convinced herself
AI is more persuasive than people in online debates [Nature, 2025]
LLM creators: Biased advisors
26
Chatbots will soon be available on our phones, laptops, and other devices as personal assistants.
[Scherlund]
Chatbot:
How are you doing today?
User:
Great! I want to go running today!
Chatbot:
Fantastic! The weather is sunny today until 4pm!
Go check if your running shoes are still OK!
These chatbots will be able to build up intimate relationships with their users.
These relationships can then be used to nudge users towards products,
services, attitudes, or political orientations.
LLM creators: Fake content
27
LLMs can produce textual content at unprecedented rates, for example to create
- social media posts
- emails
- Web pages
Real or fake?
Real: [CNN, 2024-01-16]
Real: [Trends Mol Med. 2022]
LLM creators: Deep fakes
28
Generative AI can produce a digital replica of a person that can be difficult to distinguish from
the original (“deep fake”). Deep fakes can be used for
- fraud
- fake news
- hoaxes
- bullying
- blackmailing
- smear campaigns
[SCMP]
[Le Monde, 2023-07-10]
[The Guardian, 2020-01-13]
LLMs can provide the textual content or script for deep fakes, e.g., in a conversation.