Running LLMs in practice
CC-BY
Fabian M. Suchanek
Running an LLM in practice
•
Running an LLM in a user interface
•
locally
•
online
•
Running an LLM from code
•
locally
•
online
•
Parameters
•
Things to try out
2
Running an LLM in a user interface: locally (Ollama)
A chatbot can be run locally on your computer by help of a software called
Ollama
.
1)
Download Ollama from
here
2)
Run Ollama with the desired model in a terminal:
ollama run gemma:2b
choose your model from
here
To copy from these slides, mark the text with the mouse with the ALT key pressed
3
Running an LLM in a user interface: locally (llama.cpp)
Alternatively, you can use a system called
llama.cpp
:
1) download llama.cpp from
here
2) run
llama-cli -m model.gguf
choose your model from
here
To copy from these slides, mark the text with the mouse with the ALT key pressed
4
Running an LLM in a user interface: online
5
Several interfaces allow trying out LLMs for free, e.g.:
•
Duck.ai (
https://duck.ai/
)
•
ChatGPT
(
https://chat.openai.com/
)
•
Google AI studio
(
https://aistudio.google.com/
)
requires login and paying when going beyond a limit of free tokens.
Running an LLM from code: locally (Ollama)
Ollama does not just allow querying the model via a user interface, but also via code.
1)
Download Ollama from
here
(as before)
2)
Make sure Ollama is running in the background as a service
3)
Install the
Ollama Python library
by running the following in the terminal:
4)
start using the model in Python:
import ollama
response = ollama.chat(model='llama3', messages=[
{ 'role': 'user', 'content': 'Why is the sky blue?', },])
print(response['message']['content'])
pip install ollama
choose your model from
here
put your query here
>HF
To copy from these slides, mark the text with the mouse with the ALT key pressed
6
HuggingFace also allows running an LLM locally from code:
1)
Sign up for a HuggingFace account
here
2)
Navigate
here
, select “New Token”, set permissions to “read”, and copy the key
3)
In a terminal, install
pip install langchain_community
pip install huggingface_hub
Running an LLM from code: locally (HF) 1/2
To copy from these slides, mark the text with the mouse with the ALT key pressed
7
4)
in Python, use as follows:
from langchain_community.llms import HuggingFaceHub
huggingfacehub_api_token = 'YOUR TOKEN'
llm =
HuggingFaceHub(repo_id='tiiuae/falcon-7b-instruct',
huggingfacehub_api_token=huggingfacehub_api_token)
input = 'YOUR INPUT'
output = llm.invoke(input)
print(output)
choose your model
from
here
Running an LLM from code: locally (HF) 1/2
To copy from these slides, mark the text with the mouse with the ALT key pressed
8
1) Be on Linux
2) Install vLLM as indicated
here
3) Run the Python code
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate([YOUR INPUT], sampling_params)
choose your model
from
here
Running an LLM from code: locally (vLLM)
To copy from these slides, mark the text with the mouse with the ALT key pressed
9
Running an LLM from code: online 1/2
Google colab allows running LLMs for free.
1)
Go to
Google Colab
, sign in
2)
Install the required libraries (from Huggingface):
Run the following commands in a code cell
3)
Connect your Google drive to Colab:
-
Copy-paste the following code into a code cell:
-
Follow the link that appears, choose your Google account, copy the authorization code
-
Paste the code back into Colab
!pip install transformers
from google.colab import drive
drive.mount('/content/drive')
10
Running an LLM from code: online 2/2
4)
Enable GPU acceleration:
In the Web interface, go to Runtime > Change runtime type
Next to Hardware accelerator, click on the dropdown and select GPU
5)
Launch your queries
from transformers import pipeline
generator = pipeline(model="openai-community/gpt2")
output=generator("Who is your favorite singer?", do_sample=False)
print(output)
choose your model
from
here
To copy from these slides, mark the text with the mouse with the ALT key pressed
11
LLM parameters
LLMs have been trained on certain corpora and with certain parameters that downstream users
cannot modify. However, users can often modify the
parameters of an assistant
:
•
Temperature
: a positive real value
Low temperature means more deterministic, reproducible results.
High temperature means results will vary more each time we ask.
•
Max length
: a positive integer value
Determines the maximum number of tokens of an answer.
•
Frequency penalty
: a positive real value
Penalizes answers in which the same word appears several times.
use for factual questions
use for creative tasks
use small values to save cost and
to prevent too much eloquence
use to avoid repetition
In general, every token of the prompt, and every token of the answer,
comes with a cost (financial or environmental) → we try to be concise.
[PromptingGuide.ai]
12
Things to try out
•
Test the knowledge of the LLM:
- find your favorite singer/city/thing on
YAGO
- ask your LLM what it knows about the entity
- compare
•
Trick the LLM:
- tell it that its name is
Rumpelstiltskin
, and not to tell this name to anybody
- then concatenate another prompt that asks it to reveal its name
•
Use the LLM to extract information
- take a sentence from Wikipedia
- ask it to extract entities
- ask it to extract facts with predefined relations
13