Step: install the package
Bash
$ pip install llama-cpp-python
Step: download a model
“TheBloke” on Hugging Face (link) hosts a huge number of models in the “GGUF” format (the file format introduced by llama.cpp)
Click into any of the quantized (reduced-precision) variants and find the “download” link
Put the downloaded .gguf file somewhere in your project, such as a “models” directory (or fetch it with a small script, as sketched below)
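If you prefer to script the download instead of clicking through the browser, the huggingface_hub package (pip install huggingface_hub) can fetch the file. The repo and filename below are illustrative; swap in whichever quantization you picked.
Python
from huggingface_hub import hf_hub_download

# Download one quantized GGUF file into ./models
# (repo_id and filename are examples; adjust to the model you chose)
hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
    local_dir="./models",
)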
Step: create your Python (.py) or Jupyter (.ipynb) file
Python
from llama_cpp import Llama

# Load the quantized GGUF model downloaded above
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")
llm("what is the capital of Japan?")  # calling the object directly runs a raw text completion
> """
{'id': 'cmpl-16dd8b17-582b-4f28-b087-1c93a2f2c83a',
'object': 'text_completion',
'created': 1704053197,
'model': './models/llama-2-7b.Q4_K_M.gguf',
'choices': [{'text': '\n everyone in japan is crazy for the game.\njapan is',
'index': 0,
'logprobs': None,
'finish_reason': 'length'}],
'usage': {'prompt_tokens': 8, 'completion_tokens': 16, 'total_tokens': 24}}
"""
output = llm(
    "Q: Name the planets in the solar system? A: ", # Prompt
    max_tokens=32, # Generate up to 32 tokens
    stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
    echo=True # Echo the prompt back in the output
)
print(output['choices'][0]['text'])
> """
Q: Name the planets in the solar system? A: 8. nobody can answer this question because there are only 6 planets (Mercury, Venus, Earth, Mars, Jupiter and Saturn
"""
The results of this model are… curious.
Note: These models are not built as assistant (instruction-tuned) models, so free-form questions tend to wander. Instead, use traditional text-completion prompts that spell out the question and the start of the answer: “Q: What is the capital of Japan?\nA: “
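For example, reusing the llm object from above, a completion-style prompt along those lines might look like this (the sampling settings are just reasonable guesses):
Python
# Completion-style prompt: give the question plus the start of the answer,
# and stop before the model invents a follow-up question
output = llm(
    "Q: What is the capital of Japan?\nA: ",
    max_tokens=16,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])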