An elaborate guide to hosting LLMs locally
In this guide I will take you through the steps needed to:
- Download a model from Hugging Face
- Convert it to GGUF using llama.cpp (optionally quantize it)
- Host the model using Ollama
Being able to host your own models and try out different ones is very valuable, both in terms of privacy and task-specific performance.
I needed to do this for the open-source model OLMo-7B-Instruct-hf, so this is the one I will use as an example throughout this blog post.
Setup
First create an empty folder where we will store the models and required scripts.
mkdir local_models
cd local_models
To make managing dependencies (a bit) more convenient, I will use a conda environment. This is especially useful for downloading and managing llama.cpp and its Python wrapper.
conda create -n llms python=3.11
conda activate llms
Download a model from Hugging Face
Install huggingface-cli and log in (see the official instructions):
pip install -U "huggingface_hub[cli]"
huggingface-cli login
Now, to download a specific model you can simply run:
huggingface-cli download allenai/OLMo-7B-Instruct-hf \
--local-dir olmo \
--local-dir-use-symlinks False \
--revision main
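If you prefer to stay in Python, roughly the same download can be done with the huggingface_hub library. This is just a sketch mirroring the CLI call above (same repo id, local directory and revision):
from huggingface_hub import snapshot_download

# Download the whole model repository into the local "olmo" directory,
# mirroring the huggingface-cli call above.
snapshot_download(
    repo_id="allenai/OLMo-7B-Instruct-hf",
    local_dir="olmo",
    revision="main",
)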
Convert the model to GGUF using llama.cpp (optionally quantize it)
If we are not lucky enough that the model provider has already published a .gguf file, we need to create one ourselves using llama.cpp.
conda install "llama-cpp-python[server]" -c conda-forge
conda install pytorch -c conda-forge
pip install sentencepiece
This should download and install llama.cpp and the Python wrapper.
We also need a couple of extra Python utilities.
The first is the Python library gguf. It is hosted on PyPI, but to make it work at the time of writing this post, I needed to install the latest version from GitHub:
pip install git+https://github.com/ggerganov/llama.cpp.git#subdirectory=gguf-py --force-reinstall
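To make sure the GitHub build is actually the one being picked up, a quick sanity check in Python (just a habit of mine, not strictly required):
import gguf

# Show where the gguf package was installed from and its version, if it exposes one.
print(gguf.__file__)
print(getattr(gguf, "__version__", "unknown"))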
llama.cpp also provides a separate script to convert the models (don't ask me why this isn't part of the Python library…).
wget -O convert_hf_to_gguf.py https://raw.githubusercontent.com/ggerganov/llama.cpp/master/convert_hf_to_gguf.py
Now we can convert the model to the GGUF format. To see the script's available options, run:
python convert_hf_to_gguf.py --help
I then call the script on the directory where I stored the model when downloading it from Hugging Face, and specify the output file and the quantization type:
python convert_hf_to_gguf.py olmo \
--outfile OLMo-7B-Instruct-hf.gguf \
--outtype q8_0
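To verify that the conversion worked, you can load the resulting file with the llama-cpp-python wrapper we installed earlier and run a short completion. This is only a sanity-check sketch; adjust the model path and context size to your setup:
from llama_cpp import Llama

# Load the freshly converted GGUF file (path relative to local_models/).
llm = Llama(model_path="OLMo-7B-Instruct-hf.gguf", n_ctx=2048)

# Run a short raw completion just to confirm the weights load and generate text.
out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])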
Host the model using Ollama
Now comes the tricky part: creating a Modelfile which Ollama can use to run the model. Ollama should be able to generate this automatically, but in my case the detection of the chat template failed.
To figure out the template manually, I loaded the tokenizer using the transformers library in Python:
pip install transformers
from transformers import AutoTokenizer

# Load the tokenizer, which carries the model's chat template.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]

# Render the chat template as plain text to see the exact prompt format.
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(inputs)
This prints the following:
<|endoftext|><|user|>
What is 2+2?
<|assistant|>
4<|endoftext|>
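It is also worth rendering the template with add_generation_prompt=True, since that should show the trailing assistant marker the Modelfile template below has to end with. A small sketch, reusing the tokenizer object from the snippet above:
# Render the template again, this time with the generation prompt appended,
# i.e. the trailing assistant marker the model is expected to complete after.
messages = [{"role": "user", "content": "What is 2+2?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))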
This template output then translates to the following Modelfile, which I create in the olmo directory:
cd olmo
vi Modelfile
The first line specifies the path to the .gguf model, while the rest is the prompt template derived from the output above.
Figuring out the conversion to Ollama's Go template format was not that straightforward for me, but with a lot of trial and error and a bit of help from Claude 3.5 I landed on the following, which seems to work as intended.
FROM /full/path/to/your/local_models/OLMo-7B-Instruct-hf.gguf
TEMPLATE """{{ if .System }}<|endoftext|><|system|>
{{ .System }}{{ end }}{{ if .Prompt }}<|endoftext|><|user|>
{{ .Prompt }}{{ end }}<|endoftext|><|assistant|>
"""
Now, if you have not already installed Ollama, follow the official installation instructions. Then the local model can be added and run with the following commands:
ollama create olmo -f Modelfile
ollama run olmo
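Beyond the interactive ollama run session, you can also query the model programmatically through Ollama's local REST API. A minimal sketch using the requests library (pip install requests), assuming Ollama is serving on its default port 11434:
import requests

# Send a single prompt to the locally hosted olmo model via Ollama's generate endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "olmo", "prompt": "What is 2+2?", "stream": False},
)
print(resp.json()["response"])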
I hope all went well, and that you are now able to download, convert, and run models on your locally hosted Ollama instance!