An elaborate guide to hosting LLMs locally
In this guide I will take you through the steps needed to:
- Download a model from Hugging Face
- Convert it to GGUF using llama.cpp (optionally quantize it)
- Host the model using Ollama
Being able to host your own models and try out different ones is very valuable, both in terms of privacy and task-specific performance.
I needed to do this for the open-source model OLMo-7B-Instruct-hf, so this is the one I will use as an example throughout this blog post.
Setup
First create an empty folder where we will store the models and required scripts.
mkdir local_models
cd local_models
To make managing dependencies (a bit) more convenient, I will use a conda environment. This is especially useful for downloading and managing llama.cpp and its Python wrapper.
conda create -n llms python=3.11
conda activate llms
Download a model from Hugging Face
Install huggingface-cli and log in (see the official instructions):
pip install -U "huggingface_hub[cli]"
huggingface-cli login
Now, to download a specific model you can simply run:
huggingface-cli download allenai/OLMo-7B-Instruct-hf \
--local-dir olmo \
--local-dir-use-symlinks False \
--revision main
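If you prefer to stay in Python, roughly the same download can be done with the huggingface_hub library. This is just a sketch mirroring the CLI call above (same repo id, local directory and revision):
from huggingface_hub import snapshot_download

# Download the whole model repository into the local "olmo" directory,
# mirroring the huggingface-cli call above.
snapshot_download(
    repo_id="allenai/OLMo-7B-Instruct-hf",
    local_dir="olmo",
    revision="main",
)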
Convert the model to GGUF using llama.cpp (optionally quantize it)
If we are not lucky enough that the model provider has already published a .gguf file, we need to create one ourselves using llama.cpp.
conda install "llama-cpp-python[server]" -c conda-forge
conda install pytorch -c conda-forge
pip install sentencepiece
This should download and install llama.cpp and the Python wrapper.
We also need a couple of extra Python utilities.
The first is the Python library gguf. It is hosted on PyPI, but to make it work at the time of writing this post, I needed to install the latest version from GitHub:
pip install git+https://github.com/ggerganov/llama.cpp.git#subdirectory=gguf-py --force-reinstall
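To make sure the GitHub build is actually the one being picked up, a quick sanity check in Python (just a habit of mine, not strictly required):
import gguf

# Show where the gguf package was installed from and its version, if it exposes one.
print(gguf.__file__)
print(getattr(gguf, "__version__", "unknown"))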
llama.cpp also provides a separate script to convert the models (don't ask me why this isn't part of the Python library…).
wget -O convert_hf_to_gguf.py https://raw.githubusercontent.com/ggerganov/llama.cpp/master/convert_hf_to_gguf.py
Now we can convert the model to the GGUF format. To see the script's available options, run:
python convert_hf_to_gguf.py --help
I then call the script on the directory where I stored the model when downloading it from Hugging Face, and specify the output file and the quantization type:
python convert_hf_to_gguf.py olmo \
--outfile OLMo-7B-Instruct-hf.gguf \
--outtype q8_0
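To verify that the conversion worked, you can load the resulting file with the llama-cpp-python wrapper we installed earlier and run a short completion. This is only a sanity-check sketch; adjust the model path and context size to your setup:
from llama_cpp import Llama

# Load the freshly converted GGUF file (path relative to local_models/).
llm = Llama(model_path="OLMo-7B-Instruct-hf.gguf", n_ctx=2048)

# Run a short raw completion just to confirm the weights load and generate text.
out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])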
Host the model using Ollama
Now comes the tricky part: creating a Modelfile which Ollama can use to run the model. Ollama should be able to generate this automatically, but in my case the detection of the chat template failed.
To figure out the template manually, I loaded the tokenizer using the transformers library in Python:
pip install transformers
from transformers import AutoTokenizer

# Load the tokenizer, which carries the model's chat template.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]

# Render the chat template as plain text to see the exact prompt format.
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(inputs)
This prints the following:
<|endoftext|><|user|>
What is 2+2?
<|assistant|>
4<|endoftext|>
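It is also worth rendering the template with add_generation_prompt=True, since that should show the trailing assistant marker the Modelfile template below has to end with. A small sketch, reusing the tokenizer object from the snippet above:
# Render the template again, this time with the generation prompt appended,
# i.e. the trailing assistant marker the model is expected to complete after.
messages = [{"role": "user", "content": "What is 2+2?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))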
This template output then translates to the following Modelfile, which I create in the olmo directory:
cd olmo
vi Modelfile
The first line specifies the path to the .gguf model, while the rest is the prompt template derived from the output above.
Figuring out the conversion to Ollama's Go template format was not that straightforward for me, but with a lot of trial and error and a bit of help from Claude 3.5 I landed on the following, which seems to work as intended.
FROM /full/path/to/your/local_models/OLMo-7B-Instruct-hf.gguf
TEMPLATE """{{ if .System }}<|endoftext|><|system|>
{{ .System }}{{ end }}{{ if .Prompt }}<|endoftext|><|user|>
{{ .Prompt }}{{ end }}<|endoftext|><|assistant|>
"""
Now, if you have not already installed Ollama, follow the official installation instructions. Then the local model can be added and run with the following commands:
ollama create olmo -f Modelfile
ollama run olmo
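Beyond the interactive ollama run session, you can also query the model programmatically through Ollama's local REST API. A minimal sketch using the requests library (pip install requests), assuming Ollama is serving on its default port 11434:
import requests

# Send a single prompt to the locally hosted olmo model via Ollama's generate endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "olmo", "prompt": "What is 2+2?", "stream": False},
)
print(resp.json()["response"])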
I hope all went well, and that you are now able to download, convert, and run models on your locally hosted Ollama instance!