Use Ollama with any GGUF Model on Hugging Face Hub
🆕 You can now also run private GGUFs from the Hugging Face Hub.
Ollama is an application based on llama.cpp to interact with LLMs directly through your computer. You can use any GGUF quants created by the community (bartowski, MaziyarPanahi and many more) on Hugging Face directly with Ollama, without creating a new Modelfile. At the time of writing there are 45K public GGUF checkpoints on the Hub; you can run any of them with a single ollama run command. We also provide customisations like choosing quantization type, system prompt and more to improve your overall experience.
Getting started is as simple as:
- Enable ollama under your Local Apps settings.
- On a model page, choose ollama from the Use this model dropdown. For example: bartowski/Llama-3.2-1B-Instruct-GGUF.
The snippet will be in the format:
ollama run hf.co/{username}/{repository}
Please note that you can use both hf.co and huggingface.co as the domain name.
Here are some models you can try:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
Custom Quantization
By default, the Q4_K_M quantization scheme is used when it’s present inside the model repo. If not, we fall back to a reasonable quant type that is present inside the repo.
To select a different scheme, simply:
- From the Files and versions tab on a model page, open the GGUF viewer on a particular GGUF file.
- Choose ollama from the Use this model dropdown.
The snippet will be in the format (quantization tag added):
ollama run hf.co/{username}/{repository}:{quantization}
For example:
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
# the quantization name is case-insensitive, this will also work
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m
# you can also directly use the full filename as a tag
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf
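If you switch between quantizations often, the tag format above can be wrapped in a small shell function. This is only a minimal sketch: hf_ref is a hypothetical helper name, not part of Ollama or the Hub, and the script prints the resulting commands instead of running them.

```shell
# hf_ref is a hypothetical helper, not part of Ollama or the Hub.
# It assembles a model reference: hf.co/{username}/{repository}[:{quantization}]
hf_ref() {
  user="$1"
  repo="$2"
  quant="${3:-}"   # optional: quantization name or full GGUF filename
  ref="hf.co/${user}/${repo}"
  if [ -n "$quant" ]; then
    ref="${ref}:${quant}"
  fi
  printf '%s\n' "$ref"
}

# Print the commands rather than running them, so nothing is downloaded here.
echo "ollama run $(hf_ref bartowski Llama-3.2-3B-Instruct-GGUF)"
echo "ollama run $(hf_ref bartowski Llama-3.2-3B-Instruct-GGUF IQ3_M)"
```

Omitting the third argument leaves the tag off, so the default quantization selection described above applies.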
Custom Chat Template and Parameters
By default, a template will be selected automatically from a list of commonly used templates. It will be selected based on the built-in tokenizer.chat_template metadata stored inside the GGUF file.
If your GGUF file doesn’t have a built-in template or if you want to customize your chat template, you can create a new file called template in the repository. The template must be a Go template, not a Jinja template. Here’s an example:
{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
To learn more about the Go template format, please refer to this documentation.
You can optionally configure a system prompt by putting it into a new file named system in the repository.
To change sampling parameters, create a file named params in the repository. The file must be in JSON format. For the list of all available parameters, please refer to this documentation.
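The three optional files described above can be created locally before uploading them to your repository. The following is a sketch under assumptions: my-gguf-repo is a hypothetical local clone, the system prompt wording is made up, and the params values are illustrative rather than recommended settings (temperature and top_p are used here as examples of Ollama sampling parameters).

```shell
# my-gguf-repo is a hypothetical local clone of your model repository.
repo_dir="my-gguf-repo"
mkdir -p "$repo_dir"

# template: a Go template (the same example as in the text).
# The quoted 'EOF' keeps {{ }} and | characters literal.
cat > "$repo_dir/template" <<'EOF'
{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
EOF

# system: plain-text system prompt (illustrative wording).
printf '%s\n' "You are a concise, helpful assistant." > "$repo_dir/system"

# params: a JSON object of sampling parameters (illustrative values).
cat > "$repo_dir/params" <<'EOF'
{
  "temperature": 0.6,
  "top_p": 0.9
}
EOF

ls "$repo_dir"
```

The files only take effect once they are committed and pushed to the model repository on the Hub, next to the GGUF files.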
Run Private GGUFs from the Hugging Face Hub
You can run private GGUFs from your personal account or from an associated organisation account in two simple steps:
- Copy your Ollama SSH key. You can do so via:
cat ~/.ollama/id_ed25519.pub | pbcopy
- Add the corresponding key to your Hugging Face account by going to your account settings and clicking on Add new SSH key.
- That’s it! You can now run private GGUFs from the Hugging Face Hub: ollama run hf.co/{username}/{repository}.
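Note that pbcopy is a macOS utility. On systems without it, one option is to print the public key and copy it by hand. The sketch below assumes the key lives at the path used above unless another path is passed as the first argument; show_ollama_key is a hypothetical helper name, not an Ollama command.

```shell
# show_ollama_key is a hypothetical helper, not an Ollama command.
# It prints the public key so it can be copied manually.
show_ollama_key() {
  key_file="${1:-$HOME/.ollama/id_ed25519.pub}"
  if [ -f "$key_file" ]; then
    cat "$key_file"
  else
    echo "No public key found at $key_file" >&2
    return 1
  fi
}

show_ollama_key || echo "Pass the key path as the first argument if it lives elsewhere."
```

The printed line is what you paste into the Add new SSH key form in your Hugging Face account settings.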
References
- https://github.com/ollama/ollama/blob/main/docs/README.md
- https://huggingface.co/docs/hub/en/gguf