This is Mistral AI's Mixtral-8x7B-Instruct-v0.1 model, quantized to q5_k_m GGUF on 02/24/2024. It works well.

How to quantize your own models with Windows and an RTX GPU:

Requirements:

  • git
  • python
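Before starting, it can help to confirm that both requirements are reachable from the command prompt. The following is a minimal sketch, not part of the original guide. It assumes the tools are invoked as `git` and `python`, as in the commands below, and uses only the standard library's `shutil.which` to search PATH.

```python
import shutil


def missing_tools(tools: list[str]) -> list[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["git", "python"])
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("git and python are both on PATH")
```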

Instructions:

The following example starts at the root of the D: drive and quantizes Mistral AI's Mixtral-8x7B-Instruct-v0.1.

Windows command prompt - Set up the folders and git clone llama.cpp:
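The exact commands for this step are not reproduced here. As a minimal sketch, assuming the same layout the later commands use (a `llama.cpp` clone and a `Mixtral` folder side by side at the drive root), this step could be scripted as follows. The function name `prepare_layout` is hypothetical. The repository URL is the one behind the releases link in this guide. By default the clone command is only returned, not run.

```python
import subprocess
from pathlib import Path

# Repository URL taken from the releases link in this guide.
LLAMA_CPP_REPO = "https://github.com/ggerganov/llama.cpp"


def prepare_layout(root: Path, run_clone: bool = False) -> list[str]:
    """Create root/Mixtral and build the git clone command for root/llama.cpp.

    The clone is executed only when run_clone is True; otherwise the
    command is just returned so it can be inspected first.
    """
    (root / "Mixtral").mkdir(parents=True, exist_ok=True)
    cmd = ["git", "clone", LLAMA_CPP_REPO, str(root / "llama.cpp")]
    if run_clone:
        subprocess.run(cmd, check=True)
    return cmd
```

On Windows, `prepare_layout(Path("D:/"), run_clone=True)` would create `D:\Mixtral` and clone into `D:\llama.cpp`.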

Download llama.cpp

If you want CUDA support for your NVIDIA RTX GPU(s), use the links below, or grab the latest compiled executables from https://github.com/ggerganov/llama.cpp/releases

Latest version as of Feb 24, 2024:

Extract the two .zip files directly into the llama.cpp folder you just git cloned. Overwrite files as prompted.
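If you prefer to script the extraction, here is a minimal sketch using the standard library's `zipfile`. The archive names are not given here, so the function `extract_release_zips` and its arguments are hypothetical placeholders. `ZipFile.extractall` replaces files that already exist at the destination, which has the same effect as accepting every overwrite prompt.

```python
import zipfile
from pathlib import Path


def extract_release_zips(zip_paths: list[Path], llama_cpp_dir: Path) -> list[str]:
    """Extract each release .zip directly into llama_cpp_dir.

    Existing files with the same name are overwritten, matching the
    "overwrite files as prompted" instruction. Returns the member names
    that were extracted, in order.
    """
    extracted: list[str] = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(llama_cpp_dir)
            extracted.extend(archive.namelist())
    return extracted
```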

Download Mixtral
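Once the download finishes, a quick sanity check on the folder can catch an incomplete download before conversion. This is a rough sketch under an assumption not stated in this guide: that the checkpoint uses the usual Hugging Face layout, with a `config.json` plus weight shards saved as `*.safetensors`. The function name is hypothetical, and passing the check does not guarantee that every shard is present.

```python
from pathlib import Path


def looks_like_hf_checkpoint(model_dir: Path) -> bool:
    """Rough check that model_dir holds a Hugging Face checkpoint.

    Assumes a config.json next to one or more *.safetensors weight shards.
    """
    has_config = (model_dir / "config.json").is_file()
    has_weights = any(model_dir.glob("*.safetensors"))
    return has_config and has_weights
```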

Windows command prompt - Convert the model to fp16:

  • D:\llama.cpp>python convert.py D:\Mixtral --outtype f16 --outfile D:\Mixtral\Mixtral-8x7B-Instruct-v0.1.fp16.bin
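The conversion command above can also be driven from Python. A minimal sketch follows. It passes the same `--outtype f16` and `--outfile` arguments shown in the command. It uses `sys.executable` in place of a bare `python`, which is a choice made here, not in the guide. The helper name `convert_to_fp16` is hypothetical, and the command is only executed when `run=True`.

```python
import subprocess
import sys
from pathlib import Path


def convert_to_fp16(llama_cpp_dir: Path, model_dir: Path, outfile: Path,
                    run: bool = False) -> list[str]:
    """Build (and optionally run) llama.cpp's convert.py with f16 output."""
    cmd = [
        sys.executable,
        str(llama_cpp_dir / "convert.py"),
        str(model_dir),
        "--outtype", "f16",
        "--outfile", str(outfile),
    ]
    if run:
        # Run from inside the llama.cpp folder, like the D:\llama.cpp> prompt above.
        subprocess.run(cmd, check=True, cwd=llama_cpp_dir)
    return cmd
```

For the paths in this guide, the call would be `convert_to_fp16(Path(r"D:\llama.cpp"), Path(r"D:\Mixtral"), Path(r"D:\Mixtral\Mixtral-8x7B-Instruct-v0.1.fp16.bin"), run=True)`.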

Windows command prompt - Quantize the fp16 model to q5_k_m:

  • D:\llama.cpp>quantize.exe D:\Mixtral\Mixtral-8x7B-Instruct-v0.1.fp16.bin D:\Mixtral\Mixtral-8x7B-Instruct-v0.1.q5_k_m.gguf q5_k_m
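The quantize step follows a simple naming pattern: the output file is the fp16 file name with `.fp16.bin` swapped for `.q5_k_m.gguf`. The sketch below builds the same argument order as the command above (input, output, type) and derives the output name from the input. The helper `quantize_command` is hypothetical, and the suffix convention is simply the one this guide uses.

```python
import subprocess
from pathlib import Path

FP16_SUFFIX = ".fp16.bin"  # naming convention used by the convert step in this guide


def quantize_command(quantize_exe: Path, fp16_file: Path, qtype: str = "q5_k_m",
                     run: bool = False) -> list[str]:
    """Build (and optionally run) quantize.exe, naming the output after the input."""
    name = fp16_file.name
    if not name.endswith(FP16_SUFFIX):
        raise ValueError(f"expected a file ending in {FP16_SUFFIX}, got {name}")
    out_file = fp16_file.with_name(name[: -len(FP16_SUFFIX)] + f".{qtype}.gguf")
    cmd = [str(quantize_exe), str(fp16_file), str(out_file), qtype]
    if run:
        subprocess.run(cmd, check=True)
    return cmd
```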

That's it!

Model details:

  • Format: GGUF
  • Model size: 46.7B params
  • Architecture: llama
  • Quantization: 5-bit
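As a rough illustration of what quantization saves, the weight storage can be estimated as parameters × bits per weight ÷ 8. This back-of-the-envelope sketch uses the listed 46.7B parameters. At 16 bits it gives about 93.4 GB for the fp16 intermediate file. At a flat 5 bits it gives about 29.2 GB. Treat the 5-bit figure as a floor rather than the actual q5_k_m file size. As we understand llama.cpp's k-quants, each block also stores scales, and q5_k_m keeps some tensors at higher precision. File metadata is ignored here too.

```python
PARAMS = 46.7e9  # parameter count listed for this model


def size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes), ignoring metadata."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = size_gb(PARAMS, 16)      # fp16 intermediate from convert.py
q5_floor_gb = size_gb(PARAMS, 5)   # flat 5 bits per weight, a lower bound
print(f"fp16: ~{fp16_gb:.1f} GB, 5-bit floor: ~{q5_floor_gb:.1f} GB")
```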

Model tree: OptimizeLLM/Mixtral-8x7B-Instruct-v0.1.q5_k_m is a quantized version of the base model.