Granite-3.1-8B-Reasoning-GGUF (Quantized for Efficient Inference)

Model Overview

This is a GGUF quantized version of ruslanmv/granite-3.1-8b-Reasoning, fine-tuned from ibm-granite/granite-3.1-8b-instruct. The GGUF format enables efficient inference on CPUs and GPUs, and this release provides K-bit quantized variants at 4-bit, 5-bit, and 8-bit precision.

  • Developed by: ruslanmv
  • License: Apache 2.0
  • Base Model: ibm-granite/granite-3.1-8b-instruct
  • Fine-tuned for: Logical reasoning, structured problem-solving, long-context tasks
  • Quantized GGUF versions available:
    • 4-bit: Q4_K_M
    • 5-bit: Q5_K_M
    • 8-bit: Q8_0
  • Supported Languages: English
  • Architecture: Granite
  • Model Size: 8.17B params

Why Use the GGUF Quantized Version?

The GGUF format is designed for efficient CPU and GPU inference, making these quantized files well suited for:

  • Lower memory usage for efficient deployment (see the rough size estimate below)
  • Faster inference speeds on consumer hardware
  • Compatibility with popular inference engines such as llama.cpp, ctransformers, and KoboldCpp
  • Logical reasoning and analytical tasks, retaining the base model's capabilities at reduced precision
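
As a rough guide to the memory savings, the approximate size of each quantized file can be estimated from the parameter count and the nominal bits per weight. The sketch below is a lower-bound estimate only: K-quant blocks also store scale metadata, so real files are somewhat larger.

# Rough lower-bound file-size estimate per quantization level.
# Actual K-quant files are slightly larger due to per-block scale metadata.
params = 8.17e9  # parameter count reported for this model

for name, bits in [("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8)]:
    size_gb = params * bits / 8 / 1e9
    print(f"{name}: ~{size_gb:.1f} GB plus overhead")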


Installation & Usage

Install the llama-cpp-python bindings for llama.cpp:

pip install llama-cpp-python
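
The examples below assume a local copy of one of the GGUF files. One way to fetch it is with the huggingface_hub client; note that the exact filename used here is an assumption and should be checked against the repository's file listing.

from huggingface_hub import hf_hub_download

# Download one quantized variant; the filename below is illustrative,
# confirm the real name in the repository's file listing
model_path = hf_hub_download(
    repo_id="ruslanmv/granite-3.1-8b-Reasoning-GGUF",
    filename="granite-3.1-8b-Reasoning.Q4_K_M.gguf",
)
print(model_path)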

Running the model with llama.cpp (via llama-cpp-python):

from llama_cpp import Llama

# Path to the downloaded GGUF file (Q4_K_M, Q5_K_M, or Q8_0)
model_path = "path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf"

# Load the model; adjust n_ctx and n_threads for your hardware if needed
llm = Llama(model_path=model_path)

input_text = "Can you explain the difference between inductive and deductive reasoning?"
output = llm(input_text, max_tokens=400)

# The generated completion is returned under choices[0]["text"]
print(output["choices"][0]["text"])

Alternatively, using ctransformers:

pip install ctransformers

from ctransformers import AutoModelForCausalLM

# Path to the downloaded GGUF file
model_path = "path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf"

# model_type selects the architecture loader; gpu_layers controls how many
# layers are offloaded to the GPU (set to 0 for CPU-only inference)
model = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama", gpu_layers=50)

input_text = "What are the key principles of logical reasoning?"
output = model(input_text, max_new_tokens=400)

print(output)
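
ctransformers can also stream tokens as they are generated, which is useful for interactive applications. A minimal sketch, assuming the same model object as above:

# Stream tokens one at a time instead of waiting for the full completion
for token in model("Explain the difference between correlation and causation.",
                   max_new_tokens=400, stream=True):
    print(token, end="", flush=True)
print()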

Intended Use

Granite-3.1-8B-Reasoning-GGUF is designed for efficient inference while maintaining strong reasoning capabilities, making it ideal for:

  • Logical and analytical problem-solving
  • Text-based reasoning tasks
  • Mathematical and symbolic reasoning
  • Advanced instruction-following

This model is particularly beneficial for CPU-based deployments, low-memory environments, and users who need optimized text generation without requiring high-end GPUs.
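
For a CPU-only or memory-constrained setup, the llama-cpp-python loader exposes a few parameters worth tuning. The values below are illustrative starting points, not recommendations from the model authors.

from llama_cpp import Llama

# Illustrative CPU-only, low-memory configuration
llm = Llama(
    model_path="path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf",
    n_gpu_layers=0,   # keep all layers on the CPU
    n_ctx=2048,       # smaller context window reduces RAM usage
    n_threads=8,      # set to the number of physical CPU cores
)

output = llm("State one premise and one conclusion of a valid syllogism.", max_tokens=200)
print(output["choices"][0]["text"])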


License & Acknowledgments

This model is released under the Apache 2.0 license. It is fine-tuned from IBM's Granite 3.1-8B-Instruct model and quantized to the GGUF format for efficient inference. Special thanks to the IBM Granite Team for developing the base model.

For more details, visit the IBM Granite Documentation.


Citation

If you use this model in your research or applications, please cite:

@misc{ruslanmv2025granite,
  title={Fine-Tuning and GGUF Quantization of Granite-3.1-8B for Advanced Reasoning},
  author={Ruslan M.V.},
  year={2025},
  url={https://huggingface.co/ruslanmv/granite-3.1-8b-Reasoning-GGUF}
}