---
license: llama3.1
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Language Model Evaluation Results

## Overview

This document presents the evaluation results of `Llama-3.1-8B-Instruct-gptq-4bit`, measured with the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.

---

## 📊 Evaluation Summary

| **Metric** | **Value** | **Description** | **[Original model](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** |
|------------|-----------|-----------------|-----------|
| **Accuracy (`acc,none`)** | `47.1%` | Raw accuracy: percentage of correctly answered questions. | `53.1%` |
| **Standard Error (`acc_stderr,none`)** | `1.46%` | Uncertainty in the accuracy estimate. | `1.45%` |
| **Normalized Accuracy (`acc_norm,none`)** | `49.9%` | Accuracy when answer choices are scored with length-normalized log-likelihood. | `56.8%` |
| **Standard Error (`acc_norm_stderr,none`)** | `1.46%` | Uncertainty in the normalized-accuracy estimate. | `1.45%` |

📌 **Interpretation:**
- The model answered **47.1% of the questions** correctly.
- With **length normalization**, accuracy improves slightly to **49.9%**.
- The **standard error (~1.46%)** indicates a small margin of uncertainty around these estimates.

---

## ⚙️ Model Configuration

- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit`
- **Parameters:** `1.05 billion` as counted by the harness for the 4-bit packed weights (the underlying base model has ~8 billion parameters)
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`

📌 **Interpretation:**
- The evaluation ran on a **high-performance GPU (A100 80GB)**.
- The model is **4-bit quantized (GPTQ)**, which reduces memory usage but can reduce accuracy; a minimal loading sketch is included at the end of this card.
- A **batch size of 1** was used, which slows down the evaluation.

---

## 📂 Dataset Information

- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (zero-shot setting)

📌 **Interpretation:**
- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**.

---

## 📈 Performance Insights

- The `"higher_is_better"` flag in the harness output confirms that **higher accuracy is better** on this task.
- The model's **raw accuracy (47.1%)** is moderate compared with state-of-the-art models, which typically reach **60–80%** on ARC-Challenge.
- **Quantization Impact:** The **4-bit quantized model** scores a few points below the full-precision original (47.1% vs. 53.1% raw accuracy).
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing worked examples before each question); see the reproduction sketch at the end of this card.

---

📌 Let us know if you need further analysis or model tuning! 🚀

## **Citation**

If you use this model in your research or project, please cite it as follows:

📌 **Dr. Wasif Masood** (2024). *4bit Llama-3.1-8B-Instruct*. Version 1.0. Available at: [https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit](https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit)

### **BibTeX:**

```bibtex
@dataset{rwmasood2024,
  author      = {Dr. Wasif Masood and Empirisch Tech GmbH},
  title       = {Llama-3.1-8B 4 bit quantized},
  year        = {2024},
  publisher   = {Hugging Face},
  url         = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version     = {1.0},
  license     = {llama3.1},
  institution = {Empirisch Tech GmbH}
}
```
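
---

## 🧪 Usage & Reproduction Sketches

### Loading the 4-bit model

A minimal sketch of how this GPTQ checkpoint can be loaded with 🤗 Transformers. It assumes a GPTQ backend (e.g. `gptqmodel`, or `auto-gptq` together with `optimum`) is installed alongside `transformers` and `torch`; the prompt is an illustrative ARC-style question, not taken from the benchmark output.

```python
# Loading sketch: requires transformers, torch, and a GPTQ backend
# (e.g. gptqmodel or auto-gptq + optimum) to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # same precision as in this evaluation
    device_map="auto",          # place the 4-bit weights on the available GPU
)

# Illustrative ARC-style prompt (hypothetical, not from the benchmark run)
prompt = "Which property of a mineral can be determined just by looking at it?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```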
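
### Reproducing the ARC-Challenge run

The numbers in the summary table come from the Language Model Evaluation Harness. The sketch below shows an equivalent zero-shot run through the harness's Python API; `simple_evaluate` and the argument names reflect recent `lm_eval` releases and may differ in older versions. Raising `num_fewshot` corresponds to the few-shot prompting suggested under *Performance Insights*.

```python
# Zero-shot ARC-Challenge run mirroring the configuration reported above.
# Assumes the harness is installed:  pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,  # zero-shot, as in the reported results
    batch_size=1,   # matches the reported batch size
)

# The per-task dict carries the metrics used in the table above,
# e.g. "acc,none", "acc_stderr,none", "acc_norm,none".
print(results["results"]["arc_challenge"])
```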
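
### Interpreting the standard error

The "small margin of uncertainty" noted in the summary can be made concrete with a normal-approximation confidence interval; the sketch below only restates that arithmetic using the reported numbers.

```python
# Approximate 95% confidence interval from the reported accuracy and standard error.
acc, stderr = 0.471, 0.0146  # acc,none and acc_stderr,none from the table
z = 1.96                     # normal-approximation multiplier for a 95% interval

low, high = acc - z * stderr, acc + z * stderr
print(f"acc = {acc:.1%}  ->  95% CI ~ [{low:.1%}, {high:.1%}]")
# Roughly [44.2%, 50.0%]; the gap to the original FP16 model's 53.1%
# is therefore larger than the statistical noise of the measurement.
```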