|
--- |
|
license: llama3.1 |
|
datasets: |
|
- allenai/c4 |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
--- |
|
# Language Model Evaluation Results |
|
|
|
## Overview |
|
This document presents the evaluation results of `Llama-3.1-8B-Instruct-gptq-4bit` using the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark. |
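For reproducibility, here is a minimal sketch of the evaluation call using the harness's Python API (an assumption: `lm-evaluation-harness` v0.4+, where `lm_eval.simple_evaluate` is available; the settings mirror the configuration reported below):

```python
import lm_eval

# Zero-shot ARC-Challenge run, mirroring the configuration reported below
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=0,
    batch_size=1,
)
print(results["results"]["arc_challenge"])  # acc, acc_norm, and their standard errors
```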
|
|
|
--- |
|
|
|
## 📊 Evaluation Summary
|
|
|
| **Metric** | **Value** | **Description** | **[Original (unquantized)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** |
|----------------------|-----------|-----------------|-----------|
| **Accuracy (acc,none)** | `47.1%` | Raw accuracy: percentage of correct answers. | `53.1%` |
| **Standard Error (acc_stderr,none)** | `1.46%` | Uncertainty in the accuracy estimate. | `1.45%` |
| **Normalized Accuracy (acc_norm,none)** | `49.9%` | Accuracy after dataset-specific normalization. | `56.8%` |
| **Standard Error (acc_norm_stderr,none)** | `1.46%` | Uncertainty in the normalized accuracy estimate. | `1.45%` |
|
|
|
📌 **Interpretation:**

- The model answered **47.1% of the questions** correctly, versus **53.1%** for the full-precision original.
- After **normalization**, accuracy improves slightly to **49.9%** (original: **56.8%**).
- The **standard error (~1.46%)** indicates a small margin of uncertainty; a quick numerical check follows below.
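As a sanity check, the reported standard error matches the binomial formula `sqrt(p * (1 - p) / n)` with `n = 1172` (the sample count listed under Dataset Information), and a 95% confidence interval follows directly:

```python
import math

p = 0.471  # raw accuracy (acc,none)
n = 1172   # evaluated samples (see Dataset Information)

# Binomial standard error of the accuracy estimate
stderr = math.sqrt(p * (1 - p) / n)
print(f"stderr = {stderr:.4f}")  # ~0.0146, matching the reported 1.46%

# 95% confidence interval (normal approximation)
low, high = p - 1.96 * stderr, p + 1.96 * stderr
print(f"95% CI = [{low:.3f}, {high:.3f}]")  # ~[0.442, 0.500]
```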
|
|
|
--- |
|
|
|
## ⚙️ Model Configuration
|
|
|
- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit` |
|
- **Parameters:** `8 billion` (4-bit GPTQ quantized; the harness reports ~`1.05B` because GPTQ packs eight 4-bit weights into each 32-bit integer)
|
- **Source:** Hugging Face (`hf`) |
|
- **Precision:** `torch.float16` |
|
- **Hardware:** `NVIDIA A100 80GB PCIe` |
|
- **CUDA Version:** `12.4` |
|
- **PyTorch Version:** `2.6.0+cu124` |
|
- **Batch Size:** `1` |
|
- **Evaluation Time:** `365.89 seconds (~6 minutes)` |
|
|
|
📌 **Interpretation:**

- The evaluation was performed on a **high-performance GPU (NVIDIA A100 80 GB)**.
- The model is **4-bit GPTQ quantized**, reducing memory usage at some possible cost to accuracy (see the loading sketch below).
- A **batch size of 1** was used, which slows evaluation throughput.
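For reference, a minimal loading sketch with `transformers` (assumptions: a GPTQ-capable backend such as `auto-gptq` or `gptqmodel` is installed, and a CUDA GPU is available):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# GPTQ weights stay 4-bit in memory; compute runs in float16, as in the evaluation
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Photosynthesis converts sunlight into", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True))
```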
|
|
|
--- |
|
|
|
## 📚 Dataset Information
|
|
|
- **Dataset:** `AI2 ARC-Challenge` |
|
- **Task Type:** `Multiple Choice` |
|
- **Number of Samples Evaluated:** `1,172` |
|
- **Few-shot Examples Used:** `0` (Zero-shot setting) |
|
|
|
📌 **Interpretation:**

- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**. The test split can be inspected directly, as sketched below.
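To inspect the benchmark, the test split can be pulled from the Hub (a sketch assuming the `datasets` library; `allenai/ai2_arc` is the canonical ARC repository):

```python
from datasets import load_dataset

# ARC-Challenge test split: 1,172 multiple-choice science questions
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
print(len(arc))  # 1172

sample = arc[0]
print(sample["question"])
print(sample["choices"]["label"], sample["choices"]["text"])
print("answer:", sample["answerKey"])
```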
|
|
|
--- |
|
|
|
## 📈 Performance Insights
|
|
|
- The `"higher_is_better"` flag confirms that **higher accuracy is preferred**. |
|
- The model's **raw accuracy (47.1%)** is moderate compared to state-of-the-art models (**60β80%** on ARC-Challenge). |
|
- **Quantization Impact:** The **4-bit quantized model** might perform slightly worse than a full-precision version. |
|
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing examples before testing). |
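A sketch of such a few-shot re-run (25 shots is a common choice for ARC-Challenge; only `num_fewshot` changes relative to the zero-shot call shown in the Overview):

```python
import lm_eval

# Identical to the zero-shot run except for the number of in-context examples
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,  # in-context examples drawn by the harness
    batch_size=1,
)
print(results["results"]["arc_challenge"])
```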
|
|
|
--- |
|
|
|
Let us know if you need further analysis or model tuning!
|
|
|
## Citation
|
If you use this model in your research or project, please cite it as follows: |
|
|
|
📌 **Dr. Wasif Masood** (2024). *4-bit Llama-3.1-8B-Instruct*. Version 1.0.
|
Available at: [https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit](https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit) |
|
|
|
### BibTeX

```bibtex
@misc{rwmasood2024,
  author      = {Wasif Masood and Empirisch Tech GmbH},
  title       = {Llama-3.1-8B-Instruct 4-bit quantized (GPTQ)},
  year        = {2024},
  publisher   = {Hugging Face},
  url         = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version     = {1.0},
  license     = {llama3.1},
  institution = {Empirisch Tech GmbH}
}
```