---
license: llama3.1
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Language Model Evaluation Results

## Overview

This document presents the evaluation results of `Llama-3.1-8B-Instruct-gptq-4bit`, measured with the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.

---

## 📊 Evaluation Summary

| **Metric** | **Value** | **Description** | **[Original model](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** |
|------------|-----------|-----------------|-----------|
| **Accuracy (`acc,none`)** | `47.1%` | Raw accuracy: percentage of correctly answered questions. | `53.1%` |
| **Standard Error (`acc_stderr,none`)** | `1.46%` | Uncertainty in the accuracy estimate. | `1.45%` |
| **Normalized Accuracy (`acc_norm,none`)** | `49.9%` | Accuracy when answer choices are scored with length-normalized log-likelihood. | `56.8%` |
| **Standard Error (`acc_norm_stderr,none`)** | `1.46%` | Uncertainty in the normalized-accuracy estimate. | `1.45%` |

📌 **Interpretation:**
- The model answered **47.1% of the questions** correctly.
- With **length normalization**, accuracy improves slightly to **49.9%**.
- The **standard error (~1.46%)** indicates a small margin of uncertainty around these estimates.

---

## ⚙️ Model Configuration

- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit`
- **Parameters:** `1.05 billion` as counted by the harness for the 4-bit packed weights (the underlying base model has ~8 billion parameters)
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`

📌 **Interpretation:**
- The evaluation ran on a **high-performance GPU (A100 80GB)**.
- The model is **4-bit quantized (GPTQ)**, which reduces memory usage but can reduce accuracy; a minimal loading sketch is included at the end of this card.
- A **batch size of 1** was used, which slows down the evaluation.

---

## 📂 Dataset Information

- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (zero-shot setting)

📌 **Interpretation:**
- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**.

---

## 📈 Performance Insights

- The `"higher_is_better"` flag in the harness output confirms that **higher accuracy is better** on this task.
- The model's **raw accuracy (47.1%)** is moderate compared with state-of-the-art models, which typically reach **60–80%** on ARC-Challenge.
- **Quantization Impact:** The **4-bit quantized model** scores a few points below the full-precision original (47.1% vs. 53.1% raw accuracy).
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing worked examples before each question); see the reproduction sketch at the end of this card.

---

📌 Let us know if you need further analysis or model tuning! 🚀

## **Citation**

If you use this model in your research or project, please cite it as follows:

📌 **Dr. Wasif Masood** (2024). *4bit Llama-3.1-8B-Instruct*. Version 1.0. Available at: [https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit](https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit)

### **BibTeX:**

```bibtex
@dataset{rwmasood2024,
  author      = {Dr. Wasif Masood and Empirisch Tech GmbH},
  title       = {Llama-3.1-8B 4 bit quantized},
  year        = {2024},
  publisher   = {Hugging Face},
  url         = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version     = {1.0},
  license     = {llama3.1},
  institution = {Empirisch Tech GmbH}
}
```
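
---

## 🧪 Usage & Reproduction Sketches

### Loading the 4-bit model

A minimal sketch of how this GPTQ checkpoint can be loaded with 🤗 Transformers. It assumes a GPTQ backend (e.g. `gptqmodel`, or `auto-gptq` together with `optimum`) is installed alongside `transformers` and `torch`; the prompt is an illustrative ARC-style question, not taken from the benchmark output.

```python
# Loading sketch: requires transformers, torch, and a GPTQ backend
# (e.g. gptqmodel or auto-gptq + optimum) to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # same precision as in this evaluation
    device_map="auto",          # place the 4-bit weights on the available GPU
)

# Illustrative ARC-style prompt (hypothetical, not from the benchmark run)
prompt = "Which property of a mineral can be determined just by looking at it?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```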
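
### Reproducing the ARC-Challenge run

The numbers in the summary table come from the Language Model Evaluation Harness. The sketch below shows an equivalent zero-shot run through the harness's Python API; `simple_evaluate` and the argument names reflect recent `lm_eval` releases and may differ in older versions. Raising `num_fewshot` corresponds to the few-shot prompting suggested under *Performance Insights*.

```python
# Zero-shot ARC-Challenge run mirroring the configuration reported above.
# Assumes the harness is installed:  pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,  # zero-shot, as in the reported results
    batch_size=1,   # matches the reported batch size
)

# The per-task dict carries the metrics used in the table above,
# e.g. "acc,none", "acc_stderr,none", "acc_norm,none".
print(results["results"]["arc_challenge"])
```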
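
### Interpreting the standard error

The "small margin of uncertainty" noted in the summary can be made concrete with a normal-approximation confidence interval; the sketch below only restates that arithmetic using the reported numbers.

```python
# Approximate 95% confidence interval from the reported accuracy and standard error.
acc, stderr = 0.471, 0.0146  # acc,none and acc_stderr,none from the table
z = 1.96                     # normal-approximation multiplier for a 95% interval

low, high = acc - z * stderr, acc + z * stderr
print(f"acc = {acc:.1%}  ->  95% CI ~ [{low:.1%}, {high:.1%}]")
# Roughly [44.2%, 50.0%]; the gap to the original FP16 model's 53.1%
# is therefore larger than the statistical noise of the measurement.
```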