Language Model Evaluation Results

Overview

This document presents the evaluation results of Llama-3.1-8B-Instruct-gptq-4bit using the Language Model Evaluation Harness on the ARC-Challenge benchmark.
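
For reproducibility, below is a minimal sketch of how a comparable run can be launched with the harness's Python API. The model arguments and settings mirror the configuration listed further down in this card; anything beyond that (e.g. the exact harness version) is an assumption.

```python
# Minimal sketch: an ARC-Challenge run with lm-evaluation-harness (pip install lm-eval).
# Settings mirror the configuration reported below (zero-shot, batch size 1, float16).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=0,
    batch_size=1,
)

# Per-task metrics include acc, acc_norm, and their standard errors.
print(results["results"]["arc_challenge"])
```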


πŸ“Š Evaluation Summary

| Metric | Value | Description | Original (unquantized) |
|--------|-------|-------------|------------------------|
| Accuracy (acc,none) | 47.1% | Raw accuracy: percentage of correct answers. | 53.1% |
| Standard Error (acc_stderr,none) | 1.46% | Uncertainty in the accuracy estimate. | 1.45% |
| Normalized Accuracy (acc_norm,none) | 49.9% | Accuracy after dataset-specific normalization. | 56.8% |
| Standard Error (acc_norm_stderr,none) | 1.46% | Uncertainty for the normalized accuracy. | 1.45% |

πŸ“Œ Interpretation:

  • The model correctly answered 47.1% of the questions.
  • After normalization, the accuracy slightly improves to 49.9%.
  • The standard error (~1.46%) indicates a small margin of uncertainty.

βš™οΈ Model Configuration

  • Model: Llama-3.1-8B-Instruct-gptq-4bit
  • Parameters: 1.05 billion (Quantized 4-bit model)
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
  • Evaluation Time: 365.89 seconds (~6 minutes)

πŸ“Œ Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • The model is 4-bit quantized, reducing memory usage but possibly affecting accuracy.
  • A single-sample batch size was used, which might slow evaluation speed.

πŸ“‚ Dataset Information

  • Dataset: AI2 ARC-Challenge
  • Task Type: Multiple Choice
  • Number of Samples Evaluated: 1,172
  • Few-shot Examples Used: 0 (Zero-shot setting)

πŸ“Œ Interpretation:

  • This benchmark assesses grade-school-level scientific reasoning.
  • Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.

πŸ“ˆ Performance Insights

  • The "higher_is_better" flag confirms that higher accuracy is preferred.
  • The model's raw accuracy (47.1%) is moderate compared to state-of-the-art models (60–80% on ARC-Challenge).
  • Quantization Impact: The 4-bit quantized model might perform slightly worse than a full-precision version.
  • Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).

πŸ“Œ Let us know if you need further analysis or model tuning! πŸš€

Citation

If you use this model in your research or project, please cite it as follows:

πŸ“Œ Dr. Wasif Masood (2024). 4bit Llama-3.1-8B-Instruct. Version 1.0.
Available at: https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit

BibTeX:

@dataset{rwmasood2024,
  author    = {Dr. Wasif Masood and Empirisch Tech GmbH},
  title     = {Llama-3.1-8B 4 bit quantized},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version   = {1.0},
  license   = {llama3.1},
  institution = {Empirisch Tech GmbH}
}