---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
---

# About
This model is part of a research project by Novita AI focused on optimizing large language model inference efficiency while maintaining high performance. The DeepSeek-R1-Distill-Llama-70B release applies quantization techniques that deliver 1.6× higher throughput (with the w8a8kv8 configuration) without compromising accuracy.

# Model Description
DeepSeek-R1-Distill-Llama-70B is available in two configurations:
- Standard configuration (bf16)
- Optimized configuration with weight and KV cache quantization (w8a8kv8)

# Key Features
- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization:
  - 8-bit weight and activation quantization (w8a8)
  - fp8 KV cache
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version

# Methods
The model employs the following quantization and serving techniques:
- 8-bit weight and activation quantization (w8a8) for model compression
- fp8 KV cache to reduce cache memory (see the conceptual sketch below)
- The FLASHINFER attention backend in vLLM for efficient fp8 KV-cache attention
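
For intuition, here is a minimal, self-contained sketch of how fp8 (e4m3) KV-cache quantization works at the tensor level: values are rescaled into the fp8 range and stored together with a scale factor. The function names, the per-tensor scaling scheme, and the toy tensor shapes are illustrative assumptions, not Novita AI's implementation; when serving with vLLM, this step is handled internally once `kv_cache_dtype="fp8"` is set.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize_kv_fp8(kv: torch.Tensor):
    """Toy per-tensor fp8 quantization of a KV-cache block (illustrative only)."""
    scale = kv.abs().max().to(torch.float32).clamp(min=1e-8) / FP8_E4M3_MAX
    kv_fp8 = (kv.to(torch.float32) / scale).to(torch.float8_e4m3fn)
    return kv_fp8, scale


def dequantize_kv_fp8(kv_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original values for attention.
    return kv_fp8.to(torch.float32) * scale


# Toy KV block: [num_heads, num_tokens, head_dim]
kv = torch.randn(8, 16, 128, dtype=torch.bfloat16)
kv_fp8, scale = quantize_kv_fp8(kv)
kv_approx = dequantize_kv_fp8(kv_fp8, scale)
print("max abs error:", (kv.to(torch.float32) - kv_approx).abs().max().item())
```

Storing keys and values in 8 bits halves KV-cache memory relative to bf16, which contributes to the throughput gain reported below.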

# Model Usage

## Quick Start

For optimal performance with the w8a8kv8 configuration, enable the FLASHINFER attention backend and use the settings below (the model path is illustrative; point it at the w8a8kv8 checkpoint):
```python
import os

# Use the FLASHINFER attention backend for the fp8 KV cache; set it before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # replace with the w8a8kv8 checkpoint path
    max_model_len=131072,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
)

# The original config's max_gen_tokens maps to SamplingParams.max_tokens
sampling_params = SamplingParams(max_tokens=1024)
outputs = llm.generate(["Why is the sky blue?"], sampling_params)
print(outputs[0].outputs[0].text)
```

# Hardware Requirements
- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
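
For comparison with the Quick Start above, a minimal sketch of loading the standard bf16 configuration under the 4-GPU requirement (the model path is the base checkpoint and is shown for illustration):

```python
from vllm import LLM

# Standard bf16 configuration: 4 GPUs, tensor parallel size = 4.
llm_bf16 = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # illustrative; point at the bf16 checkpoint being served
    dtype="bfloat16",
    max_model_len=131072,
    tensor_parallel_size=4,
)
```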

# Model Evaluation

## Benchmark Results
1. Throughput:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
2. MMLU (exact match):
   - bf16: 0.5158
   - w8a8kv8: 0.5169
3. Subject-specific performance (w8a8kv8 vs. bf16):
   - Improvements in:
     - Biology (+1.11%)
     - Economics (+0.83%)
     - Physics (+0.92%)
   - Regressions in:
     - History (-1.57%)
     - Law (-1.46%)

# Limitations and Bias
- The fp8 KV cache requires backend support (e.g., the FLASHINFER attention backend in vLLM)
- Performance may vary depending on hardware configuration
- Subject-specific performance shows slight variations across different domains

# Community
Join our community discussions and get support:
- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)