---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
---
|
|
|
# About |
|
This model is part of a research project by Novita AI focused on optimizing large language model inference efficiency while maintaining high performance. It applies quantization techniques to DeepSeek-R1-Distill-Llama-70B to achieve significant throughput improvements without compromising accuracy.
|
|
|
# Model Description |
|
DeepSeek-R1-Distill-Llama-70B is available in two configurations: |
|
- Standard configuration (bf16) |
|
- Optimized configuration with weight and KV cache quantization (w8a8kv8) |
|
|
|
# Key Features |
|
- Model Architecture: Based on the Llama architecture with 70B parameters |
|
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration

- Quantization Innovation:

  - Weight quantization

  - KV cache optimization using fp8

- Context Length: Supports up to 131,072 tokens

- Precision Options:

  - bf16 for the standard version

  - w8a8kv8 for the optimized version
|
|
|
# Methods |
|
The optimized configuration employs the following quantization techniques (a conceptual sketch follows the list):
|
- Weight quantization for model compression |
|
- KV cache optimization using fp8 |
|
- Backend optimization with FLASHINFER for enhanced performance |
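
For intuition only, and not Novita AI's actual quantization code, the sketch below illustrates the weight half of a w8a8 scheme with symmetric per-output-channel int8 quantization; the function names and scale shapes are illustrative assumptions.

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization (illustrative sketch)."""
    # One scale per output channel, chosen so the largest-magnitude weight maps to 127.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an approximation of the original full-precision weights.
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
print("max abs reconstruction error:", np.abs(dequantize(q, scale) - w).max())
```

In a full w8a8 pipeline, activations are quantized to int8 at runtime as well, and the fp8 KV cache halves the memory held by attention keys and values relative to bf16; this reduction in memory traffic is broadly what enables the throughput gain.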
|
|
|
# Model Usage |
|
|
|
## Quick Start |
|
|
|
For optimal performance with the w8a8kv8 configuration, select the FlashInfer attention backend before starting vLLM (the fp8 KV cache requires it):

```bash
# Environment setup
export VLLM_ATTENTION_BACKEND=FLASHINFER
```

Then load the model with vLLM's Python API. Note that the generation cap is a sampling parameter, not an engine argument:

```python
from vllm import LLM, SamplingParams

# Engine configuration for the optimized (w8a8kv8) setup
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # substitute the w8a8-quantized checkpoint
    max_model_len=131072,    # full 131,072-token context window
    tensor_parallel_size=2,  # the optimized configuration runs on 2 GPUs
    kv_cache_dtype="fp8",    # store the KV cache in fp8
)

# Cap generations at 1,024 tokens
sampling_params = SamplingParams(max_tokens=1024)

outputs = llm.generate(["Explain KV cache quantization in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)
```
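
If you serve the model with vLLM's OpenAI-compatible server instead, the equivalent command-line flags are `--max-model-len`, `--tensor-parallel-size`, and `--kv-cache-dtype fp8`, with `VLLM_ATTENTION_BACKEND` exported in the serving shell.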
|
|
|
# Hardware Requirements |
|
- Standard (bf16): 4 GPUs, tensor parallel size = 4 |
|
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2 |
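
These GPU counts follow from the weight footprint. A back-of-the-envelope estimate (weights only; activations and the KV cache add further memory, so treat these as lower bounds):

```python
# Rough weight-memory arithmetic behind the GPU counts above.
PARAMS = 70e9  # 70B parameters

for name, bytes_per_param, gpus in [("bf16", 2, 4), ("w8a8 (int8 weights)", 1, 2)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{weights_gb:.0f} GB weights -> ~{weights_gb / gpus:.0f} GB/GPU across {gpus} GPUs")
```

Halving the bytes per weight is what lets the optimized configuration fit on half as many GPUs at a comparable per-GPU footprint (~35 GB of weights each).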
|
|
|
# Model Evaluation |
|
|
|
## Benchmark Results |
|
1. Throughput Performance:

   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16

2. MMLU Benchmark Scores (a reproduction sketch follows this list):

   - bf16: 0.5158 (exact match)

   - w8a8kv8: 0.5169 (exact match)

3. Subject-specific Performance:

   - Notable improvements in:

     - Biology (+1.11%)

     - Economics (+0.83%)

     - Physics (+0.92%)

   - Slight regressions in:

     - History (-1.57%)

     - Law (-1.46%)
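
The card does not state which harness produced these scores. One plausible way to reproduce the MMLU exact-match numbers is EleutherAI's lm-evaluation-harness with its vLLM backend; the checkpoint path and engine settings below are assumptions, not the card's recipe.

```python
# pip install "lm_eval[vllm]"
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-70B,"  # substitute the w8a8 checkpoint
        "tensor_parallel_size=2,kv_cache_dtype=fp8,max_model_len=131072"
    ),
    tasks=["mmlu"],
)
print(results["results"]["mmlu"])  # aggregate MMLU metrics
```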
|
|
|
# Limitations and Bias |
|
- The fp8 KV cache requires a specific attention backend (FLASHINFER in vLLM)

- Performance may vary depending on hardware configuration

- Accuracy shows small subject-level variations across domains (see Benchmark Results)
|
|
|
# Community |
|
Join our community discussions and get support: |
|
- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P) |