---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
---
# About
This model is part of a research project by Novita AI on improving large language model inference efficiency while preserving output quality. DeepSeek-R1-Distill-Llama-70B is served with weight and KV cache quantization to raise throughput without a measurable loss in accuracy.
# Model Description
DeepSeek-R1-Distill-Llama-70B is available in two configurations:
- Standard configuration (bf16)
- Optimized configuration with weight, activation, and KV cache quantization (w8a8kv8: 8-bit weights and activations, fp8 KV cache)
# Key Features
- Model Architecture: based on the Llama architecture with 70B parameters
- Optimized Performance: achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization:
  - 8-bit weight quantization
  - fp8 KV cache (see the storage sketch after this list)
- Context Length: supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version
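As a rough illustration of what fp8 KV cache storage means, the snippet below casts a key tensor to PyTorch's `float8_e4m3fn` format and back. This is a sketch only; the serving engine performs this inside the attention backend, and the float8 dtypes require PyTorch 2.1 or newer.

```python
import torch

# A bf16 key block: (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 128, dtype=torch.bfloat16)

# Store in fp8: half the bytes of bf16
k_fp8 = k.to(torch.float8_e4m3fn)

# Upcast before the attention math
k_restored = k_fp8.to(torch.bfloat16)

print(k.element_size(), k_fp8.element_size())  # 2 bytes vs 1 byte per element
print((k - k_restored).abs().max())            # small rounding error
```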
# Methods
The optimized configuration combines three techniques (an illustrative weight-quantization sketch follows this list):
- 8-bit weight quantization for model compression
- fp8 KV cache storage, halving cache memory relative to bf16
- the FLASHINFER attention backend, which accelerates attention over the fp8 KV cache
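This card does not specify the exact quantization recipe, so the sketch below shows one common form of 8-bit weight quantization (per-output-channel symmetric int8) purely to make the idea concrete:

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Per-output-channel symmetric int8 quantization (illustrative only)."""
    # One scale per output row, chosen so the largest |w| maps to 127
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print((w - dequantize_int8(q, s)).abs().max())  # small per-weight rounding error
```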
# Model Usage
## Quick Start
For optimal performance with the w8a8kv8 configuration:
```python
import os

# Select the FLASHINFER attention backend (required for the fp8 KV cache).
# This must be set before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Model configuration for the w8a8kv8 checkpoint
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # substitute the w8a8kv8 checkpoint path
    max_model_len=131072,      # full 131,072-token context window
    tensor_parallel_size=2,    # two GPUs for the optimized configuration
    kv_cache_dtype="fp8",      # store the KV cache in fp8
)
sampling_params = SamplingParams(max_tokens=1024)  # generation length cap
```
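With the model loaded, generation follows the standard vLLM API:

```python
outputs = llm.generate(["Briefly explain KV cache quantization."], sampling_params)
print(outputs[0].outputs[0].text)
```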
# Hardware Requirements
- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
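For the bf16 baseline, the equivalent vLLM setup would look like the following (a sketch; the checkpoint path is assumed, as above):

```python
from vllm import LLM

# bf16 baseline: four GPUs with tensor parallel size 4
llm_bf16 = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # base bf16 checkpoint
    dtype="bfloat16",
    tensor_parallel_size=4,
    max_model_len=131072,
)
```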
# Model Evaluation
## Benchmark Results
1. Throughput:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16.
2. MMLU scores (exact match; a scoring sketch follows this list):
   - bf16: 0.5158
   - w8a8kv8: 0.5169
3. Subject-specific performance:
   - Notable improvements: Biology (+1.11%), Economics (+0.83%), Physics (+0.92%)
   - Slight regressions: History (-1.57%), Law (-1.46%)
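The MMLU numbers above are exact-match accuracies. As a point of reference, exact match reduces to the following computation (a minimal sketch; the actual evaluation harness is not specified in this card):

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# A score of 0.5158 means ~51.6% of MMLU answers matched exactly.
print(exact_match(["B", "C", "A"], ["B", "D", "A"]))  # 0.666...
```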
# Limitations and Bias
- The fp8 KV cache requires a supported attention backend (e.g. FLASHINFER)
- Performance may vary with hardware configuration
- Subject-level accuracy varies slightly across domains
# Community
Join our community discussions and get support:
- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)