ThinkPO is a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses.
Introduction
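The card does not describe the ThinkPO training step itself, so the sketch below is a minimal, hedged illustration of what a DPO-style post-SFT step with long-vs-short CoT preference pairs could look like, assuming (as the name and results suggest) that existing long CoT responses are treated as preferred over shorter ones. The checkpoint name, toy data, and hyperparameters are placeholders, not the recipe used for the results below.

```python
# Hedged sketch only: one plausible DPO-style post-SFT step with TRL, where the
# long chain-of-thought answer is "chosen" and a terse answer is "rejected".
# Model name, data, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "your-sft-checkpoint"  # placeholder: the SFT model being post-trained
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Toy preference pair: same prompt, long chain-of-thought preferred over a short answer.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 12 * 13?"],
    "chosen": ["Let's think step by step. 12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156. The answer is 156."],
    "rejected": ["156."],
})

args = DPOConfig(output_dir="thinkpo-sketch", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```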
- Here, we show the results of open-source reasoning LLMs before (SFT) and after applying ThinkPO. Improv. (%) is the relative change over the SFT baseline; a worked example follows the accuracy table below.
Accuracy
| Models | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B (Deepseek) | MATH500 | 87.4 | 91.2 | 4.3% |
| | AIME | 56.7 | 43.3 | -23.6% |
| | GPQA | 47.0 | 49.5 | 5.3% |
| | GSM8K | 87.2 | 87.6 | 0.5% |
| | Olympiad | 58.6 | 58.6 | 0.0% |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 84.0 | 82.8 | -1.4% |
| | AIME | 20.0 | 23.3 | 16.5% |
| | GPQA | 37.9 | 43.4 | 14.5% |
| | GSM8K | 92.9 | 93.3 | 0.4% |
| | Olympiad | 44.1 | 48.5 | 10.0% |
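For clarity, the Improv. (%) columns in both tables correspond to a relative change over the SFT baseline, as the small check below illustrates (values taken from the MATH500 rows).

```python
def relative_improvement(sft: float, thinkpo: float) -> float:
    """Relative change over the SFT baseline, in percent."""
    return (thinkpo - sft) / sft * 100

print(round(relative_improvement(87.4, 91.2), 1))   # 4.3  -> MATH500 accuracy row
print(round(relative_improvement(2577, 3021), 1))   # 17.2 -> MATH500 response-length row (table below)
```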
Average Response Length
| Model | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B (Deepseek) | MATH500 | 2577 | 3021 | 17.2% |
| | AIME | 11419 | 12875 | 12.8% |
| | GPQA | 4895 | 5604 | 14.5% |
| | GSM8K | 619 | 668 | 7.9% |
| | Olympiad | 7196 | 7383 | 2.6% |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 5696 | 6404 | 12.4% |
| | AIME | 19858 | 20079 | 1.1% |
| | GPQA | 5968 | 7301 | 22.3% |
| | GSM8K | 1404 | 1755 | 25.0% |
| | Olympiad | 11140 | 12204 | 9.6% |
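The card does not include a usage example, so the following is a minimal sketch of one way to load and prompt the checkpoint with the Transformers library. The repository id is a placeholder (substitute this model's actual id), and the generation settings are illustrative rather than the ones used for the benchmarks above.

```python
# Minimal usage sketch with Hugging Face Transformers.
# "ORG/MODEL-ThinkPO" is a placeholder: substitute the actual repository id of this model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ORG/MODEL-ThinkPO"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 20 positive integers?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Long-CoT models tend to produce lengthy reasoning, so allow a generous token budget.
outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```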