ThinkPO: a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses.

Introduction

  • Here, we show the results of open-source reasoning LLMs before and after applying ThinkPO.
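This card does not spell out the ThinkPO training recipe, so the snippet below is only a minimal sketch of a ThinkPO-style post-SFT step, assuming a DPO-style preference objective that prefers a long CoT answer ("chosen") over a short one ("rejected") for the same question, implemented with TRL's `DPOTrainer`. The dataset contents and hyperparameters are illustrative assumptions, not values from this card.

```python
# Minimal sketch, assuming a DPO-style objective via TRL's DPOTrainer;
# not necessarily the authors' exact recipe or hyperparameters.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Example SFT checkpoint from the tables below (any SFT reasoning model works).
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Each example pairs one question with a long CoT answer ("chosen") and a
# short answer ("rejected"), so no new long CoT responses need to be collected.
pairs = Dataset.from_dict({
    "prompt":   ["What is 12 * 13?"],
    "chosen":   ["<think>12*13 = 12*10 + 12*3 = 120 + 36 = 156</think> The answer is 156."],
    "rejected": ["The answer is 156."],
})

trainer = DPOTrainer(
    model=model,  # a frozen reference copy is created internally
    args=DPOConfig(output_dir="thinkpo-out", beta=0.1, per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```

Preferring the longer response nudges the already-SFT'd model toward more elaborate reasoning, which is consistent with the response-length increases reported below.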

Accuracy (%)

| Model | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B (DeepSeek) | MATH500 | 87.4 | 91.2 | 4.3 |
| | AIME | 56.7 | 43.3 | -23.6 |
| | GPQA | 47.0 | 49.5 | 5.3 |
| | GSM8K | 87.2 | 87.6 | 0.5 |
| | Olympiad | 58.6 | 58.6 | 0.0 |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 84.0 | 82.8 | -1.4 |
| | AIME | 20.0 | 23.3 | 16.5 |
| | GPQA | 37.9 | 43.4 | 14.5 |
| | GSM8K | 92.9 | 93.3 | 0.4 |
| | Olympiad | 44.1 | 48.5 | 10.0 |
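As a reference for how such scores are typically computed, here is a minimal exact-match accuracy sketch. The `\boxed{...}` extraction convention is an assumption (common for MATH500-style benchmarks); the exact answer format differs per benchmark.

```python
# Sketch of an exact-match accuracy metric (assumption: answers are
# reported in a final \boxed{...} span, as in MATH-style benchmarks).
import re
from typing import List, Optional

def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} span, if any (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def accuracy(generations: List[str], references: List[str]) -> float:
    """Percentage of generations whose boxed answer matches the reference."""
    correct = sum(extract_boxed(g) == r.strip()
                  for g, r in zip(generations, references))
    return 100.0 * correct / len(references)

print(accuracy([r"... so the answer is \boxed{156}."], ["156"]))  # 100.0
```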

Average Response Length (tokens)

| Model | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B (DeepSeek) | MATH500 | 2577 | 3021 | 17.2 |
| | AIME | 11419 | 12875 | 12.8 |
| | GPQA | 4895 | 5604 | 14.5 |
| | GSM8K | 619 | 668 | 7.9 |
| | Olympiad | 7196 | 7383 | 2.6 |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 5696 | 6404 | 12.4 |
| | AIME | 19858 | 20079 | 1.1 |
| | GPQA | 5968 | 7301 | 22.3 |
| | GSM8K | 1404 | 1755 | 25.0 |
| | Olympiad | 11140 | 12204 | 9.6 |
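The length column can be approximated by counting newly generated tokens per answer, as in the sketch below. The repo id is a hypothetical placeholder (this card does not name the uploaded checkpoint), and `max_new_tokens` is an arbitrary cap.

```python
# Sketch of the "Average Response Length" metric: count newly generated
# tokens per answer. The repo id below is a placeholder, not from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/ThinkPO-7B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def response_length(question: str, max_new_tokens: int = 16384) -> int:
    """Generate an answer and return its length in tokens (prompt excluded)."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return output.shape[-1] - inputs["input_ids"].shape[-1]

questions = ["Compute 2^10.", "Solve x^2 = 49 for x."]
lengths = [response_length(q) for q in questions]
print(sum(lengths) / len(lengths))  # average response length in tokens
```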
