ThinkPO: a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses.

Introduction

  • Here, we show the results of open-source reasoning LLMs before and after applying ThinkPO.
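This card does not spell out the ThinkPO training recipe, so the snippet below is only a minimal sketch of a ThinkPO-style post-SFT step, assuming a DPO-style preference objective that prefers a long CoT answer ("chosen") over a short one ("rejected") for the same question, implemented with TRL's `DPOTrainer`. The dataset contents and hyperparameters are illustrative assumptions, not values from this card.

```python
# Minimal sketch, assuming a DPO-style objective via TRL's DPOTrainer;
# not necessarily the authors' exact recipe or hyperparameters.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Example SFT checkpoint from the tables below (any SFT reasoning model works).
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Each example pairs one question with a long CoT answer ("chosen") and a
# short answer ("rejected"), so no new long CoT responses need to be collected.
pairs = Dataset.from_dict({
    "prompt":   ["What is 12 * 13?"],
    "chosen":   ["<think>12*13 = 12*10 + 12*3 = 120 + 36 = 156</think> The answer is 156."],
    "rejected": ["The answer is 156."],
})

trainer = DPOTrainer(
    model=model,  # a frozen reference copy is created internally
    args=DPOConfig(output_dir="thinkpo-out", beta=0.1, per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```

Preferring the longer response nudges the already-SFT'd model toward more elaborate reasoning, which is consistent with the response-length increases reported below.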

Accuracy (%)

| Model | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B (DeepSeek) | MATH500 | 87.4 | 91.2 | 4.3 |
| | AIME | 56.7 | 43.3 | -23.6 |
| | GPQA | 47.0 | 49.5 | 5.3 |
| | GSM8K | 87.2 | 87.6 | 0.5 |
| | Olympiad | 58.6 | 58.6 | 0.0 |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 84.0 | 82.8 | -1.4 |
| | AIME | 20.0 | 23.3 | 16.5 |
| | GPQA | 37.9 | 43.4 | 14.5 |
| | GSM8K | 92.9 | 93.3 | 0.4 |
| | Olympiad | 44.1 | 48.5 | 10.0 |
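As a reference for how such scores are typically computed, here is a minimal exact-match accuracy sketch. The `\boxed{...}` extraction convention is an assumption (common for MATH500-style benchmarks); the exact answer format differs per benchmark.

```python
# Sketch of an exact-match accuracy metric (assumption: answers are
# reported in a final \boxed{...} span, as in MATH-style benchmarks).
import re
from typing import List, Optional

def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} span, if any (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def accuracy(generations: List[str], references: List[str]) -> float:
    """Percentage of generations whose boxed answer matches the reference."""
    correct = sum(extract_boxed(g) == r.strip()
                  for g, r in zip(generations, references))
    return 100.0 * correct / len(references)

print(accuracy([r"... so the answer is \boxed{156}."], ["156"]))  # 100.0
```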

Average Response Length (tokens)

| Model | Dataset | SFT | Ours (+ThinkPO) | Improv. (%) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B (DeepSeek) | MATH500 | 2577 | 3021 | 17.2 |
| | AIME | 11419 | 12875 | 12.8 |
| | GPQA | 4895 | 5604 | 14.5 |
| | GSM8K | 619 | 668 | 7.9 |
| | Olympiad | 7196 | 7383 | 2.6 |
| Bespoke-Stratos-7B (Bespoke) | MATH500 | 5696 | 6404 | 12.4 |
| | AIME | 19858 | 20079 | 1.1 |
| | GPQA | 5968 | 7301 | 22.3 |
| | GSM8K | 1404 | 1755 | 25.0 |
| | Olympiad | 11140 | 12204 | 9.6 |
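The length column can be approximated by counting newly generated tokens per answer, as in the sketch below. The repo id is a hypothetical placeholder (this card does not name the uploaded checkpoint), and `max_new_tokens` is an arbitrary cap.

```python
# Sketch of the "Average Response Length" metric: count newly generated
# tokens per answer. The repo id below is a placeholder, not from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/ThinkPO-7B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def response_length(question: str, max_new_tokens: int = 16384) -> int:
    """Generate an answer and return its length in tokens (prompt excluded)."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return output.shape[-1] - inputs["input_ids"].shape[-1]

questions = ["Compute 2^10.", "Solve x^2 = 49 for x."]
lengths = [response_length(q) for q in questions]
print(sum(lengths) / len(lengths))  # average response length in tokens
```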
