|
# Online-DPO-R1 |
|
* **Blog**: https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175 |
|
* **Authors**: |
|
* **Code**: https://github.com/RLHFlow/Online-DPO-R1 |
|
|
|
## Introduction |
|
We release unofficial checkpoints for PPO, iterative DPO, and rejection sampling (RAFT) trained from Qwen2.5-MATH-7B-base with rule-based RL, building on the success of DeepSeek-R1-Zero and recent replications of the PPO approach.
|
Evaluated on five widely adopted benchmarks (**AIME 2024**, **MATH 500**, **AMC**, **Minerva Math**, and **OlympiadBench**), our **iterative DPO** and **RAFT** models achieve significant improvements over the base model and are comparable to the PPO approach.
|
Our models are trained on prompts from the MATH training set and NuminaMath.
|
|
|
Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R1) to reproduce the model. Enjoy! |
|
|
|
## Model Releases |
|
- [PPO model](https://huggingface.co/RLHFlow/Qwen2.5-7B-PPO-Zero)

- [Iterative DPO from SFT model](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO)

- [Iterative DPO from base model](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO-Zero)

- [Iterative DPO with Negative Log-Likelihood (NLL)](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO-NLL-Zero)

- [RAFT model](https://huggingface.co/RLHFlow/Qwen2.5-7B-RAFT-Zero)
|
|
|
## Dataset

The training prompts are drawn from the MATH training set and NuminaMath; see the [recipe](https://github.com/RLHFlow/Online-DPO-R1) for the data preparation details.
|
|
|
## Training methods |
|
- Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration we sample multiple responses from the last trained policy, rank them with the rule-based reward, and construct preference pairs. We then optimize the policy by minimizing the DPO loss and move on to the next iteration. Online iterative DPO effectively mitigates distribution shift and the limited coverage of offline data; a minimal sketch of one round is given below.
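
The following is a minimal sketch of one such round, not the exact training code: `policy.generate` and `rule_based_reward` are hypothetical stand-ins for the actual sampling and answer-verification pipeline, and the loss is the standard DPO objective with a reference model and temperature `beta`.

```python
# Sketch of one online iterative-DPO round.
# Assumptions: `policy` exposes a .generate(prompt) method and `rule_based_reward(prompt, response)`
# returns a scalar (e.g. 1 for a verified-correct final answer, 0 otherwise).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


def build_preference_pairs(prompts, policy, rule_based_reward, n_samples=8):
    """Sample several responses per prompt from the current policy, score them with the
    rule-based reward, and keep the best/worst responses as a (chosen, rejected) pair."""
    pairs = []
    for prompt in prompts:
        responses = [policy.generate(prompt) for _ in range(n_samples)]
        scored = sorted(responses, key=lambda r: rule_based_reward(prompt, r))
        best, worst = scored[-1], scored[0]
        if rule_based_reward(prompt, best) > rule_based_reward(prompt, worst):
            pairs.append((prompt, best, worst))  # skip prompts where all samples tie
    return pairs
```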
|
|
|
More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
|
|
|
|
|
## Performance |
|
| **Model** | **AIME 2024** | **MATH 500** | **AMC** | **Minerva Math** | **OlympiadBench** | **Average** |
|----------------------------|---------------|--------------|---------|------------------|-------------------|-------------|
| **Ours** | | | | | | |
| RLHFlow/Qwen2.5-7B-PPO-Zero | **43.3 (+26.6)** | 79.4 (+27.0) | **62.5 (+10.0)** | 33.1 (+20.2) | 40.7 (+24.3) | **51.8 (+21.6)** |
| RLHFlow/Qwen2.5-7B-DPO-Zero | 26.7 (+10.0) | 76.8 (+24.4) | **62.5 (+10.0)** | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
| RLHFlow/Qwen2.5-7B-DPO | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
| RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
| **Baselines** | | | | | | |
| Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
| Qwen2.5-Math-7B-Base + SFT Warm-up | 20.0 | 73.2 | 62.5 | 30.5 | 35.6 | 44.4 |
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| GPT-4o | 9.3 | 76.4 | 45.8 | 36.8 | 43.3 | 43.3 |

Numbers in parentheses are absolute improvements over Qwen2.5-Math-7B-Base.
|
|
|
|
|
## Usage |
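
A minimal usage sketch with Hugging Face `transformers`, assuming the released checkpoints load as standard Qwen2.5 causal-LM models; the example prompt and generation settings below are illustrative assumptions, not the exact evaluation setup.

```python
# Minimal sketch: load one of the released checkpoints (model id from the Model Releases list)
# and generate a solution for a single math prompt. Sampling settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Qwen2.5-7B-DPO-Zero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Solve: if 3x + 5 = 20, what is x? Put your final answer in \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```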
|
|
|
|
|
|
|
## Citation |
|
|