# Online-DPO-R1
* **Blog**: https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175
* **Authors**:
* **Code**: https://github.com/RLHFlow/Online-DPO-R1
## Introduction
We release unofficial checkpoints for PPO, iterative DPO, and rejection sampling (RAFT) trained from Qwen2.5-Math-7B-Base with rule-based RL, building on the success of DeepSeek-R1-Zero and recent replications of the PPO approach.
Evaluated on five widely adopted benchmarks, **AIME 2024**, **MATH 500**, **AMC**, **Minerva Math**, and **OlympiadBench**, our **iterative DPO** and **RAFT** models achieve
significant improvements over the base model and are comparable to the PPO approach.
Our models are trained using prompts from the MATH training set and NuminaMath.
Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R1) to reproduce the models. Enjoy!
## Model Releases
- [PPO model](https://huggingface.co/RLHFlow/Qwen2.5-7B-PPO-Zero)
- [Iterative DPO from SFT model](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO)
- [Iterative DPO from base model](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO-Zero)
- [Iterative DPO with Negative Log-Likelihood (NLL)](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO-NLL-Zero)
- [RAFT model](https://huggingface.co/RLHFlow/Qwen2.5-7B-RAFT-Zero)
## Dataset
## Training methods
- Iterative DPO: Following the [RLHF Workflow](https://arxiv.org/pdf/2405.07863) framework, in each iteration we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct preference pairs. Then, we optimize the policy by minimizing the DPO loss and enter the next iteration. Online iterative DPO effectively mitigates the distribution shift and limited coverage issues of offline data; a minimal sketch of one iteration follows below.

More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
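The sketch below shows the two core pieces of one iteration: constructing a preference pair from sampled responses via the rule-based reward, and the standard DPO loss. The names `rule_based_reward` and `build_preference_pair` are hypothetical helpers for illustration, and the log-probabilities are assumed to be summed over response tokens; this is not the exact training code from the recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective: -log sigmoid(beta * (chosen_margin - rejected_margin)),
    # where each margin is the policy-vs-reference log-prob gap of a response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def build_preference_pair(prompt, responses, rule_based_reward):
    # Hypothetical helper: rank the sampled responses by the rule-based reward
    # (e.g. 1.0 if the final boxed answer is correct, else 0.0) and pair the
    # best against the worst. Prompts where all rewards tie yield no pair.
    ranked = sorted(responses, key=rule_based_reward, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if rule_based_reward(chosen) == rule_based_reward(rejected):
        return None
    return prompt, chosen, rejected
```

Because the pairs are regenerated from the current policy in every iteration, the training data stays close to on-policy, which is what mitigates the distribution-shift problem of a fixed offline preference set.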
## Performance
| **Model** | **AIME 2024** | **MATH 500** | **AMC** | **Minerva Math** | **OlympiadBench** | **Average** |
|----------------------------|---------------|--------------|---------|------------------|-------------------|-------------|
| **Ours** | | | | | | |
| RLHFlow/Qwen2.5-7B-PPO-Zero | **43.3 (+26.6)** | 79.4 (+27.0) | **62.5 (+10.0)** | 33.1 (+20.2) | 40.7 (+24.3) | **51.8 (+21.6)** |
| RLHFlow/Qwen2.5-7B-DPO-Zero | 26.7 (+10.0) | 76.8 (+24.4) | **62.5 (+10.0)** | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
| RLHFlow/Qwen2.5-7B-DPO | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
| RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
| **Baselines** | | | | | | |
| Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
| Qwen2.5-Math-7B-Base + SFT Warm-up | 20.0 | 73.2 | 62.5 | 30.5 | 35.6 | 44.4 |
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| GPT-4o | 9.3 | 76.4 | 45.8 | 36.8 | 43.3 | 43.3 |
## Usage
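The released checkpoints are standard causal LMs on the Hugging Face Hub, so a plain `transformers` generation call should work. Below is a minimal sketch using the DPO-Zero checkpoint; the prompt template is an assumption on our part, so check the [recipe](https://github.com/RLHFlow/Online-DPO-R1) for the exact format used in training and evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Qwen2.5-7B-DPO-Zero"  # any checkpoint above loads the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumed prompt format; see the recipe for the template used at training time.
prompt = ("What is the sum of the first 100 positive integers? "
          "Please reason step by step, and put your final answer within \\boxed{}.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```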
## Citation