Update README.md
README.md
CHANGED
@@ -25,7 +25,6 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R
 - Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct preference pairs.
 Then, we optimize the policy by minimizing the DPO loss and enter the next iteration.
 Online iterative DPO can effectively mitigate the issue of distribution shift and the limited coverage of offline data.
-- Before the DPO training, we add an SFT warm-up procedure for the base model, which is fine-tuned on [RLHFlow/qwq_gen_sft_15k](https://huggingface.co/datasets/RLHFlow/qwq_gen_sft_15k).

 More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
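For reference, the DPO loss mentioned in the diff is, in its standard form (Rafailov et al., 2023; the repository's recipe may add its own variations), minimized over the preference pairs $(y_w, y_l)$ produced by the rule-based ranking:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the higher- and lower-ranked responses for prompt $x$, $\pi_{\mathrm{ref}}$ is the policy from the previous iteration, $\sigma$ is the sigmoid, and $\beta$ controls the strength of the implicit KL regularization.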
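To make the iterative loop in the diff concrete, here is a minimal sketch. The injected callables `sample_fn`, `reward_fn`, and `dpo_train_fn` stand in for the repository's actual generation, rule-based reward, and DPO training code; their names and the one-pair-per-prompt construction are illustrative assumptions, not the project's implementation.

```python
# Sketch of the online iterative DPO loop described above.
# sample_fn / reward_fn / dpo_train_fn are placeholders (assumptions),
# not the repository's actual API.

def build_preference_pair(prompt, responses, rewards):
    """Pair the highest- and lowest-scoring responses for one prompt."""
    ranked = sorted(zip(responses, rewards), key=lambda pair: pair[1], reverse=True)
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}

def online_iterative_dpo(policy, prompts, sample_fn, reward_fn, dpo_train_fn,
                         num_iterations=3, samples_per_prompt=8):
    for _ in range(num_iterations):
        pairs = []
        for prompt in prompts:
            # 1. Sample multiple responses from the last trained policy.
            responses = sample_fn(policy, prompt, samples_per_prompt)
            # 2. Score them with the rule-based reward (e.g., answer correctness).
            rewards = [reward_fn(prompt, r) for r in responses]
            if max(rewards) == min(rewards):
                continue  # all responses tie: no preference signal for this prompt
            # 3. Construct a preference pair from the ranking.
            pairs.append(build_preference_pair(prompt, responses, rewards))
        # 4. Minimize the DPO loss on the freshly collected pairs, using the
        #    current policy as the reference model, then start the next iteration.
        policy = dpo_train_fn(policy, pairs, ref_policy=policy)
    return policy
```

Pairing best-vs-worst per prompt is only one way to build preference pairs; a recipe may instead keep several pairs per prompt or filter by reward margin.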