Update README.md
README.md
CHANGED
@@ -25,7 +25,6 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R
 - Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct preference pairs.
 Then, we optimize the policy by minimizing the DPO loss and enter the next iteration.
 Online iterative DPO can effectively mitigate the issue of distribution shift and the limited coverage of offline data.
-- Before the DPO training, we add an SFT warm-up procedure for the base model, which is fine-tuned on [RLHFlow/qwq_gen_sft_15k](https://huggingface.co/datasets/RLHFlow/qwq_gen_sft_15k).

 More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
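For reference, the DPO loss mentioned in the diff is, in its standard form (Rafailov et al., 2023; the repository's recipe may add its own variations), minimized over the preference pairs $(y_w, y_l)$ produced by the rule-based ranking:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the higher- and lower-ranked responses for prompt $x$, $\pi_{\mathrm{ref}}$ is the policy from the previous iteration, $\sigma$ is the sigmoid, and $\beta$ controls the strength of the implicit KL regularization.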
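To make the iterative loop in the diff concrete, here is a minimal sketch. The injected callables `sample_fn`, `reward_fn`, and `dpo_train_fn` stand in for the repository's actual generation, rule-based reward, and DPO training code; their names and the one-pair-per-prompt construction are illustrative assumptions, not the project's implementation.

```python
# Sketch of the online iterative DPO loop described above.
# sample_fn / reward_fn / dpo_train_fn are placeholders (assumptions),
# not the repository's actual API.

def build_preference_pair(prompt, responses, rewards):
    """Pair the highest- and lowest-scoring responses for one prompt."""
    ranked = sorted(zip(responses, rewards), key=lambda pair: pair[1], reverse=True)
    return {"prompt": prompt, "chosen": ranked[0][0], "rejected": ranked[-1][0]}

def online_iterative_dpo(policy, prompts, sample_fn, reward_fn, dpo_train_fn,
                         num_iterations=3, samples_per_prompt=8):
    for _ in range(num_iterations):
        pairs = []
        for prompt in prompts:
            # 1. Sample multiple responses from the last trained policy.
            responses = sample_fn(policy, prompt, samples_per_prompt)
            # 2. Score them with the rule-based reward (e.g., answer correctness).
            rewards = [reward_fn(prompt, r) for r in responses]
            if max(rewards) == min(rewards):
                continue  # all responses tie: no preference signal for this prompt
            # 3. Construct a preference pair from the ranking.
            pairs.append(build_preference_pair(prompt, responses, rewards))
        # 4. Minimize the DPO loss on the freshly collected pairs, using the
        #    current policy as the reference model, then start the next iteration.
        policy = dpo_train_fn(policy, pairs, ref_policy=policy)
    return policy
```

Pairing best-vs-worst per prompt is only one way to build preference pairs; a recipe may instead keep several pairs per prompt or filter by reward margin.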