DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Abstract
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To tackle this challenge, we introduce DPO-Shift to controllably shift the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
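To make the idea concrete, below is a minimal PyTorch sketch of a shifted DPO-style loss. It assumes the shift is applied as a scaling factor `f_lambda` on the rejected log-ratio term, with `f_lambda = 1.0` recovering vanilla DPO; the function name, signature, and exact placement of the factor are illustrative assumptions rather than the repository's implementation, so consult the paper and the code linked above for the precise formulation.

```python
import torch
import torch.nn.functional as F

def dpo_shift_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, f_lambda=0.75):
    """Illustrative sketch of a shifted DPO-style loss (not the official code).

    Standard DPO minimizes -log sigmoid(beta * (chosen_logratio - rejected_logratio)).
    Here the rejected log-ratio is scaled by f_lambda in (0, 1], one way to shift
    the chosen-probability distribution; f_lambda = 1.0 recovers vanilla DPO.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # A smaller f_lambda reduces the pressure to push the rejected probability down,
    # at the cost of a smaller reward margin between chosen and rejected responses.
    logits = beta * (chosen_logratio - f_lambda * rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with per-sequence summed log-probabilities (batch of 2).
policy_chosen = torch.tensor([-12.3, -8.7])
policy_rejected = torch.tensor([-14.1, -9.9])
ref_chosen = torch.tensor([-13.0, -9.0])
ref_rejected = torch.tensor([-13.5, -9.5])
print(dpo_shift_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```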
Community
Our work, DPO-Shift, mitigates the likelihood displacement issue of DPO through a simple approach, yielding a fundamental and controllable trade-off between the chosen probability and reward margin.
The following image provides a brief illustration of our proposed method. The first row represents the SFTed model. The second row corresponds to DPO-Shift, where we observe an increased chosen probability compared to DPO (depicted in the last row).
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- SimulPL: Aligning Human Preferences in Simultaneous Machine Translation (2025)
- REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models (2025)
- Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization (2025)
- Test-time Alignment of Diffusion Models without Reward Over-optimization (2025)
- Aligning LLMs with Domain Invariant Reward Models (2025)
- Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment (2024)
- AlphaPO -- Reward shape matters for LLM alignment (2025)
Models citing this paper: 7
Datasets citing this paper: 0
Spaces citing this paper: 0