Tried my hand at simplifying the derivations of Direct Preference Optimization.
I cover how one can reformulate RLHF into DPO. The idea of implicit reward modeling is chef's kiss.
Blog: https://huggingface.co/blog/ariG23498/rlhf-to-dpo