Collections including paper arxiv:2404.08495

- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 68
- Understanding and Diagnosing Deep Reinforcement Learning
  Paper • 2406.16979 • Published • 9
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
  Paper • 2404.03715 • Published • 61
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
  Paper • 2407.00617 • Published • 7
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models
  Paper • 2406.11839 • Published • 38
- Pandora: Towards General World Model with Natural Language Actions and Video States
  Paper • 2406.09455 • Published • 15
- WPO: Enhancing RLHF with Weighted Preference Optimization
  Paper • 2406.11827 • Published • 15
- In-Context Editing: Learning Knowledge from Self-Induced Distributions
  Paper • 2406.11194 • Published • 15
- ORPO: Monolithic Preference Optimization without Reference Model
  Paper • 2403.07691 • Published • 65
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 41
- Teaching Large Language Models to Reason with Reinforcement Learning
  Paper • 2403.04642 • Published • 48
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 30
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 55
- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 20
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 21
- Linear Transformers are Versatile In-Context Learners
  Paper • 2402.14180 • Published • 7
- Trusted Source Alignment in Large Language Models
  Paper • 2311.06697 • Published • 12
- Diffusion Model Alignment Using Direct Preference Optimization
  Paper • 2311.12908 • Published • 50
- SuperHF: Supervised Iterative Learning from Human Feedback
  Paper • 2310.16763 • Published • 1
- Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
  Paper • 2311.15657 • Published • 2
- Stabilizing RLHF through Advantage Model and Selective Rehearsal
  Paper • 2309.10202 • Published • 11
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
  Paper • 2309.10150 • Published • 25
- Robotic Offline RL from Internet Videos via Value-Function Pre-Training
  Paper • 2309.13041 • Published • 8
- Voyager: An Open-Ended Embodied Agent with Large Language Models
  Paper • 2305.16291 • Published • 9