MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Abstract
Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
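To make the Dynamic Reward Scaling idea more concrete, below is a minimal PyTorch sketch of how a per-sample reward signal could re-weight a DPO-style preference loss. This is an illustrative sketch, not the paper's exact formulation: the function name, the weighting form `1 + k * sigmoid(margin)`, and the hyperparameters `beta` and `k` are all assumptions.

```python
# Hypothetical sketch of Dynamic Reward Scaling on top of a DPO-style loss.
# All names and the specific weighting function below are illustrative
# assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def dynamic_reward_scaled_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen) per sample
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected) per sample
    ref_chosen_logps: torch.Tensor,       # reference-model log-probs (chosen)
    ref_rejected_logps: torch.Tensor,     # reference-model log-probs (rejected)
    reward_margin: torch.Tensor,          # reward-model score gap (chosen - rejected)
    beta: float = 0.1,
    k: float = 1.0,
) -> torch.Tensor:
    """Per-sample DPO loss, re-weighted by the reward-model margin so that
    high-confidence (high-margin) comparison pairs contribute more."""
    # Standard DPO logits: beta * gap between policy and reference log-ratios.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    per_sample_loss = -F.logsigmoid(logits)

    # Dynamic Reward Scaling (assumed form): map the reward margin to a
    # positive weight and detach it, so the reward signal only rescales
    # each sample's contribution to the gradient.
    weights = 1.0 + k * torch.sigmoid(reward_margin).detach()
    return (weights * per_sample_loss).mean()
```

In this sketch the reward margin would come from the critique-based reward model's scores for the chosen and rejected responses; the detached weight simply emphasizes comparison pairs the reward model is confident about.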
Community
We set out to verify, and successfully confirmed, that aligning multimodal large language models (MLLMs) with human preferences can comprehensively enhance their capabilities, extending beyond specific scenarios such as hallucination and dialogue.
Will more existing multimodal reward models, such as InternLM-XComposer2.5-Reward, be included in the paper? They are designed to use RMs and reinforcement learning algorithms to improve MLLMs, and we are wondering how these models would perform on your data.
Thanks for pointing out this existing work! We will include this important baseline soon.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multimodal Preference Data Synthetic Alignment with Reward Model (2024)
- Reviving The Classics: Active Reward Modeling in Large Language Model Alignment (2025)
- Online Preference Alignment for Language Models via Count-based Exploration (2025)
- AlphaPO -- Reward shape matters for LLM alignment (2025)
- RAG-Reward: Optimizing RAG with Reward Modeling and RLHF (2025)
- Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs (2025)
- Aligning LLMs with Domain Invariant Reward Models (2025)