Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
Abstract
MADPO, a margin-adaptive method, enhances preference alignment in large language models by providing instance-level adaptive weighting to the DPO loss, improving performance across datasets.
Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods attempt to address this. IPO counters general overfitting, but its uniform regularization can be overly conservative. The more targeted β-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative β values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.
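For intuition, the sketch below shows one way such an instance-level re-weighting of the DPO loss could look in PyTorch. The function name, the sigmoid weighting form, and the `weight_scale` parameter are illustrative assumptions for demonstration, not the paper's exact formulation; the only ingredients taken from the abstract are the standard DPO log-ratio term, a reward-model-estimated margin per pair, and a per-sample weight that grows for hard pairs and shrinks for easy ones.

```python
# Illustrative sketch only: the precise MADPO weighting function is not
# reproduced here. This assumes a simple monotonically decreasing weight
# in the estimated reward margin (a hypothetical choice).
import torch
import torch.nn.functional as F

def madpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               reward_chosen, reward_rejected,
               beta=0.1, weight_scale=1.0):
    """Per-sample margin-adaptive DPO loss (sketch).

    policy_*_logps / ref_*_logps: summed log-probs of the chosen / rejected
    responses under the policy and the frozen reference model.
    reward_*: scores from a separately trained reward model, used only to
    estimate the preference margin of each pair.
    """
    # Standard DPO implicit reward difference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios

    # Estimated preference margin from the reward model (detached: it
    # modulates the loss but receives no gradient).
    margin = (reward_chosen - reward_rejected).detach()

    # Hypothetical instance-level weight in (0, 2): close to 2 for hard
    # pairs (small or negative margin), close to 0 for easy pairs.
    weights = 2.0 * torch.sigmoid(-weight_scale * margin)

    per_sample_loss = -F.logsigmoid(beta * logits)
    return (weights * per_sample_loss).mean()
```

In this reading, the reward model is trained first on the preference data, its margins are computed once per pair, and training then reduces to ordinary DPO with a fixed per-sample weight, which keeps the added cost small.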
Community
TL;DR The standard DPO objective struggles with mixed-quality data, a problem that β-DPO addresses at the batch level; MADPO provides a more granular solution at the instance level, which leads to consistently better and more robust performance in experiments.