
Your DPO Pairs Are Probably Wrong

What recent research says about constructing preference data

April 2026 · 8 min read · 7 papers reviewed

Quick refresher on DPO

DPO replaces the reward model and reinforcement learning stages of RLHF with a single classification-style loss. Given a prompt x, a preferred response yw, and a dispreferred response yl, the training objective is:

LDPO(θ) = −𝔼[ log σ( β · ( log( πθ(yw|x) / πref(yw|x) ) − log( πθ(yl|x) / πref(yl|x) ) ) ) ]

The model increases the relative log probability of the preferred response while decreasing that of the dispreferred one, regularized toward a reference policy by an implicit KL penalty whose strength is controlled by β. What the loss does not encode is any notion of how much better the chosen response is, or whether the rejected response is catastrophically bad versus merely slightly worse. That silence turns out to matter enormously.
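Written as code, the loss is compact. Here is a minimal PyTorch sketch, assuming per-response log probabilities have already been summed over tokens (the function name and tensor layout are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid of the scaled difference between the two log-ratios
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```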

The best vs worst trap

The most counterintuitive finding involves what happens when you increase the number of candidate responses per prompt and then select your preference pairs from them. The conventional approach picks the highest-reward response as chosen and the lowest as rejected. You would expect more candidates to mean sharper signal and therefore better training data. But the opposite happens: under this strategy, performance actually declines as the sample size increases.

The intuition is that the absolute worst sample among, say, 128 candidates is often a degenerate outlier: garbled text, refusals, off-topic hallucinations. These extreme negatives do not teach the model meaningful distinctions. They are so far from the model's natural distribution that the gradient signal pushes it in unhelpful directions.

To understand this properly, consider modeling the reward distribution of sampled responses as approximately normal with mean μ and standard deviation σ. The reward space can then be carved into representative positions (the minimum, μ − 2σ, the mean, the maximum, and points in between), and every pairwise combination of chosen and rejected positions tested systematically.

The key result: selecting the rejected response at approximately μ − 2σ consistently outperforms the minimum reward strategy, and this advantage grows as the sample pool increases. The chosen response should still be top quality, but the rejected response benefits from being convincingly bad rather than absurdly broken.
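As a sketch, a selection rule following this finding might look like the code below. The function name and the closest-to-target heuristic are illustrative, not the paper's exact procedure:

```python
import numpy as np

def select_pair(responses, rewards):
    """Chosen: the highest-reward response. Rejected: the response whose
    reward is closest to mu - 2*sigma, i.e. convincingly bad without
    being a degenerate outlier at the floor of the distribution.
    """
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    chosen_idx = int(rewards.argmax())
    rejected_idx = int(np.abs(rewards - (mu - 2.0 * sigma)).argmin())
    if rejected_idx == chosen_idx:  # guard for tiny or flat pools
        rejected_idx = int(rewards.argmin())
    return responses[chosen_idx], responses[rejected_idx]
```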

[Figure: where to pick your rejected response. The reward distribution across N sampled responses per prompt, with the minimum marked as outlier noise, μ − 2σ as the sweet spot, and the maximum-reward response as the chosen. The conventional pair spans min to max; the optimal pair spans μ − 2σ to max.]

The μ − 2σ strategy improves as the sample pool grows. The min-reward strategy degrades.

Not all pairs deserve to stay

If the quality of individual pairs matters this much, the natural follow-up is whether you should even train on all of them. The answer, backed by both theory and experiments, is a clear no.

The core failure mode is called parameter shrinkage. When noisy preference data enters DPO training (pairs where the reward model's ranking is unreliable or outright flipped), it systematically pushes the learned parameters toward zero.

In the Bradley-Terry preference model, the probability that yw is preferred over yl under reward function r is:

P(yw ≻ yl | x) = σ( r(yw) − r(yl) )

When the true reward gap is small, even modest noise in the reward model can flip the preference. These flipped pairs act as contradictory supervision, shrinking the effective magnitude of the learned policy parameters.
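A quick simulation makes the flip rate vivid. Assuming the reward model adds Gaussian noise to each response's true reward (the noise scale of 0.5 is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
noise = 0.5  # assumed reward-model noise scale, purely illustrative
for true_gap in (0.1, 0.5, 2.0):
    # noisy reward estimates: y_w has true reward = gap, y_l has 0
    est_w = true_gap + rng.normal(0.0, noise, n)
    est_l = rng.normal(0.0, noise, n)
    flip_rate = (est_w < est_l).mean()
    print(f"true gap {true_gap:.1f}: ranking flipped in {flip_rate:.1%} of pairs")
```

With a gap of 0.1 the ranking flips nearly half the time; with a gap of 2.0 it almost never does.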

The solution is a margin maximization principle for data curation. Define the reward margin for a pair as:

m(yw, yl) = rext(yw) − rext(yl)   (the external reward margin)

But a single reward model's margins are themselves noisy. The implicit reward signal from the model's own log-probability ratios provides an independent estimate:

rθ(x, y) = β · log( πθ(y|x) / πref(y|x) )   (the implicit reward)

These two sources of reward information, external and implicit, turn out to be only weakly correlated. A Bayesian aggregation that combines them into a single preference probability produces a much more robust selection criterion than either alone.
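A sketch of margin-based selection along these lines follows. The simple averaging of the two Bradley-Terry probabilities is a stand-in for the paper's Bayesian aggregation, and the pair dictionary fields are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_by_aggregated_margin(pairs, keep_frac=0.10, beta=0.1):
    """Keep only the pairs whose combined preference probability is highest.

    Each pair dict is assumed to carry reward-model scores and summed
    log probabilities under the policy and the reference model.
    """
    probs = []
    for p in pairs:
        m_ext = p["rm_score_w"] - p["rm_score_l"]          # external margin
        m_imp = beta * ((p["logp_w"] - p["ref_logp_w"])    # implicit margin
                        - (p["logp_l"] - p["ref_logp_l"]))
        probs.append(0.5 * (sigmoid(m_ext) + sigmoid(m_imp)))
    order = np.argsort(probs)[::-1]                        # most confident first
    k = max(1, int(keep_frac * len(pairs)))
    return [pairs[i] for i in order[:k]]
```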

The practical result: using just 10% of the UltraFeedback dataset, this approach achieved 3 to 8 percentage point improvements on AlpacaEval 2 across Llama, Mistral, and Qwen models compared to training on the full dataset.

Binary labels throw away useful signal

Standard DPO treats preference data as binary. One response is better, the other is worse. But most real preference datasets carry richer information. If you used an AI judge or reward model to construct your pairs, you almost certainly have numerical scores, not just ordinal rankings. Discarding that information is costly.

Two specific problems arise from ignoring reward magnitudes. First, when the chosen and rejected responses are close in quality, DPO still maximizes the gap between them. This causes overfitting and unnecessary unlearning, where the model suppresses a rejected response that was actually quite good. Second, when the chosen response is itself low quality, the model is pushed to imitate mediocre behavior, failing to extrapolate toward genuinely strong outputs.

The fix is reward-conditioned training. Relabel the dataset so that each response is conditioned on a quality tag derived from its reward score, then construct augmented preference pairs that teach the model to distinguish the full spectrum of quality rather than just a binary split.
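A minimal sketch of the relabeling step, assuming a scalar reward score per response; the tag format and bin count are illustrative, not the paper's exact conditioning scheme:

```python
def relabel_with_quality_tags(prompt, responses, rewards, n_bins=5):
    """Condition each response on a discrete quality tag derived from
    its reward score, so training sees the full quality spectrum.
    """
    lo, hi = min(rewards), max(rewards)
    examples = []
    for resp, r in zip(responses, rewards):
        # map the reward into one of n_bins quality levels, highest = best
        level = min(n_bins - 1, int(n_bins * (r - lo) / (hi - lo + 1e-8)))
        tagged = f"<quality:{level + 1}/{n_bins}> {prompt}"
        examples.append({"prompt": tagged, "response": resp})
    return examples
```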

Making the gap a first-class signal

In standard DPO, you know that response A was preferred over response B, but not by how much. A pair where both responses are excellent gets the same treatment as a pair where one is great and the other is terrible.

If DPO reduces the likelihood of a dispreferred response, that might be appropriate when the response was genuinely bad. But if both responses were high quality and the preference was nearly a coin flip, suppressing the loser is actively harmful. You are unlearning good behavior.

Introducing a rating gap g as an additional signal, the DPO objective can be modified to weight the loss by how much better the chosen response is:

LRDPO(θ) = LDPO(θ) + λ · f(g, θ)   (the rating-gap weighted objective)
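As an illustrative sketch of the general idea, the per-pair DPO loss can simply be scaled by a normalized rating gap. Multiplicative weighting is one plausible way to let g enter the objective, not the paper's exact f:

```python
import torch
import torch.nn.functional as F

def gap_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          rating_gap, beta=0.1, lam=1.0):
    """Scale the per-pair DPO loss by the rating gap, so near-ties stop
    forcing the model to unlearn good rejected responses.
    """
    logratio_gap = ((policy_chosen_logps - ref_chosen_logps)
                    - (policy_rejected_logps - ref_rejected_logps))
    per_pair = -F.logsigmoid(beta * logratio_gap)
    weight = torch.clamp(lam * rating_gap, 0.0, 1.0)  # gap assumed in [0, 1]
    return (weight * per_pair).mean()
```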

Two families of algorithms emerge. The first modifies the RLHF objective to maximize a linear combination of ranking and rating information. The second uses a maximum likelihood principle. These algorithms achieve faster statistical convergence rates than vanilla DPO when the rating gap information is accurate, and degrade gracefully when the gaps are noisy.

When your data is broken

All of the above assumes your preference data is at least approximately correct. In any large-scale annotation effort, some fraction of annotations will be wrong. The problem can be modeled as a preference matrix M composed of a true low-rank component L and a sparse adversarial perturbation S:

M = L + S   (low-rank signal plus sparse corruption)

Matrix completion handles missing entries, and robust principal component analysis separates the low-rank signal from the sparse corruption. The theoretical guarantee: provable recovery of a near-optimal ranking even when up to O(n) pairwise comparisons per item have been adversarially corrupted.
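For reference, here is a compact principal component pursuit sketch via the standard inexact augmented Lagrangian method. The numerical defaults follow common conventions for this algorithm; it is a generic RPCA solver, not the paper's implementation:

```python
import numpy as np

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Decompose M into low-rank L plus sparse S by alternating
    singular-value thresholding and elementwise soft thresholding.
    """
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = (m * n) / (4.0 * np.abs(M).sum()) if mu is None else mu
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # low-rank update: shrink singular values of the residual
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse update: soft-threshold what the low-rank part misses
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y += mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S
```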

[Figure: the pipeline stages. 1. Sample N responses per prompt (on-policy generation). 2. Score with a reward model (external plus implicit rewards). 3. Clean corrupted pairs (RPCA denoising). 4. Select by margin (Bayesian aggregation; 10% of data can suffice). 5. Construct preference pairs (chosen at max, rejected at μ − 2σ). 6. Rating-aware DPO training (keep scores, not just rankings; weight loss by rating gap).]

The preference data construction pipeline: sample, score, clean, select, train.
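Putting the stages together, an end-to-end sketch might look like the following. Here policy.generate, policy.logprob, and reward_model.score are assumed interfaces, and select_pair and select_by_aggregated_margin are the illustrative helpers sketched earlier in this post:

```python
def build_preference_dataset(prompts, policy, ref, reward_model,
                             n_samples=64, beta=0.1):
    """End-to-end sketch: sample, score, pair, then filter by margin."""
    pairs = []
    for prompt in prompts:
        # sample N on-policy responses and score them externally
        responses = [policy.generate(prompt) for _ in range(n_samples)]
        rewards = [reward_model.score(prompt, r) for r in responses]
        # chosen at max reward, rejected near mu - 2*sigma
        chosen, rejected = select_pair(responses, rewards)
        pairs.append({
            "prompt": prompt, "chosen": chosen, "rejected": rejected,
            "rm_score_w": max(rewards),
            "rm_score_l": rewards[responses.index(rejected)],
            "logp_w": policy.logprob(prompt, chosen),
            "logp_l": policy.logprob(prompt, rejected),
            "ref_logp_w": ref.logprob(prompt, chosen),
            "ref_logp_l": ref.logprob(prompt, rejected),
        })
    # keep only the most confident pairs before rating-aware training
    return select_by_aggregated_margin(pairs, keep_frac=0.10, beta=beta)
```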

Additional insights

Several other recent studies reinforce these themes. The quality of the chosen response turns out to be the dominant factor in DPO performance, while the quality of the rejected response has relatively limited impact. Theoretical analysis of the online DPO setting even shows it effectively reduces to supervised fine-tuning on the chosen responses. This complements the sweet spot finding: the rejected response's role is mainly to regularize, not to teach.

There is also evidence that ambiguous, small-margin pairs should not simply be discarded. They destabilize training when used with preference-based losses, but they still contain useful signal when routed to a supervised fine-tuning objective instead. A hybrid approach that trains easy pairs with the DPO loss and routes hard pairs to SFT consistently outperformed standard DPO, along the lines of the sketch below.
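A minimal sketch of such a routing rule; the threshold value and the hard assignment are illustrative assumptions, not the paper's exact method:

```python
import torch.nn.functional as F

def hybrid_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                reward_margin, margin_threshold=0.5, beta=0.1):
    """Route each pair by its reward margin: confident pairs get the DPO
    loss, ambiguous small-margin pairs get SFT on the chosen response.
    """
    logratio_gap = ((policy_chosen_logps - ref_chosen_logps)
                    - (policy_rejected_logps - ref_rejected_logps))
    dpo = -F.logsigmoid(beta * logratio_gap)
    sft = -policy_chosen_logps            # NLL of the chosen response
    easy = (reward_margin >= margin_threshold).float()
    return (easy * dpo + (1.0 - easy) * sft).mean()
```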

Finally, truncated influence functions reveal that data quality is inherently model dependent. A pair that helps one model may actively harm another, suggesting that generic data curation is insufficient.

Takeaways

Do not chase extremes. Rejected responses should be clearly worse, but not degenerate outliers. Aim for the μ − 2σ region, not the absolute floor.

Select, do not accumulate. Margin-based selection can dramatically outperform training on the full dataset.

Preserve reward magnitudes. If your pipeline produces scores, carry them through. Binary labels discard useful signal.

Clean your data. Even small corruption rates compound. The DPO loss provides no intrinsic robustness against flipped labels.

Quality of chosen matters most. Invest your annotation budget disproportionately in excellent chosen responses.

One size does not fit all. Ideal data for a 3B model is not ideal data for a 7B model.

References

  1. Xiao, Y. et al. Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization. ACL 2025.
  2. Deng, X. et al. Less is More: Improving LLM Alignment via Preference Data Selection. 2025.
  3. Zhang, S. et al. Reward-Augmented Data Enhances Direct Preference Alignment of LLMs. ICML 2025.
  4. Viano, L. et al. Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains. 2026.
  5. Nguyen, S.T. et al. CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models. DaSH 2024.
  6. What Matters in Data for DPO? NeurIPS 2025.
  7. Small Margin Preferences Still Matter If You Train Them Right. 2026.
