
DPO vs SimPO: When to Use Which

A detailed comparison of two preference optimization algorithms and the tradeoffs between them

April 2026 · 10 min read

Overview

DPO and SimPO are both offline preference optimization algorithms that replace the reward model and reinforcement learning stages of RLHF with a single loss function. But they make fundamentally different design choices: DPO uses a reference model for KL regularization and operates on raw log probabilities; SimPO drops the reference model entirely, normalizes by sequence length, and adds a target reward margin. These differences produce different failure modes, different strengths on different benchmarks, and different practical tradeoffs around memory, stability, and hyperparameter sensitivity. This post walks through the math behind each, identifies where each one breaks down, and provides a decision framework for choosing between them based on your specific constraints.

The question that actually matters

DPO and SimPO are both offline preference optimization algorithms that skip the explicit reward model used in RLHF. They both take preference pairs and directly fine tune a language model. At a surface level they look interchangeable. But their design choices lead to meaningfully different behavior, and choosing between them requires understanding where each breaks down.

The short version: SimPO is simpler, faster, and often scores higher on chat benchmarks. DPO is more theoretically grounded, more stable under distribution shift, and better at preserving capabilities you do not want to lose. The longer version requires looking at the math.

How DPO works

DPO derives its loss function by reparameterizing the RLHF objective. Instead of training a reward model and then running reinforcement learning, it extracts an implicit reward directly from the policy and reference model. The key equation:

L_DPO(θ) = −𝔼[ log σ( β · log( πθ(yw|x) / πref(yw|x) ) − β · log( πθ(yl|x) / πref(yl|x) ) ) ]

implicit reward: r(x, y) = β · log( πθ(y|x) / πref(y|x) )

The implicit reward is the log probability ratio between the current policy and the reference model, scaled by β. This ratio acts as a KL divergence constraint: it penalizes the model for drifting too far from the reference. The reference model is typically the SFT checkpoint you started from.

Two things to notice. First, DPO uses raw, summed log probabilities, not length normalized ones. A 200 token response contributes four times as many per-token terms to the implicit reward as a 50 token response, so both the reward and its gradient scale with sequence length. This creates a systematic length bias. Second, the reference model must be loaded in memory alongside the policy model during training, roughly doubling the GPU memory requirement.
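To make the loss concrete, here is a minimal sketch in PyTorch. It assumes you have already computed summed per-token log probabilities for each response under both the policy and the reference model; all names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss. Inputs are tensors of summed (not averaged)
    per-token log probabilities for a batch of responses."""
    # Implicit rewards: beta-scaled log ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```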

How SimPO works

SimPO modifies DPO in two specific ways. First, it replaces the log ratio reward with the average log probability of the response, eliminating the reference model entirely. Second, it adds a target reward margin γ to the Bradley Terry objective, encouraging a minimum gap between winning and losing responses.

L_SimPO(θ) = −𝔼[ log σ( (β/|yw|) · log πθ(yw|x) − (β/|yl|) · log πθ(yl|x) − γ ) ]

implicit reward: r(x, y) = (β/|y|) · log πθ(y|x)

The |y| denominator is what makes this length normalized: a long response and a short response are now on equal footing. The γ term acts like a margin in a support vector machine, pushing the model to maintain at least γ worth of separation between winning and losing rewards even after the classification is already correct.

Removing the reference model has practical consequences. Training uses roughly half the GPU memory. There is no need to keep the SFT checkpoint around. And computation per step drops significantly because you skip an entire forward pass through the reference model.
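A matching sketch of the SimPO loss, under the same assumption of precomputed summed log probabilities. Note that no reference model appears anywhere; the β and γ defaults here are illustrative values from the ranges discussed later in this post.

```python
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.5, gamma=1.4):
    """Minimal SimPO loss. Dividing summed log probs by token counts
    yields the average per-token log probability."""
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    # Target reward margin: the winner must beat the loser by at least gamma.
    return -F.logsigmoid(chosen_rewards - rejected_rewards - gamma).mean()
```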

Side by side: what each algorithm does differently

Reward formulation
DPO: β · log( πθ(y|x) / πref(y|x) )
SimPO: (β/|y|) · log πθ(y|x)

Reference model
DPO: required, loaded in memory during training
SimPO: not needed, reference free

Length handling
DPO: implicit via the reference ratio only
SimPO: explicit length normalization

Drift protection
DPO: KL constraint via the reference model
SimPO: none; relies on a low learning rate

The four key design differences between DPO and SimPO.

The unifying view

Recent theoretical work has shown that SimPO is not as different from DPO as it first appears. Specifically, SimPO is equivalent to DPO with a uniform reference model. If you replace πref in the DPO loss with a uniform distribution over the vocabulary, the reference log probability becomes the same constant, −log |V|, for every token, so the log ratio reduces to the policy's own log probability plus a constant that depends only on length; once you length normalize, that constant is identical for both responses and cancels out of the margin. Add the target margin γ, and you recover the SimPO objective exactly.

This has an important implication. The reference model in DPO is not just a regularizer. It provides adaptive, per example margins: for prompts where the reference model is confident, DPO demands a larger shift; for prompts where the reference model is uncertain, DPO is more permissive. SimPO replaces all of this with a single constant γ. That is simpler, but it throws away information about which examples need more or less correction.

SimPO ≈ DPO with πref = Uniform, plus length normalization and a constant margin γ

α-DPO framework: α = 0 recovers SimPO, α = 1 recovers DPO
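A toy numeric check of this equivalence, using an assumed random policy over a small vocabulary:

```python
import torch

torch.manual_seed(0)
vocab = 32

def avg_logp(seq_len):
    """Average per-token log prob of a random response under a toy policy."""
    logps = torch.log_softmax(torch.randn(seq_len, vocab), dim=-1)
    tokens = torch.randint(vocab, (seq_len,))
    return logps[torch.arange(seq_len), tokens].mean()

w_logp, l_logp = avg_logp(40), avg_logp(25)  # winner and loser, unequal lengths
log_uniform = -torch.log(torch.tensor(float(vocab)))  # log(1/|V|), same for every token

# Length-normalized DPO margin with a uniform reference...
dpo_uniform_margin = (w_logp - log_uniform) - (l_logp - log_uniform)
# ...equals the SimPO margin: the reference constant cancels exactly.
simpo_margin = w_logp - l_logp
assert torch.allclose(dpo_uniform_margin, simpo_margin)
```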

Likelihood displacement: DPO's central failure mode

The most studied pathology of DPO is likelihood displacement. During training, the probability of the preferred response often decreases even though the loss is being minimized. The model learns to suppress the dispreferred response faster than it learns to promote the preferred one, and the probability mass displaced from both responses leaks to other, sometimes harmful, completions.

This can be catastrophic. Training a model to prefer "No" over "Never" can sharply increase the probability of "Yes". Training a model to refuse unsafe prompts can actually reduce the refusal rate by displacing probability mass from refusal responses to compliant ones. The phenomenon occurs because DPO's gradient is dominated by the negative term on rejected responses, especially early in training when the model has not yet learned to sharply distinguish the two.

SimPO is not immune to this but exhibits it less severely in practice, partly because length normalization prevents long rejected sequences from dominating the gradient, and partly because the absence of the reference model ratio means the gradient landscape is smoother.
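Whichever algorithm you pick, it is worth watching for displacement directly during training. A hedged sketch of the kind of diagnostic you might log each step (all names are illustrative):

```python
def displacement_check(chosen_logps, rejected_logps, initial_chosen_logps):
    """Illustrative diagnostic: the preference loss can decrease while the
    chosen log prob also decreases, as long as the rejected log prob falls
    faster. Flags batches where preferred responses are losing mass."""
    margin = (chosen_logps - rejected_logps).mean()
    chosen_drift = (chosen_logps - initial_chosen_logps).mean()
    if chosen_drift < 0:
        print(f"likelihood displacement: margin={margin:.3f}, "
              f"chosen logp drift={chosen_drift:.3f}")
```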

Length bias: why DPO models get verbose

Without explicit length normalization, DPO's implicit reward is the log probability ratio summed across all tokens. A 300 token response contributes three times as many per-token terms as a 100 token response, so its reward and gradient scale with length. The model learns that longer responses receive larger gradient updates, and over time it drifts toward verbosity.

The reference model partially counteracts this because both the policy and the reference see the same number of tokens. But empirically, the Spearman correlation between average log likelihood and response length is much stronger for DPO than for SimPO. SimPO's explicit division by |y| makes the implicit reward scale invariant to length, and on benchmarks like AlpacaEval 2 that penalize unnecessary verbosity, SimPO consistently produces shorter and more focused responses.
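A back-of-the-envelope illustration of the scaling, with invented numbers:

```python
beta = 0.1
per_token_log_ratio = 0.1  # assumed log(pi_theta / pi_ref), identical per token

# DPO's implicit reward sums over tokens, so it grows linearly with length.
dpo_reward_100 = beta * per_token_log_ratio * 100   # 1.0
dpo_reward_300 = beta * per_token_log_ratio * 300   # 3.0

# SimPO divides by |y|, so the same per-token quality scores the same.
simpo_reward_100 = per_token_log_ratio * 100 / 100  # 0.1
simpo_reward_300 = per_token_log_ratio * 300 / 300  # 0.1
```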

If your use case rewards conciseness, such as chat assistants, customer support bots, or coding copilots, SimPO has a structural advantage. If your use case genuinely benefits from long form output, such as report generation or creative writing, DPO's slight length preference may actually be desirable.

The math problem

Both algorithms suffer performance drops on mathematical reasoning benchmarks like GSM8K after preference optimization. But SimPO can be worse here, and the reason is illuminating.

In math, changing a single token can flip a correct answer to an incorrect one. The preference signal between "2 + 2 = 4" and "2 + 2 = 5" is carried by exactly one token, but the loss is averaged over the entire sequence. Length normalization dilutes this signal. DPO, by operating on raw log probabilities, gives that critical token a proportionally larger influence on the gradient when the sequence is long.
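A toy calculation makes the dilution concrete (numbers are invented):

```python
# 50-token solutions that differ only in the final answer token.
shared = -0.2 * 49             # log prob of the 49 identical tokens
chosen = shared + (-0.1)       # correct answer token, high probability
rejected = shared + (-3.0)     # wrong answer token, low probability

raw_margin = chosen - rejected           # 2.9: full signal, DPO-style
norm_margin = (chosen - rejected) / 50   # 0.058: diluted, SimPO-style
```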

Additionally, without a reference model, SimPO has no anchor preventing it from forgetting capabilities learned during SFT. The DPO reference model acts as an implicit regularizer that says "do not change too much from what you already knew." When that constraint is absent, the model can drift away from reasoning chains it learned during supervised fine tuning. Adding an SFT loss term to SimPO partially mitigates this but at the cost of degrading chat performance.

Hyperparameter sensitivity

Both algorithms require careful tuning, but the sensitivity profiles are different.

DPO has one main hyperparameter: β, which controls the strength of the KL penalty. Typical values range from 0.1 to 0.5. Too low and the model drifts dangerously from the reference; too high and it barely learns anything. The learning rate is less sensitive because the reference model provides a stabilizing force.

SimPO has three interacting hyperparameters: β, γ, and the learning rate. Because there is no reference model to absorb mistakes, the learning rate becomes the primary stability knob, and it must be kept very low. The SimPO authors recommend grid searching over 3e-7 to 1e-6 and note that values as high as 1e-5 can cause the model to produce incoherent or fully repetitive output. The β value in SimPO is also much larger than in DPO, typically 2.0 to 10.0, because the length normalized reward operates on a different scale than the log ratio reward.

The margin γ adds another dimension. Too small and the model does not learn a meaningful separation between winning and losing responses. Too large and the loss pushes the model into extreme territory, amplifying the risk of reward hacking. The sweet spot depends on the dataset and must be found empirically.
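Given these interactions, a small grid search is hard to avoid. A sketch of what that could look like; `train_and_eval` is a placeholder for your own training and evaluation pipeline, and the ranges follow the discussion above:

```python
import itertools

def train_and_eval(lr, beta, gamma):
    """Placeholder: run one epoch of SimPO with these settings and return
    a validation score, e.g. win rate on a held-out preference set."""
    return 0.0  # replace with your pipeline

grid = {
    "lr":    [3e-7, 5e-7, 1e-6],  # SimPO authors' recommended range
    "beta":  [2.0, 2.5, 10.0],
    "gamma": [0.5, 1.0, 1.6],     # illustrative; tune per dataset
}

scores = {
    cfg: train_and_eval(*cfg)
    for cfg in itertools.product(grid["lr"], grid["beta"], grid["gamma"])
}
best_lr, best_beta, best_gamma = max(scores, key=scores.get)
```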

When to use which

Prefer DPO when:

Math or reasoning is critical

Training data is noisy or off policy

Preserving base capabilities matters

Safety alignment with refusal data

Long form output is desirable

You need stable iterative training

Prefer SimPO when:

Chat or instruction following quality is the priority

GPU memory is constrained

Concise responses are preferred

High quality on policy data is available

Training speed matters

You want a broad alignment pass, not fine tuning

Both: run one epoch max, evaluate before deploying, tune the learning rate first.

Decision guide based on your constraints and use case.

Practical tradeoffs at a glance

DPO strengths

KL regularization prevents catastrophic drift

Better at preserving reasoning and math

More robust to noisy preference data

Theoretically well understood

Safer for iterative and online training loops

SimPO strengths

No reference model needed: 2x memory savings

Explicit length normalization prevents verbosity

Consistently higher chat benchmark scores

Faster training: skip reference forward pass

Simpler pipeline with fewer moving parts

My recommendation

There is no universal answer, but here is how I think about it.

If you are doing general chat alignment on an instruction tuned model with clean, on policy preference data and your primary metric is something like AlpacaEval or Arena Hard, start with SimPO. It will likely give you better scores with less compute, and the length normalization is genuinely valuable for producing focused responses.

If you are doing safety alignment, domain specific fine tuning where you need to preserve existing capabilities, or working with noisy or off policy data, start with DPO. The reference model acts as a safety net that prevents the kind of catastrophic likelihood displacement that can silently undo your safety training. The alignment tax on reasoning is also lower with DPO in most settings.

If you have the budget, the emerging best practice is to use both sequentially. Run SimPO first as a broad alignment pass to get a generally well behaved model, then follow up with DPO on a curated dataset targeting specific behaviors you want to reinforce or correct. This layered approach lets you benefit from SimPO's efficiency while retaining DPO's precision for the final polish.

SimPO for efficiency and chat quality. If GPU memory is tight and your preference data is clean, SimPO gives you more alignment per FLOP.

DPO for safety and stability. The reference model is not overhead. It is a guardrail. When the cost of catastrophic drift is high, pay for the reference model.

Learning rate is the most critical hyperparameter for both. For SimPO, keep it in the 3e-7 to 1e-6 range. For DPO, you have slightly more headroom but rarely go above 5e-6.

One epoch, then evaluate. Overfitting is the dominant failure mode for both algorithms. Train less than you think you need.

Neither solves math. If reasoning is critical, add an SFT regularization term or use a two stage approach with Step DPO.

References

  1. Rafailov, R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
  2. Meng, Y., Xia, M., and Chen, D. SimPO: Simple Preference Optimization with a Reference Free Reward. NeurIPS 2024.
  3. Razin, N. et al. Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization. ICLR 2025.
  4. Huang, H., Zhan, W., Xie, T., and Lee, J. α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs. 2025.
  5. Yan, J. et al. Reveal the Mystery of DPO: the 3D Properties. ICLR 2025.
  6. Zhou, H. et al. RainbowPO. ICLR 2025.
  7. Failure Modes of Maximum Entropy RLHF. 2025.
