arxiv: 2605.23398 · v1 · pith:JGW25NC6new · submitted 2026-05-22 · 💻 cs.IR

TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

Pith reviewed 2026-05-25 03:34 UTC · model grok-4.3

classification 💻 cs.IR
keywords iterative direct preference optimization · model merging · trajectory-aware fusion · LLM alignment · error accumulation · reference model · preference optimization

The pith

TPMM-DPO merges the sequence of policy models along their optimization trajectory with learned weights to form a stable reference that reduces error accumulation in iterative DPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Iterative DPO repeatedly uses the prior policy as the reference, so noise in the preferences and errors in that reference compound across rounds, producing late-stage over-optimization and performance drops. TPMM-DPO reframes the entire sequence of policies as an optimization trajectory and merges them adaptively with learned fusion weights to build a smoother reference model. The paper reports more stable training and higher win rates and reward scores on both in-domain and out-of-domain data. A reader would care because the method keeps the simple DPO training loop intact while addressing the main practical failure mode of repeated iteration. According to the ablations, learnable weights outperform simple averaging at preventing the observed degradation.

Core claim

The central claim is that by treating the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrating them using learned fusion weights, TPMM-DPO constructs a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations.

What carries the argument

Trajectory-aware preference-guided model merging (TPMM), which adaptively integrates the sequence of policy models using learned fusion weights to produce the reference model for the next DPO round.
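The merging step can be pictured with a toy sketch. Everything in it is an assumption made for illustration, since the summary does not spell out the merge rule: checkpoints are plain dictionaries of floats standing in for model state dicts, the merge is a weighted parameter average, and the fusion weights come from a softmax over unconstrained logits. The names `softmax`, `merge_trajectory`, and `trajectory` are hypothetical.

```python
import math
from typing import Dict, List

Params = Dict[str, List[float]]  # toy stand-in for a model state dict


def softmax(logits: List[float]) -> List[float]:
    """Map unconstrained logits to non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_trajectory(checkpoints: List[Params], logits: List[float]) -> Params:
    """Weighted parameter average of policy checkpoints (assumed merge rule)."""
    if len(checkpoints) != len(logits):
        raise ValueError("need exactly one logit per checkpoint")
    weights = softmax(logits)
    merged: Params = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Three toy checkpoints along a trajectory (hypothetical values).
trajectory = [{"w": [0.0, 0.0]}, {"w": [1.0, 2.0]}, {"w": [2.0, 4.0]}]

# Equal logits reduce to simple averaging of the checkpoints.
uniform_ref = merge_trajectory(trajectory, [0.0, 0.0, 0.0])

# A dominant logit on the last checkpoint approaches the single-previous-model
# reference used by standard iterative DPO.
last_ref = merge_trajectory(trajectory, [0.0, 0.0, 50.0])
```

Under this parameterization, simple averaging and the last-checkpoint reference are both special cases of the learned merge, which is one way to read the comparison the paper draws between them.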

If this is right

  • Standard iterative DPO experiences performance degradation in the middle and later stages due to accumulating reference-model errors.
  • TPMM-DPO produces higher win rates and reward scores than standard iterative DPO on both in-domain and out-of-domain evaluations.
  • Learnable-weight fusion alleviates late-stage degradation caused by noisy preferences more effectively than simple averaging of prior models.
  • The merged reference improves overall training stability while keeping the core DPO objective unchanged.
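To make the last point concrete, here is a minimal sketch of the standard per-pair DPO loss written so that the reference enters only through its two log-probabilities. The function name `dpo_pair_loss`, the argument names, and the default `beta=0.1` are illustrative choices, not values taken from the paper.

```python
import math


def dpo_pair_loss(pol_chosen: float, pol_rejected: float,
                  ref_chosen: float, ref_rejected: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    All four inputs are sequence log-probabilities. Swapping the reference
    model changes only ref_chosen and ref_rejected; the formula is untouched.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses the margin is zero and the loss equals log 2; any change of reference, merged or not, shifts the margin without altering the objective itself.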

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-merging idea could be tested on other iterative alignment procedures that also reuse prior models as references.
  • If the learned weights prove robust, the method offers an alternative to extensive preference-data cleaning by leveraging model history instead.
  • Longer iteration chains might show whether the stability benefit scales or eventually saturates as the trajectory length grows.
  • The approach suggests that any optimization process whose iterates carry cumulative information could benefit from explicit trajectory fusion rather than single-step references.

Load-bearing premise

The sequence of policy models forms a useful optimization trajectory whose learned fusion weights can be estimated without introducing new overfitting or bias that offsets the stability gain.

What would settle it

Running the same iterative DPO loop with identical noisy preference data and observing that TPMM-DPO produces equal or greater performance fluctuations and lower win rates than the single-previous-model baseline in later iterations.
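One way to operationalize that test is sketched below. The helper names, the use of population standard deviation as the fluctuation measure, and the choice of where the late stage begins are all assumptions for illustration; the paper may measure stability differently.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def late_stage_summary(win_rates: Sequence[float], late_from: int) -> Dict[str, float]:
    """Mean and fluctuation (population std) of win rates from round late_from on."""
    late = list(win_rates[late_from:])
    if len(late) < 2:
        raise ValueError("need at least two late-stage rounds")
    return {"mean": mean(late), "fluctuation": pstdev(late)}


def contradicts_claim(baseline: Sequence[float], tpmm: Sequence[float],
                      late_from: int) -> bool:
    """True if TPMM-DPO fluctuates at least as much AND wins less in late rounds."""
    b = late_stage_summary(baseline, late_from)
    t = late_stage_summary(tpmm, late_from)
    return t["fluctuation"] >= b["fluctuation"] and t["mean"] < b["mean"]


# Made-up per-round win rates, NOT results from the paper.
placeholder_baseline = [0.50, 0.50, 0.60, 0.60]
placeholder_tpmm = [0.50, 0.50, 0.40, 0.60]
outcome = contradicts_claim(placeholder_baseline, placeholder_tpmm, late_from=2)
```

Both runs would need to share the same noisy preference data and seeds for the comparison to isolate the effect of the reference model.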

Figures

Figures reproduced from arXiv: 2605.23398 by Lingling Fu, Yongfu Xu.

Figure 1: Overview of the proposed TPMM-DPO training framework.
Figure 2: Comparison of the chosen and rejected rewards after three rounds of iterative training for iterative DPO and …
Figure 3: Length-controlled win-rate comparison across iterative training rounds.
Original abstract

Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
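For orientation, the standard DPO loss and the two reference choices contrasted in the abstract can be written as follows. The first two lines are the usual formulation of DPO and of iterative DPO. The third is only an assumed form of the merged reference: the abstract says the policies are integrated with learned fusion weights but does not give the rule, so the weighted parameter sum, the symbols $w_k$ and $\theta_k$, and the simplex constraint are illustrative.

```latex
% Standard DPO loss for a policy \pi_\theta against a fixed reference \pi_{\mathrm{ref}}
\mathcal{L}_{\mathrm{DPO}}(\theta;\,\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)\right]

% Conventional iterative DPO: the reference for round t+1 is the previous policy
\pi_{\mathrm{ref}}^{(t+1)} = \pi_{\theta_t}

% Assumed TPMM-style reference: a weighted merge over the whole trajectory
\theta_{\mathrm{ref}}^{(t+1)} = \sum_{k=0}^{t} w_k\,\theta_k,
\qquad w_k \ge 0,\quad \sum_{k=0}^{t} w_k = 1
```

Here $y_w$ and $y_l$ are the chosen and rejected responses and $\beta$ is the usual DPO temperature. Setting $w_t = 1$ recovers the conventional scheme, and $w_k = 1/(t+1)$ gives simple averaging.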

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TPMM-DPO, which treats the sequence of policy models generated during iterative DPO as an optimization trajectory and merges them via learned fusion weights to produce a smoother reference model. The central claim is that this approach mitigates error accumulation from noisy preferences, improves training stability, and yields higher win rates and reward scores on both in-domain and out-of-domain evaluations compared to standard iterative DPO that uses only the previous policy as reference.

Significance. If the learned fusion weights produce a reference model whose effective error is strictly lower than any single prior policy without simply re-fitting the same noisy signal, the method would offer a practical way to stabilize iterative alignment without an explicit reward model. The trajectory-aware perspective is a distinct contribution relative to simple averaging or last-iteration baselines, and the reported robustness analyses could strengthen iterative DPO pipelines if the weight-learning procedure is shown to be non-circular.

major comments (2)
  1. [Method] Method section (description of preference-guided weight learning): the manuscript does not specify the loss function, regularization, or validation procedure used to optimize the fusion weights. Without this, it remains possible that the weights are fitted directly to the same noisy preference pairs that drive the policy updates, rendering the claimed stability gain a re-parameterization rather than an independent mitigation of error accumulation.
  2. [Experiments] Experiments (ablation and robustness analyses): the comparison of learnable-weight fusion versus simple averaging reports improved late-stage performance, yet no values of the learned weights, their variance across noise levels, or held-out validation metrics are provided. This leaves the central claim that the trajectory merge isolates error accumulation from new bias unverified.
minor comments (1)
  1. [Abstract] Abstract: performance claims are stated without any numerical deltas, error bars, or statistical tests, which weakens the reader's ability to gauge effect size from the summary alone.
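The circularity concern in the first major comment could be addressed by fitting the fusion weights on preference pairs that the policy update never sees. The sketch below shows such a disjoint split. It is not the authors' procedure, which the report says is unspecified; the function name `split_preferences`, the 20% default, and the fixed seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_preferences(pairs: Sequence[T], weight_fraction: float = 0.2,
                      seed: int = 0) -> Tuple[List[T], List[T]]:
    """Split preference pairs into disjoint policy-training and weight-fitting sets."""
    if not 0.0 < weight_fraction < 1.0:
        raise ValueError("weight_fraction must be strictly between 0 and 1")
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to form two non-empty sets")
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    n_weight = max(1, min(len(pairs) - 1, int(round(len(pairs) * weight_fraction))))
    weight_idx = set(indices[:n_weight])
    policy_pairs = [p for i, p in enumerate(pairs) if i not in weight_idx]
    weight_pairs = [p for i, p in enumerate(pairs) if i in weight_idx]
    return policy_pairs, weight_pairs
```

Reporting the learned weights and a held-out metric computed on a third split would then speak to the second major comment as well.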

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity is needed. We address the major comments point by point below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Method] Method section (description of preference-guided weight learning): the manuscript does not specify the loss function, regularization, or validation procedure used to optimize the fusion weights. Without this, it remains possible that the weights are fitted directly to the same noisy preference pairs that drive the policy updates, rendering the claimed stability gain a re-parameterization rather than an independent mitigation of error accumulation.

    Authors: We agree that the current manuscript lacks these specifications. The revised version will explicitly describe the loss function used to optimize the fusion weights, the regularization terms applied, and the validation procedure. This addition will clarify the separation between weight learning and the policy update process. revision: yes

  2. Referee: [Experiments] Experiments (ablation and robustness analyses): the comparison of learnable-weight fusion versus simple averaging reports improved late-stage performance, yet no values of the learned weights, their variance across noise levels, or held-out validation metrics are provided. This leaves the central claim that the trajectory merge isolates error accumulation from new bias unverified.

    Authors: The observation is accurate; the manuscript does not report these specific values or metrics. In the revision, we will add the learned weight values, their variance across noise levels, and held-out validation metrics to provide direct support for the claim that trajectory-aware merging mitigates error accumulation beyond simple averaging. revision: yes

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a learnable fusion mechanism that can be trained from the policy trajectory; this introduces at least one set of free parameters (the fusion weights) whose estimation procedure is not detailed in the abstract.

free parameters (1)
  • fusion weights
    Learned parameters that determine how much each prior policy contributes to the merged reference model; their values are fitted during the TPMM-DPO procedure.
axioms (1)
  • domain assumption: Preference data contains noise that accumulates across iterative DPO rounds when the previous policy is used as reference.
    Stated in the first paragraph of the abstract as the motivating problem.

pith-pipeline@v0.9.0 · 5766 in / 1244 out tokens · 32798 ms · 2026-05-25T03:34:06.502135+00:00 · methodology

