TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization
Pith reviewed 2026-05-25 03:34 UTC · model grok-4.3
The pith
TPMM-DPO merges the sequence of policy models along their optimization trajectory with learned weights to form a stable reference that reduces error accumulation in iterative DPO.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that by treating the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrating them using learned fusion weights, TPMM-DPO constructs a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores.
What carries the argument
Trajectory-aware preference-guided model merging (TPMM), which adaptively integrates the sequence of policy models using learned fusion weights to produce the reference model for the next DPO round.
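The merging step can be pictured with a minimal sketch. The summary does not say how the fusion weights are parameterized or learned, so this example assumes a plain weighted average in parameter space with softmax-normalized weights. The names `softmax`, `merge_trajectory`, and the toy checkpoints are hypothetical and do not come from the paper.

```python
import math

def softmax(logits):
    """Map unconstrained logits to positive weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def merge_trajectory(checkpoints, logits):
    """Weighted parameter-space average of a list of checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats
    logits: one unconstrained scalar per checkpoint (assumed parameterization)
    """
    if len(checkpoints) != len(logits):
        raise ValueError("need exactly one logit per checkpoint")
    weights = softmax(logits)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged

# Toy trajectory of three "policies", each with one 2-element parameter.
trajectory = [{"w": [0.0, 1.0]}, {"w": [1.0, 1.0]}, {"w": [2.0, 4.0]}]
uniform = merge_trajectory(trajectory, [0.0, 0.0, 0.0])      # simple averaging
latest = merge_trajectory(trajectory, [-50.0, -50.0, 50.0])  # nearly the last model only
```

Under this assumed parameterization, equal logits recover simple averaging, and one dominant logit recovers the single-previous-model reference, so both baselines are special cases of the weighted merge.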
If this is right
- Standard iterative DPO experiences performance degradation in the middle and later stages due to accumulating reference-model errors.
- TPMM-DPO produces higher win rates and reward scores than standard iterative DPO on both in-domain and out-of-domain evaluations.
- Learnable-weight fusion alleviates late-stage degradation caused by noisy preferences more effectively than simple averaging of prior models.
- The merged reference improves overall training stability while keeping the core DPO objective unchanged.
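The structural difference between the two procedures is only in how the reference is chosen each round, which a short sketch can make concrete. Everything here is a toy under stated assumptions: policies are single floats rather than parameter tensors, `toy_round` is a made-up stand-in for a DPO training round, `uniform_merge` stands in for the learned fusion, and the initial policy is assumed to be part of the trajectory.

```python
def iterative_dpo(initial_policy, rounds, train_round, build_reference):
    """Run `rounds` updates, choosing each round's reference from the trajectory so far."""
    trajectory = [initial_policy]
    for _ in range(rounds):
        reference = build_reference(trajectory)
        trajectory.append(train_round(trajectory[-1], reference))
    return trajectory

def last_policy(trajectory):
    """Standard iterative DPO: the reference is the previous policy."""
    return trajectory[-1]

def uniform_merge(trajectory):
    """Stand-in for a merged reference; a learned weighting would replace the mean."""
    return sum(trajectory) / len(trajectory)

def toy_round(policy, reference):
    """Made-up update: pull the policy toward the reference, then take a fixed step."""
    return 0.5 * (policy + reference) + 1.0

baseline = iterative_dpo(0.0, 3, toy_round, last_policy)
merged = iterative_dpo(0.0, 3, toy_round, uniform_merge)
```

Passing the reference builder as an argument keeps the training round itself unchanged, which mirrors the claim that the core DPO objective is left as is.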
Where Pith is reading between the lines
- The same trajectory-merging idea could be tested on other iterative alignment procedures that also reuse prior models as references.
- If the learned weights prove robust, the method offers an alternative to extensive preference-data cleaning by leveraging model history instead.
- Longer iteration chains might show whether the stability benefit scales or eventually saturates as the trajectory length grows.
- The approach suggests that any optimization process whose iterates carry cumulative information could benefit from explicit trajectory fusion rather than single-step references.
Load-bearing premise
The sequence of policy models forms a useful optimization trajectory whose learned fusion weights can be estimated without introducing new overfitting or bias that offsets the stability gain.
What would settle it
Running the same iterative DPO loop with identical noisy preference data and observing that TPMM-DPO produces equal or greater performance fluctuations and lower win rates than the single-previous-model baseline in later iterations.
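One way to make this test operational is sketched below. The two metrics (mean absolute round-to-round change and peak-to-final drop) are assumptions chosen for illustration, since the summary does not define how fluctuation is measured, and the win-rate lists are placeholder numbers, not results from the paper.

```python
def mean_abs_change(scores):
    """Average absolute round-to-round change, a simple fluctuation measure."""
    steps = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return sum(steps) / len(steps)

def late_stage_drop(scores):
    """Peak score minus final score; 0.0 means no degradation after the peak."""
    return max(scores) - scores[-1]

def refutes_claim(baseline, tpmm):
    """True if TPMM-DPO fluctuates at least as much and ends with a lower win rate."""
    return (mean_abs_change(tpmm) >= mean_abs_change(baseline)
            and tpmm[-1] < baseline[-1])

# Placeholder win rates per iteration (illustrative only, not measured values).
baseline_win_rates = [0.50, 0.60, 0.55, 0.50]
tpmm_win_rates = [0.50, 0.58, 0.62, 0.64]
verdict = refutes_claim(baseline_win_rates, tpmm_win_rates)
```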
Original abstract
Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
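To see where the reference model enters, it helps to write out the standard per-pair DPO loss from Rafailov et al.: the negative log-sigmoid of beta times the difference between the policy's and the reference's log-probability margins on the chosen and rejected responses. The sketch below is a scalar illustration, not the paper's training code; the function name and the default `beta=0.1` are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# When the policy equals the reference, the margin is zero and the loss is log(2).
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

Because the reference log-probabilities appear inside the margin, any error in the reference shifts the target of every update, which is the mechanism by which a poor reference can compound across rounds.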
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TPMM-DPO, which treats the sequence of policy models generated during iterative DPO as an optimization trajectory and merges them via learned fusion weights to produce a smoother reference model. The central claim is that this approach mitigates error accumulation from noisy preferences, improves training stability, and yields higher win rates and reward scores on both in-domain and out-of-domain evaluations compared to standard iterative DPO that uses only the previous policy as reference.
Significance. If the learned fusion weights produce a reference model whose effective error is strictly lower than any single prior policy without simply re-fitting the same noisy signal, the method would offer a practical way to stabilize iterative alignment without an explicit reward model. The trajectory-aware perspective is a distinct contribution relative to simple averaging or last-iteration baselines, and the reported robustness analyses could strengthen iterative DPO pipelines if the weight-learning procedure is shown to be non-circular.
major comments (2)
- [Method] Method section (description of preference-guided weight learning): the manuscript does not specify the loss function, regularization, or validation procedure used to optimize the fusion weights. Without this, it remains possible that the weights are fitted directly to the same noisy preference pairs that drive the policy updates, rendering the claimed stability gain a re-parameterization rather than an independent mitigation of error accumulation.
- [Experiments] Experiments (ablation and robustness analyses): the comparison of learnable-weight fusion versus simple averaging reports improved late-stage performance, yet no values of the learned weights, their variance across noise levels, or held-out validation metrics are provided. This leaves the central claim that the trajectory merge isolates error accumulation from new bias unverified.
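The circularity concern in the first comment could be addressed by fitting the fusion weights on preference pairs that the policy update never sees. The sketch below shows only such a disjoint split; it is one possible remedy, not the procedure used in the paper, and `split_preferences`, `holdout_fraction`, and the tuple layout are hypothetical.

```python
import random

def split_preferences(pairs, holdout_fraction=0.2, seed=0):
    """Shuffle and split pairs into (policy_pairs, weight_pairs) with no overlap."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Each pair is (prompt, chosen, rejected); the contents are placeholders.
pairs = [("prompt%d" % i, "chosen", "rejected") for i in range(10)]
policy_pairs, weight_pairs = split_preferences(pairs)
```

Reporting metrics on `weight_pairs`, or on a further held-out slice, would also supply the validation evidence requested in the second comment.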
minor comments (1)
- [Abstract] Abstract: performance claims are stated without any numerical deltas, error bars, or statistical tests, which weakens the reader's ability to gauge effect size from the summary alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional clarity is needed. We address the major comments point by point below and will revise the manuscript to incorporate the requested details.
- Referee: [Method] Method section (description of preference-guided weight learning): the manuscript does not specify the loss function, regularization, or validation procedure used to optimize the fusion weights. Without this, it remains possible that the weights are fitted directly to the same noisy preference pairs that drive the policy updates, rendering the claimed stability gain a re-parameterization rather than an independent mitigation of error accumulation.
Authors: We agree that the current manuscript lacks these specifications. The revised version will explicitly describe the loss function used to optimize the fusion weights, the regularization terms applied, and the validation procedure. This addition will clarify the separation between weight learning and the policy update process. revision: yes
- Referee: [Experiments] Experiments (ablation and robustness analyses): the comparison of learnable-weight fusion versus simple averaging reports improved late-stage performance, yet no values of the learned weights, their variance across noise levels, or held-out validation metrics are provided. This leaves the central claim that the trajectory merge isolates error accumulation from new bias unverified.
Authors: The observation is accurate; the manuscript does not report these specific values or metrics. In the revision, we will add the learned weight values, their variance across noise levels, and held-out validation metrics to provide direct support for the claim that trajectory-aware merging mitigates error accumulation beyond simple averaging. revision: yes
Axiom & Free-Parameter Ledger
free parameters (1)
- fusion weights
axioms (1)
- domain assumption: Preference data contains noise that accumulates across iterative DPO rounds when the previous policy is used as reference.
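A common way to study this assumption is to inject label noise by swapping the chosen and rejected responses with some probability. The sketch below assumes that random-flip noise model; the summary does not state which noise model the paper uses, and `flip_preferences` and `flip_prob` are hypothetical names.

```python
import random

def flip_preferences(pairs, flip_prob, seed=0):
    """Swap chosen and rejected in each (prompt, chosen, rejected) with probability flip_prob."""
    rng = random.Random(seed)
    noisy = []
    for prompt, chosen, rejected in pairs:
        if rng.random() < flip_prob:
            chosen, rejected = rejected, chosen
        noisy.append((prompt, chosen, rejected))
    return noisy

# Placeholder data: two preference pairs with obviously labeled responses.
clean = [("q1", "good", "bad"), ("q2", "good", "bad")]
unchanged = flip_preferences(clean, 0.0)
all_flipped = flip_preferences(clean, 1.0)
```

Sweeping `flip_prob` over several values would give the noise levels across which the referee asks for the variance of the learned weights.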