TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

Lingling Fu; Yongfu Xu

arxiv: 2605.23398 · v1 · pith:JGW25NC6new · submitted 2026-05-22 · 💻 cs.IR


Pith reviewed 2026-05-25 03:34 UTC · model grok-4.3

classification 💻 cs.IR

keywords iterative direct preference optimization · model merging · trajectory-aware fusion · LLM alignment · error accumulation · reference model · preference optimization


The pith

TPMM-DPO merges the sequence of policy models along their optimization trajectory with learned weights to form a stable reference that reduces error accumulation in iterative DPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Iterative DPO repeatedly uses the prior policy as reference, so noise in preferences and errors in that reference compound across rounds and produce late-stage over-optimization plus performance drops. TPMM-DPO reframes the entire sequence of policies as an optimization trajectory and merges them adaptively with learned fusion weights to build a smoother reference model. This change yields more stable training and higher win rates plus reward scores on both in-domain and out-of-domain data. A reader would care because the method keeps the simple DPO training loop intact while addressing the main practical failure mode of repeated iteration. Ablations confirm that learnable weights outperform fixed averaging at preventing the observed degradation.

Core claim

The central claim is that by treating the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrating them using learned fusion weights, TPMM-DPO constructs a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations.

What carries the argument

Trajectory-aware preference-guided model merging (TPMM), which adaptively integrates the sequence of policy models using learned fusion weights to produce the reference model for the next DPO round.
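
As a rough illustration of what merging "using learned fusion weights" could look like, the sketch below takes a weighted average of checkpoint parameters. The softmax parameterization, the names `softmax` and `merge_checkpoints`, and the toy parameter dicts are assumptions made for illustration; the abstract does not say how the paper normalizes or applies its weights.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_checkpoints(checkpoints, logits):
    """Weighted parameter average over a list of checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats
                 (a stand-in for real model state dicts; hypothetical).
    logits:      one unconstrained fusion logit per checkpoint (assumed).
    """
    if len(checkpoints) != len(logits):
        raise ValueError("need exactly one logit per checkpoint")
    weights = softmax(logits)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Toy trajectory of three policies, oldest first.
trajectory = [
    {"layer.weight": [0.0, 1.0]},
    {"layer.weight": [1.0, 1.0]},
    {"layer.weight": [2.0, 4.0]},
]

# Equal logits reduce to simple averaging.
uniform = merge_checkpoints(trajectory, [0.0, 0.0, 0.0])

# A dominant last logit approaches the single-previous-model reference.
latest_only = merge_checkpoints(trajectory, [-50.0, -50.0, 50.0])
```

Under this parameterization, simple averaging and the standard last-policy reference are both special cases of the weight vector, which is one way to read the comparison the paper draws between them.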

If this is right

Standard iterative DPO experiences performance degradation in the middle and later stages due to accumulating reference-model errors.
TPMM-DPO produces higher win rates and reward scores than standard iterative DPO on both in-domain and out-of-domain evaluations.
Learnable-weight fusion alleviates late-stage degradation caused by noisy preferences more effectively than simple averaging of prior models.
The merged reference improves overall training stability while keeping the core DPO objective unchanged.
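
To make concrete why the choice of reference matters while the DPO objective stays the same, here is a minimal per-pair sketch of the standard DPO loss from Rafailov et al. The function name `dpo_pair_loss`, the default `beta`, and the log-probability values are illustrative assumptions, not settings taken from the paper.

```python
import math


def dpo_pair_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy and the
    reference for the chosen and rejected responses.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Same policy log-probabilities scored against two different references
# (hypothetical values).
policy = (-10.0, -12.0)
loss_ref_a = dpo_pair_loss(*policy, ref_chosen=-11.0, ref_rejected=-11.0)
loss_ref_b = dpo_pair_loss(*policy, ref_chosen=-10.0, ref_rejected=-12.0)
```

The policy is identical in both calls, yet the loss differs because only the reference changed. When the reference equals the policy (`loss_ref_b`), the margin is zero and the loss is log 2.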

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-merging idea could be tested on other iterative alignment procedures that also reuse prior models as references.
If the learned weights prove robust, the method offers an alternative to extensive preference-data cleaning by leveraging model history instead.
Longer iteration chains might show whether the stability benefit scales or eventually saturates as the trajectory length grows.
The approach suggests that any optimization process whose iterates carry cumulative information could benefit from explicit trajectory fusion rather than single-step references.

Load-bearing premise

The sequence of policy models forms a useful optimization trajectory whose learned fusion weights can be estimated without introducing new overfitting or bias that offsets the stability gain.

What would settle it

Running the same iterative DPO loop with identical noisy preference data and observing that TPMM-DPO produces equal or greater performance fluctuations and lower win rates than the single-previous-model baseline in later iterations.
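
One way such a comparison could be quantified is sketched below with two crude statistics over a per-round score curve: the drop from the peak to the final round, and the mean absolute round-to-round change. Both function names and both curves are hypothetical placeholders, not measurements or metrics from the paper.

```python
def late_stage_drop(scores):
    """Peak score minus final score; 0.0 means the last round is the best."""
    return max(scores) - scores[-1]


def mean_abs_change(scores):
    """Average absolute round-to-round change, a crude fluctuation measure."""
    steps = [abs(later - earlier) for earlier, later in zip(scores, scores[1:])]
    return sum(steps) / len(steps)


# Placeholder per-round scores (made up for illustration only).
rising_then_falling = [0.50, 0.56, 0.60, 0.55, 0.51]
steady = [0.50, 0.54, 0.57, 0.59, 0.60]

for label, curve in [("rising_then_falling", rising_then_falling), ("steady", steady)]:
    print(label, late_stage_drop(curve), mean_abs_change(curve))
```

In the test described above, the claim would be in trouble if the TPMM-DPO curve scored equal or worse than the baseline curve on statistics of this kind in later rounds.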

Figures

Figures reproduced from arXiv: 2605.23398 by Lingling Fu, Yongfu Xu.

**Figure 2.** Comparison of the chosen and rejected rewards after three rounds of iterative training for iterative DPO and

**Figure 3.** Length-controlled win-rate comparison across iterative training rounds.

Original abstract

Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous iteration is repeatedly used as the reference model for subsequent rounds, noise in preference data and errors in the reference model accumulate over time. This accumulation can lead to late-stage over-optimization, performance fluctuations, and degraded generalization. To address these issues, we propose TPMM-DPO, a trajectory-aware preference-guided model merging method. The method treats the sequence of policy models generated during iterative DPO as an optimization trajectory and adaptively integrates them using learned fusion weights, thereby constructing a smoother and more robust reference model. In contrast to conventional iterative DPO, which relies solely on a single previous model, TPMM-DPO effectively mitigates error accumulation induced by noisy preferences and improves training stability. Experimental results show that standard iterative DPO often suffers from performance degradation in the middle and later stages of training, whereas TPMM-DPO consistently improves generation quality and achieves higher win rates and reward scores on both in-domain and out-of-domain evaluations. Further ablation studies and robustness analyses demonstrate that, compared with simple averaging, learnable-weight fusion more effectively alleviates late-stage performance degradation caused by noisy preferences.
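
The difference between the two reference choices described in the abstract can be written compactly. The block below is a sketch: the round index t, the weights α, and the parameter-space convex combination are assumed notation, since the abstract says only that the policies are integrated "using learned fusion weights". Here L_DPO denotes the standard DPO loss evaluated against a given reference.

```latex
% Round t of iterative DPO: train against a round-specific reference.
\theta_t = \arg\min_{\theta}\; \mathcal{L}_{\mathrm{DPO}}\bigl(\pi_\theta;\, \pi_{\mathrm{ref}}^{(t)}\bigr)

% Standard iterative DPO: the reference is the previous policy.
\pi_{\mathrm{ref}}^{(t)} = \pi_{\theta_{t-1}}

% Trajectory merge (assumed form): convex combination of earlier parameters.
\theta_{\mathrm{ref}}^{(t)} = \sum_{i=0}^{t-1} \alpha_i^{(t)}\, \theta_i,
\qquad \alpha_i^{(t)} \ge 0, \quad \sum_{i=0}^{t-1} \alpha_i^{(t)} = 1
```

In this notation, putting all weight on the most recent policy recovers standard iterative DPO, and equal weights give the simple-averaging baseline mentioned in the ablation.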

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TPMM-DPO merges prior DPO policies along the training trajectory with learned fusion weights to build a reference model, but the abstract supplies no equations or procedure for learning those weights.


The main claim is that iterative DPO accumulates errors when the latest policy serves as reference, and that merging the full sequence of policies with learned weights produces a smoother reference that reduces late-stage degradation. The abstract positions this as new relative to prior iterative DPO work and reports higher win rates plus better out-of-domain scores than both standard iterative DPO and simple averaging. That diagnosis of the accumulation problem is accurate and the merging response is a direct attempt to address it. The reported gains on in-domain and out-of-domain metrics give some indication that the approach can help in practice for groups already running multi-round DPO. The comparison to averaging is also useful as a baseline.

The central limitation is that the abstract never describes the loss or regularization used to learn the fusion weights, nor whether those weights are fit on held-out data or the same preference pairs that contain the noise. Without that, it is impossible to rule out that the extra parameters simply fit training artifacts rather than isolate error accumulation. The ablation and robustness sections are mentioned but not quantified here, so the strength of evidence remains unclear from the given text.

This work is aimed at labs doing iterative preference optimization on LLMs. A reader already facing stability issues in that pipeline could extract a practical idea if the full paper supplies the missing weight-training details. The paper engages honestly with the literature on iterative DPO and stays internally consistent, so it deserves a serious referee to examine the method section and results. I would send it to review rather than desk reject, with the expectation that the authors clarify the weight optimization procedure and provide the ablation numbers.

Referee Report

2 major / 1 minor

Summary. The paper proposes TPMM-DPO, which treats the sequence of policy models generated during iterative DPO as an optimization trajectory and merges them via learned fusion weights to produce a smoother reference model. The central claim is that this approach mitigates error accumulation from noisy preferences, improves training stability, and yields higher win rates and reward scores on both in-domain and out-of-domain evaluations compared to standard iterative DPO that uses only the previous policy as reference.

Significance. If the learned fusion weights produce a reference model whose effective error is strictly lower than any single prior policy without simply re-fitting the same noisy signal, the method would offer a practical way to stabilize iterative alignment without an explicit reward model. The trajectory-aware perspective is a distinct contribution relative to simple averaging or last-iteration baselines, and the reported robustness analyses could strengthen iterative DPO pipelines if the weight-learning procedure is shown to be non-circular.

major comments (2)

[Method] Method section (description of preference-guided weight learning): the manuscript does not specify the loss function, regularization, or validation procedure used to optimize the fusion weights. Without this, it remains possible that the weights are fitted directly to the same noisy preference pairs that drive the policy updates, rendering the claimed stability gain a re-parameterization rather than an independent mitigation of error accumulation.
[Experiments] Experiments (ablation and robustness analyses): the comparison of learnable-weight fusion versus simple averaging reports improved late-stage performance, yet no values of the learned weights, their variance across noise levels, or held-out validation metrics are provided. This leaves the central claim that the trajectory merge isolates error accumulation from new bias unverified.
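
Both major comments turn on whether the fusion weights are fitted on the same pairs that train the policy. A minimal sketch of the alternative, fitting on one split and checking on a disjoint held-out split, is given below. It is not the paper's procedure, which the report notes is unspecified. It assumes a toy setting in which the merged model's preference margin on each pair is the weighted sum of per-checkpoint margins, with softmax-normalized weights trained by gradient descent on a logistic loss; every name and number is hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merged_margins(margins, weights):
    """margins[k][n]: margin of checkpoint k on pair n (positive = prefers chosen)."""
    n_pairs = len(margins[0])
    return [sum(w * row[n] for w, row in zip(weights, margins)) for n in range(n_pairs)]


def mean_logistic_loss(margins, weights):
    merged = merged_margins(margins, weights)
    return sum(math.log1p(math.exp(-m)) for m in merged) / len(merged)


def fit_fusion_logits(margins, steps=200, lr=0.5):
    """Gradient descent on the mean logistic loss with respect to the logits."""
    logits = [0.0] * len(margins)
    n_pairs = len(margins[0])
    for _ in range(steps):
        weights = softmax(logits)
        merged = merged_margins(margins, weights)
        grads = [0.0] * len(logits)
        for n, m in enumerate(merged):
            dloss_dm = -1.0 / (1.0 + math.exp(m)) / n_pairs  # -sigmoid(-m) / N
            for j, w in enumerate(weights):
                # Softmax chain rule: d m / d logit_j = w_j * (margin_j - m).
                grads[j] += dloss_dm * w * (margins[j][n] - m)
        logits = [z - lr * g for z, g in zip(logits, grads)]
    return logits


# Hypothetical margins for three checkpoints on disjoint splits.
fit_margins = [[0.5, 0.8, 0.2, 0.6], [-0.4, -0.1, -0.7, 0.1], [0.1, 0.3, -0.2, 0.2]]
heldout_margins = [[0.4, 0.7], [-0.3, -0.5], [0.2, 0.0]]

weights = softmax(fit_fusion_logits(fit_margins))
uniform = [1.0 / len(fit_margins)] * len(fit_margins)

print("fitted weights:", [round(w, 3) for w in weights])
print("held-out loss, uniform:", round(mean_logistic_loss(heldout_margins, uniform), 4))
print("held-out loss, fitted: ", round(mean_logistic_loss(heldout_margins, weights), 4))
```

Reporting the fitted weights and the held-out loss side by side, as the last three lines do, is the kind of evidence the second major comment asks for.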

minor comments (1)

[Abstract] Abstract: performance claims are stated without any numerical deltas, error bars, or statistical tests, which weakens the reader's ability to gauge effect size from the summary alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional clarity is needed. We address the major comments point by point below and will revise the manuscript to incorporate the requested details.


Referee: [Method] Method section (description of preference-guided weight learning): the manuscript does not specify the loss function, regularization, or validation procedure used to optimize the fusion weights. Without this, it remains possible that the weights are fitted directly to the same noisy preference pairs that drive the policy updates, rendering the claimed stability gain a re-parameterization rather than an independent mitigation of error accumulation.

Authors: We agree that the current manuscript lacks these specifications. The revised version will explicitly describe the loss function used to optimize the fusion weights, the regularization terms applied, and the validation procedure. This addition will clarify the separation between weight learning and the policy update process. revision: yes

Referee: [Experiments] Experiments (ablation and robustness analyses): the comparison of learnable-weight fusion versus simple averaging reports improved late-stage performance, yet no values of the learned weights, their variance across noise levels, or held-out validation metrics are provided. This leaves the central claim that the trajectory merge isolates error accumulation from new bias unverified.

Authors: The observation is accurate; the manuscript does not report these specific values or metrics. In the revision, we will add the learned weight values, their variance across noise levels, and held-out validation metrics to provide direct support for the claim that trajectory-aware merging mitigates error accumulation beyond simple averaging. revision: yes

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a learnable fusion mechanism that can be trained from the policy trajectory; this introduces at least one set of free parameters (the fusion weights) whose estimation procedure is not detailed in the abstract.

free parameters (1)

fusion weights
Learned parameters that determine how much each prior policy contributes to the merged reference model; their values are fitted during the TPMM-DPO procedure.

axioms (1)

domain assumption: Preference data contains noise that accumulates across iterative DPO rounds when the previous policy is used as reference.
Stated in the first paragraph of the abstract as the motivating problem.


