s1: Simple test-time scaling · 2025

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

cs.IR · 2026-05-22 · unverdicted · novelty 5.0

TPMM-DPO applies trajectory-aware learned-weight merging of prior policy models to stabilize iterative DPO against preference noise accumulation.

TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization · cs.IR · 2026-05-22 · unverdicted · none · ref 29