Implicit reward as the bridge: A unified view of SFT and DPO connections

Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu · 2025 · arXiv 2507.00018

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

cs.LG · 2025-12-12 · unverdicted · novelty 5.0

Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

Sample-efficient LLM Optimization with Reset Replay

cs.LG · 2025-08-08 · unverdicted · novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.

citing papers explorer

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning · cs.LG · 2025-12-12 · unverdicted · none · ref 42
Sample-efficient LLM Optimization with Reset Replay · cs.LG · 2025-08-08 · unverdicted · none · ref 16
