Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

Guhan Chen; Hejin Wang; Jiansheng Wei; Jing Li; Jun Rao; Min Zhang; Xiaojun Meng; Xuebo Liu; Zixiong Yu

arXiv: 2602.05370 · v3 · pith:IJFJSDQO · submitted 2026-02-05 · 💻 cs.CL

keywords: corrective, exploration, PACE, reasoning, alignment, distribution, increasing, iterative

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.
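
The trade-off the abstract describes for Best-of-N sampling can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the paper's analysis: it assumes independent samples, a fixed per-sample success probability `p`, and a fixed verifier false-positive rate `fpr`, all of which are hypothetical values chosen only for illustration. Under those assumptions, the chance of mining at least one correct trajectory is $1-(1-p)^N$, whose marginal gain $p(1-p)^N$ shrinks as $N$ grows, while the chance that at least one incorrect trajectory slips past the verifier keeps rising with $N$.

```python
def p_at_least_one_correct(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra success probability from drawing sample n+1 after n samples."""
    return p_at_least_one_correct(p, n + 1) - p_at_least_one_correct(p, n)


def p_any_false_positive(p: float, fpr: float, n: int) -> float:
    """Probability that at least one of n samples is wrong but accepted.

    Assumes each sample is independently wrong with probability (1 - p)
    and that a wrong sample passes the verifier with probability fpr.
    """
    per_sample = (1.0 - p) * fpr
    return 1.0 - (1.0 - per_sample) ** n


if __name__ == "__main__":
    p, fpr = 0.2, 0.02  # hypothetical values, not taken from the paper
    for n in (2, 3, 8, 16):
        print(
            f"N={n:2d}  "
            f"hit={p_at_least_one_correct(p, n):.3f}  "
            f"next-sample gain={marginal_gain(p, n):.3f}  "
            f"false-positive risk={p_any_false_positive(p, fpr, n):.3f}"
        )
```

In this toy model, each additional sample buys less coverage than the one before it while adding roughly the same amount of false-positive exposure, which is the qualitative pattern the abstract attributes to high-$N$ mining.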
