
arxiv: 2602.05370 · v3 · pith:IJFJSDQOnew · submitted 2026-02-05 · 💻 cs.CL

Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

classification 💻 cs.CL
keywords corrective · exploration · PACE · reasoning · alignment · distribution · increasing · iterative

Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling ($N\geq8$) to mine positive trajectories from the distribution tail. In this work, we show that in mathematical reasoning, increasing $N$ yields diminishing returns while increasing verifier-induced false-positive risk and the distribution shift required for policy updates. To address this, we introduce PACE (Proximal Alignment via Corrective Exploration), a generation-based corrective framework that replaces exhaustive mining with low-budget exploration ($2\leq N\leq3$). Rather than searching for increasingly rare positive samples, PACE synthesizes high-fidelity preference pairs from failed explorations through corrective hindsight refinement and verification-guided filtering. Empirically, PACE matches or exceeds the performance of DPO-R1 ($N=16$) while using about $1/5$ of the compute, and remains robust under 20\% label corruption, where high-$N$ baselines exhibit substantially higher noise exploitation.
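
The diminishing-returns claim can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes each sampled trajectory is independently correct with probability `p` and that a verifier wrongly accepts an incorrect trajectory with probability `fpr`. Both parameters and all function names are hypothetical, chosen only for illustration. Under these assumptions, the chance of finding at least one correct sample is $1-(1-p)^N$, whose marginal gain $p(1-p)^{N-1}$ shrinks geometrically in $N$, while the chance that at least one incorrect sample slips past the verifier keeps growing.

```python
def hit_rate(p: float, n: int) -> float:
    """Probability that at least one of n i.i.d. samples is correct."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra hit probability from drawing the n-th sample (n >= 1)."""
    return hit_rate(p, n) - hit_rate(p, n - 1)


def false_accept_rate(p: float, fpr: float, n: int) -> float:
    """Probability that at least one of n samples is incorrect AND accepted.

    Each sample is incorrect with probability (1 - p) and, if incorrect,
    is wrongly accepted by the (assumed) verifier with probability fpr.
    """
    per_sample = (1.0 - p) * fpr
    return 1.0 - (1.0 - per_sample) ** n


# Illustrative values only; they are not taken from the paper.
p, fpr = 0.2, 0.05
for n in (2, 3, 8, 16):
    print(
        f"N={n:2d}  hit={hit_rate(p, n):.3f}  "
        f"gain={marginal_gain(p, n):.3f}  "
        f"false_accept={false_accept_rate(p, fpr, n):.3f}"
    )
```

In this toy model the per-sample gain falls with every additional draw while the false-accept probability rises monotonically, which matches the qualitative trade-off described in the abstract. Real trajectories are neither independent nor equally difficult, so the numbers should be read as intuition rather than as an estimate.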
