
arXiv: 2512.23075 · v5 · pith:5KSMKQEQnew · submitted 2025-12-28 · 💻 cs.LG · cs.AI · cs.IT · math.IT · stat.ML

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

classification 💻 cs.LG cs.AI cs.IT math.IT stat.ML
keywords bound, bounds, mathrm, region, trust, long-horizon, policy, divergence

Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
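The sequence-level masking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the threshold `delta`, the KL direction (rollout policy relative to training policy), the use of full per-position distributions, and the choice to average over kept sequences are all assumptions made for illustration. The sketch computes the maximum token-level KL divergence for each sequence and zeroes out the whole sequence when that maximum exceeds the threshold.

```python
import math

def token_kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def max_token_kl(roll_dists, theta_dists):
    """Largest per-position KL along one sequence.

    roll_dists / theta_dists: one next-token distribution per position,
    under the rollout policy and the training policy respectively
    (hypothetical inputs; the KL direction is an assumption).
    """
    return max(token_kl(p, q) for p, q in zip(roll_dists, theta_dists))

def trust_region_mask(batch_roll, batch_theta, delta):
    """1.0 keeps a sequence, 0.0 masks the entire sequence.

    A sequence is masked as a whole when any of its positions exceeds
    the (hypothetical) threshold delta.
    """
    return [1.0 if max_token_kl(r, t) <= delta else 0.0
            for r, t in zip(batch_roll, batch_theta)]

def masked_mean(per_seq_losses, mask):
    """Average the loss over kept sequences only (a normalization assumption)."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_seq_losses, mask)) / kept

# Two toy sequences of length 2 over a 2-token vocabulary.
roll = [
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.5, 0.5], [0.9, 0.1]],
]
theta = [
    [[0.5, 0.5], [0.88, 0.12]],  # small drift at position 1
    [[0.5, 0.5], [0.1, 0.9]],    # large drift at position 1
]
mask = trust_region_mask(roll, theta, delta=0.01)
loss = masked_mean([2.0, 10.0], mask)
```

In this toy batch the second sequence is dropped because one position diverges strongly, even though its other position matches exactly. That is the contrast with per-token clipping: the decision is made once per sequence from the worst position, not independently per token.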



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    CPPO replaces uniform token-level trust regions in PPO-style RLVR with position-weighted thresholds and cumulative prefix budgets to better align with autoregressive generation.

  2. $\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

    cs.LG 2026-05 unverdicted novelty 5.0

    f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.

  3. Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

    cs.LG 2026-06 unverdicted novelty 4.0

    Introduces Discrepancy-Constrained MDP (DCMDP) with Lagrangian relaxation to optimize LLM RL under train-inference discrepancy constraints, claiming performance gains on 8B and 30B models.