
arxiv: 2606.29526 · v1 · pith:UB7O6OIUnew · submitted 2026-06-28 · 💻 cs.LG

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

Pith reviewed 2026-06-30 07:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM reinforcement learning · training-inference mismatch · policy optimization · monotonic improvement · MIPI · MIPU · off-policyness · reasoning performance

The pith

An effective update to the training policy does not ensure improvement of the inference policy used in deployment for LLM RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models run separate engines for training and inference, producing inconsistent probabilities for identical trajectories even with matched parameters. This built-in mismatch creates a persistent off-policyness that prior stabilization methods do not address at the objective level. The paper shows that an update improving the training policy can still leave the inference policy unchanged or worse. It therefore replaces the usual training objective with Monotonic Inference Policy Improvement (MIPI) and supplies MIPU, a two-step procedure that generates candidate updates and accepts only those whose inference-side gap proxy indicates monotonic gain. Experiments at two model scales under high mismatch report higher average reasoning performance and greater training stability.

Core claim

Because training and inference engines assign different probabilities to the same sequences, an update that raises the training policy objective need not raise the inference policy objective. The paper therefore defines Monotonic Inference Policy Improvement (MIPI) as the correct target and introduces Monotonic Inference Policy Update (MIPU), which constructs sampler-referenced candidate updates and selectively accepts only those whose inference-side gap proxy signals monotonic improvement on the inference policy.
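The accept-or-rollback pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `propose`, `inference_proxy`, and `mipu_step` are hypothetical names, the proxy is assumed to be signed so that larger values indicate a better inference policy, and the acceptance rule `proxy >= -c` is an assumed form of the tolerance test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    params: List[float]  # parameters kept after this step
    accepted: bool       # whether the candidate was accepted
    proxy: float         # value of the (hypothetical) inference-side proxy


def mipu_step(
    params: List[float],
    propose: Callable[[List[float]], List[float]],
    inference_proxy: Callable[[List[float], List[float]], float],
    c: float = 0.0,
) -> StepResult:
    """One accept-or-rollback step.

    propose:         hypothetical Step 1, builds a candidate from current params.
    inference_proxy: hypothetical Step 2 signal computed on the inference side;
                     assumed here to be larger when the candidate looks better.
    c:               tolerance that absorbs proxy noise.
    """
    candidate = propose(params)
    proxy = inference_proxy(params, candidate)
    if proxy >= -c:
        return StepResult(candidate, True, proxy)
    # Reject the candidate and keep the previous parameters.
    return StepResult(params, False, proxy)
```

With a toy proposer that adds 1.0 and a proxy that returns -0.5, calling `mipu_step(..., c=0.1)` rejects the candidate and returns the original parameters, while a proxy of -0.05 falls within the tolerance and is accepted.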

What carries the argument

The argument rests on the Monotonic Inference Policy Improvement (MIPI) objective, together with the two-step MIPU framework that filters candidate updates by an inference-side gap proxy.

If this is right

  • MIPU produces higher average reasoning performance than prior mismatch-handling methods under high training-inference mismatch.
  • Selective acceptance of synchronized candidates using the inference-side gap proxy increases training stability and reduces collapse risk.
  • The approach works across two different model scales without requiring changes to the underlying sampler or reward model.
  • MIPI replaces the training-policy objective as the quantity to be maximized at every step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the proxy proves reliable across wider mismatch regimes, training loops could drop direct inference measurements entirely.
  • The same selective-acceptance pattern might apply to other settings where training and deployment engines diverge, such as quantized or distilled models.
  • A tighter theoretical bound on the proxy-to-inference correlation would strengthen the guarantee that accepted updates are monotonic on the deployed policy.

Load-bearing premise

The inference-side gap proxy reliably identifies updates that produce monotonic improvement on the actual inference policy without requiring direct measurement or additional assumptions about the mismatch distribution.

What would settle it

Direct measurement on the inference engine showing that an update accepted by the MIPU proxy produces no improvement or produces degradation in inference policy performance.
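One way such a direct measurement could be scored is a paired comparison of rewards on the same held-out prompts before and after an accepted update. The sketch below is an assumption-laden illustration: the function names, the use of a simple z-style threshold, and the default `z = 2.0` are all hypothetical choices, and the rewards are assumed to come from inference-engine rollouts.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_value_change(
    before: Sequence[float], after: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean change, standard error) of per-prompt reward differences."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 prompts")
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs), stdev(diffs) / math.sqrt(len(diffs))


def looks_degraded(
    before: Sequence[float], after: Sequence[float], z: float = 2.0
) -> bool:
    """Flag an update whose mean change is below zero by more than z standard errors."""
    delta, se = paired_value_change(before, after)
    return delta < -z * se
```

An accepted update for which `looks_degraded` returns `True` on inference-engine rollouts would count as evidence against the proxy; a single noisy comparison would not be conclusive on its own.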

Figures

Figures reproduced from arXiv: 2606.29526 by Bo Zheng, Hongyao Tang, Jianye Hao, Jing Liang, Jinyi Liu, Ju Huang, Weixun Wang, Wenbo Su, Xiaoyang Li, Yancheng He, Yan Zheng, Yi Ma.

Figure 1: Monotonic Inference Policy Update (MIPU) resolves the Objective Misalignment issue of LLM RL. Canonical LLM RL accepts synchronized updates by a training-side objective, which does not necessarily imply improvement of the inference policy. Here, π and µ denote the training policy and inference policy respectively, c is a tolerance parameter accounting for proxy noise. To address this mismatch, we propose a…
Figure 2: Performance of different methods under FP8-quantized rollout. Compared methods show unstable training dynamics and may suffer from sharp performance drops, while MIPU maintains a stable score trajectory.
Figure 3: Training curves for ablation studies under FP8-quantized rollout. We show the training score, the inference-training K3-KL, T̂_post (i.e., inference gap) and the rollback rate computed over a 100-step moving window. Step 1 improves the candidate update direction, while Step 2 introduces inference-gap-aware acceptance to filter unreliable synchronized candidates. The full method obtains stronger performance …
Figure 4: (a) Inference-training K3-KL and T̂_post (i.e., inference gap) under FP8-quantized rollout. Qwen3-1.7B exhibits larger mismatch and a more volatile T̂_post than Qwen3-4B. (b) Comparison between inference-gap-aware Step 2 acceptance and a random rollback control. Random rollback rejects more updates, applying fewer effective policy changes, but still collapses. Post-update gap as an inference-side signal. We …
Figure 5: Step 1 implementation analysis under the Qwen3-4B FP8-quantized rollout. Comparison of PPO-IS, Vanilla-IS, and TIS in terms of performance, gradient norm, inference-training K3-KL, and clip ratio. Although this ratio directly references the sampler, it mixes the pre-update mismatch between π_k and µ_k with the current update from π_k to π_θ. As a result, the ratio can already deviate substantially from 1 even …
Figure 6: Sensitivity to the acceptance tolerance c under the Qwen3-4B FP8-quantized rollout in terms of inference-training K3-KL, training score, and T̂_post.
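Several captions above refer to an inference-training K3-KL and to a rollback rate computed over a 100-step moving window. The sketch below shows one plausible way to compute both. K3 here means the estimator (r − 1) − log r with r a probability ratio; the choice of direction (tokens sampled by the inference policy µ, ratio r = π/µ) and the names `k3_kl` and `RollbackRate` are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import deque
from typing import Deque, Sequence


def k3_kl(logp_train: Sequence[float], logp_infer: Sequence[float]) -> float:
    """K3 estimate of KL(mu || pi) from tokens sampled by the inference policy mu.

    Per token: k3 = (r - 1) - log r with r = pi / mu. Each term is non-negative.
    """
    if len(logp_train) != len(logp_infer) or not logp_train:
        raise ValueError("need two non-empty, equal-length log-prob sequences")
    total = 0.0
    for lp_pi, lp_mu in zip(logp_train, logp_infer):
        log_r = lp_pi - lp_mu
        total += math.expm1(log_r) - log_r  # (r - 1) - log r
    return total / len(logp_train)


class RollbackRate:
    """Fraction of rolled-back steps among the most recent `window` steps."""

    def __init__(self, window: int = 100) -> None:
        self._flags: Deque[bool] = deque(maxlen=window)

    def update(self, rolled_back: bool) -> float:
        self._flags.append(rolled_back)
        return sum(self._flags) / len(self._flags)
```

When the two engines agree exactly on every token, `k3_kl` returns 0.0; any disagreement makes it strictly positive, which is why it can serve as a mismatch monitor.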
Original abstract

Reinforcement learning (RL) has gained growing attention in large language model (LLM) post-training, yet RL training remains fragile and can suffer from instability or collapse. One vital cause is training-inference mismatch: LLMs adopt separate inference and training engines for generation efficiency and training precision, which in practice yield inconsistent probabilities for the same trajectories on the training and inference sides, even with synchronized model parameters. This naturally induces a special type of off-policyness that is always present and poisons the training. Prior works have made various efforts to address the off-policyness and stabilize the training policies under the mismatch. In this paper, we point out an objective misalignment neglected by existing works: an effective update to the policy in the training engine does not necessarily ensure improvement of the inference policy, i.e., the one used in deployment. To this end, we propose a new policy optimization objective for LLM RL, named Monotonic Inference Policy Improvement (MIPI). Following this principle, we introduce Monotonic Inference Policy Update (MIPU), a two-step LLM RL framework that constructs sampler-referenced candidate updates and selectively accepts synchronized candidates using an inference-side gap proxy. Experiments conducted on two model scales under high mismatch show that MIPU improves average reasoning performance and training stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a training-inference mismatch in LLM RL arising from separate engines, which creates persistent off-policyness. It argues that optimizing the training policy does not guarantee monotonic improvement in the deployed inference policy, and proposes the Monotonic Inference Policy Improvement (MIPI) objective. It introduces the Monotonic Inference Policy Update (MIPU) framework, which generates sampler-referenced candidate updates and accepts them via an inference-side gap proxy. Experiments on two model scales under high mismatch report gains in average reasoning performance and training stability.

Significance. If MIPU's proxy reliably selects only updates that improve the true inference policy, the work would address a practical source of instability in LLM post-training that prior off-policy corrections have overlooked. The experimental results on multiple scales provide initial evidence of benefit, but the absence of a bound or stated assumption linking the proxy to monotonic inference improvement limits the strength of the central claim.

major comments (2)
  1. §4 (MIPU framework): The inference-side gap proxy is described as the mechanism that ensures only synchronized candidates producing monotonic inference-policy improvement are accepted, yet no derivation, bound, or explicit assumption on the mismatch distribution is supplied to establish that the proxy selects such updates. Without this link, an update passing the proxy could still degrade inference performance.
  2. §5 (Experiments): The reported gains in reasoning performance and stability are shown under high mismatch, but the evaluation does not include a direct measurement of inference-policy value before/after accepted updates or an ablation isolating the proxy's contribution from the candidate-generation step.
minor comments (2)
  1. Notation for the training and inference policies should be introduced with explicit symbols (e.g., π_train vs. π_inf) at first use rather than relying on descriptive phrases.
  2. The abstract states that MIPU 'constructs sampler-referenced candidate updates'; the corresponding section should clarify whether the sampler is the training engine itself or an external reference distribution.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger theoretical grounding of the proxy and more targeted experimental validation. We respond to each major comment below.

Point-by-point responses
  1. Referee: §4 (MIPU framework): The inference-side gap proxy is described as the mechanism that ensures only synchronized candidates producing monotonic inference-policy improvement are accepted, yet no derivation, bound, or explicit assumption on the mismatch distribution is supplied to establish that the proxy selects such updates. Without this link, an update passing the proxy could still degrade inference performance.

    Authors: We agree that the current manuscript lacks an explicit derivation or bound connecting the inference-side gap proxy to guaranteed monotonic improvement of the inference policy. The proxy is motivated directly by the MIPI objective (difference in expected value under the inference policy) and uses the observed log-probability gap between engines on sampler trajectories as a practical surrogate. We will revise §4 to state the assumption of bounded mismatch (i.e., the total variation distance between training and inference distributions is upper-bounded by a known constant) and provide a short proof sketch showing that, under this assumption, acceptance by the proxy implies non-negative change in inference policy value. This is a partial revision because a fully general bound without assumptions on the mismatch remains an open question.

    Revision: partial

  2. Referee: §5 (Experiments): The reported gains in reasoning performance and stability are shown under high mismatch, but the evaluation does not include a direct measurement of inference-policy value before/after accepted updates or an ablation isolating the proxy's contribution from the candidate-generation step.

    Authors: We concur that direct before/after inference-policy value measurements and a proxy ablation would strengthen the empirical claims. In the revised version we will add (i) inference-engine rollouts to compute policy value on a held-out set before and after each accepted update and (ii) an ablation that disables the proxy (i.e., always accepts sampler-referenced candidates) while keeping the candidate-generation step fixed. Results will be reported for both model scales under the same high-mismatch regime.

    Revision: yes
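The bounded-mismatch assumption mentioned in the first response can be made concrete as follows. This is an illustrative sketch only, not the authors' proof: the symbols δ, R_max, and J are introduced here for illustration, rewards are assumed to be trajectory-level and to lie in [0, R_max], and π_k and µ_k denote the training and inference policies at step k.

```latex
% Illustrative assumption: bounded mismatch at every step k (delta is hypothetical).
D_{\mathrm{TV}}(\pi_k, \mu_k)
  = \tfrac{1}{2} \sum_{\tau} \lvert \pi_k(\tau) - \mu_k(\tau) \rvert \le \delta

% With J(p) = \mathbb{E}_{\tau \sim p}[R(\tau)] and R(\tau) \in [0, R_{\max}]:
\lvert J(\pi_k) - J(\mu_k) \rvert
  \le R_{\max}\, D_{\mathrm{TV}}(\pi_k, \mu_k) \le R_{\max}\, \delta

% Applying the bound at steps k and k+1:
J(\mu_{k+1}) - J(\mu_k) \ge J(\pi_{k+1}) - J(\pi_k) - 2 R_{\max}\, \delta
```

Under these assumptions, a training-side gain is only guaranteed to carry over to the inference policy when it exceeds 2 R_max δ, which is one way to read the referee's concern that smaller training-side gains may not transfer.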

Circularity Check

0 steps flagged

No circularity; MIPI/MIPU presented as new objective without self-referential reduction or fitted predictions

full rationale

The abstract and provided text introduce the training-inference mismatch as motivation and propose MIPI as a new objective plus MIPU as a two-step framework using an inference-side gap proxy. No equations, parameter-fitting procedures, or derivation steps are shown that would allow any claim to reduce to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is described. The central claim of objective misalignment is stated directly rather than derived from prior self-work. Because the paper's derivation chain contains no inspectable steps that match the enumerated circularity patterns, the score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the central claim rests on the existence of persistent training-inference probability mismatch and on the validity of the inference-side gap proxy as a selector.

axioms (1)
  • Domain assumption: Training and inference engines produce inconsistent probabilities for identical trajectories even with synchronized parameters.
    Stated in abstract as the source of off-policyness that poisons training.

pith-pipeline@v0.9.1-grok · 5795 in / 1178 out tokens · 25249 ms · 2026-06-30T07:20:41.272492+00:00 · methodology



For readability, we omit token or trajectory arguments in this section, and all ratios are evaluated on sampled rollouts. A direct implementation is PPO-IS, which clips the total trainer-to-sampler ratio:

$\omega_{\text{PPO-IS}} = \operatorname{clip}\left(\frac{\pi_\theta}{\mu_k},\ 1-\epsilon,\ 1+\epsilon\right)$

[Plot residue omitted: pass@1 reward versus training step.]
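The PPO-IS weight above can be computed per token from log-probabilities. The following is a minimal sketch: the function name `ppo_is_weight` and the default `eps = 0.2` are assumptions chosen for illustration, and the log-probabilities are assumed to be for the same sampled token under the trainer (π_θ) and the sampler (µ_k).

```python
import math


def ppo_is_weight(logp_train: float, logp_sampler: float, eps: float = 0.2) -> float:
    """Clipped trainer-to-sampler ratio: clip(pi_theta / mu_k, 1 - eps, 1 + eps)."""
    ratio = math.exp(logp_train - logp_sampler)
    return min(max(ratio, 1.0 - eps), 1.0 + eps)
```

Because the ratio is taken against the sampler rather than against the pre-update training policy, it can differ from 1 even when θ has not moved: if the two engines already disagree on a token, `ppo_is_weight` returns a value other than 1.0 (or saturates at a clip boundary) before any gradient step is taken.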