Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Introduces Discrepancy-Constrained MDP (DCMDP) with Lagrangian relaxation to optimize LLM RL under train-inference discrepancy constraints, claiming performance gains on 8B and 30B models.
citing papers explorer
- The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning
  Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.
- Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy
  Introduces Discrepancy-Constrained MDP (DCMDP) with Lagrangian relaxation to optimize LLM RL under train-inference discrepancy constraints, claiming performance gains on 8B and 30B models.