Stabilizing Policy Optimization via Logits Convexity

Hongzhan Chen; Shiping Gao; Tao Yang; Ting Yao; Xiaojun Quan; Yuhua Zhu

arxiv: 2603.00963 · v2 · pith:Z54I7K4Fnew · submitted 2026-03-01 · 💻 cs.LG · cs.CL


classification 💻 cs.LG cs.CL

keywords optimization, policy, convexity, logits, stabilizing, framework, gradient, model


While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
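The convexity contrast described in the abstract can be checked numerically on a toy example. The sketch below is not the authors' LCO method or their analysis. It only tests midpoint convexity, over random pairs of logit vectors, for a single-token cross-entropy (SFT) loss and for the negative PPO clipped surrogate. The function names, the four-way vocabulary, the advantage of +1, the old-policy probability of 0.25, and the clip range of 0.2 are all illustrative assumptions.

```python
import math
import random


def sft_loss(z, y):
    """Cross-entropy of target index y given logits z: logsumexp(z) - z[y]."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[y]


def ppo_clip_loss(z, y, old_logp, adv, eps=0.2):
    """Negative PPO clipped surrogate for one sampled token y (toy setting)."""
    logp = -sft_loss(z, y)                      # log pi(y) under current logits
    ratio = math.exp(logp - old_logp)           # importance ratio pi / pi_old
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * adv, clipped * adv)


def midpoint_violations(loss, dim, trials, rng, tol=1e-9):
    """Count random pairs (a, b) with loss((a+b)/2) > (loss(a)+loss(b))/2.

    A convex function never violates this inequality, so any positive
    count shows the loss is not convex in the logits.
    """
    count = 0
    for _ in range(trials):
        a = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        b = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        mid = [(u + v) / 2.0 for u, v in zip(a, b)]
        if loss(mid) > 0.5 * (loss(a) + loss(b)) + tol:
            count += 1
    return count


rng = random.Random(0)
old_logp = math.log(0.25)  # assumed old-policy probability of the sampled token

sft_violations = midpoint_violations(lambda z: sft_loss(z, 0), 4, 2000, rng)
ppo_violations = midpoint_violations(
    lambda z: ppo_clip_loss(z, 0, old_logp, adv=1.0), 4, 2000, rng
)

print("SFT midpoint-convexity violations:", sft_violations)
print("PPO-clip midpoint-convexity violations:", ppo_violations)
```

The cross-entropy loss is a log-sum-exp minus a linear term, so it is convex in the logits and the first count should be zero. The clipped surrogate depends on the softmax probability itself, which is neither convex nor concave in the logits, so the second count should be positive.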



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
cs.CL · 2026-04 · unverdicted · novelty 6.0

IPVRM learns prefix values to produce reliable step rewards from sequence outcomes using TD learning, enabling distribution-level RL that improves reasoning when paired with calibrated rewards.