
arxiv: 2511.04256 · v2 · submitted 2025-11-06 · 💻 cs.CL

SSPO: Subsentence-level Policy Optimization

Pith reviewed 2026-05-18 01:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords subsentence policy optimization · RLVR · LLM reasoning · policy optimization · reinforcement learning · importance sampling · adaptive clipping · math reasoning

The pith

SSPO computes importance ratios at the subsentence level to stabilize policy updates in reinforcement learning for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Subsentence-level Policy Optimization (SSPO) to address stability problems in Reinforcement Learning from Verifiable Rewards (RLVR), a post-training approach for improving LLM reasoning. GRPO uses token-level importance ratios, which overemphasize outlier tokens and risk training collapse, while GSPO uses response-level ratios, which often yield a near-zero clipping fraction and so retain noisy tokens. SSPO computes ratios at the subsentence level as a middle ground and adds entropy-based adaptive clipping that loosens bounds for high-entropy tokens and tightens them for low-entropy ones. Experiments show SSPO reaching an average score of 46.72 on five math datasets with the Qwen2.5-1.5B-Math model, ahead of GRPO (43.01) and GSPO (44.42). Readers care because this granularity choice could make post-training more reliable without extra hyperparameter searches.

Core claim

SSPO alleviates training collapse and excessive variance while avoiding indiscriminate response retention by computing importance ratios at the subsentence level, striking a balance between GRPO and GSPO, and incorporates subsentence-level entropy into PPO-CLIP to adaptively adjust clipping bounds for improved exploration and stability.

What carries the argument

Subsentence-level importance ratio with entropy-driven adaptive clipping bounds inside the PPO-CLIP update rule.
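
To make the three granularities concrete, here is a minimal sketch of how token-level, subsentence-level, and response-level importance ratios could be computed for one response. It assumes, by analogy with GSPO's length-normalized sequence ratio, that a group's ratio is the geometric mean of its token ratios; the paper's exact subsentence definition is not reproduced on this page, so the function name `importance_ratios`, the `segment_ids` input, and the toy numbers are all hypothetical.

```python
import math


def importance_ratios(logp_new, logp_old, segment_ids):
    """Return (token, subsentence, response) level ratios for one response.

    Assumption (not confirmed by the source): a group's ratio is the
    geometric mean of its token ratios, i.e. exp(mean log-ratio).
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    token = [math.exp(d) for d in diffs]

    # Collect per-token log-ratios by subsentence id.
    groups = {}
    for seg, d in zip(segment_ids, diffs):
        groups.setdefault(seg, []).append(d)
    seg_ratio = {seg: math.exp(sum(ds) / len(ds)) for seg, ds in groups.items()}

    # Broadcast back so every token carries its subsentence's ratio.
    subsentence = [seg_ratio[seg] for seg in segment_ids]

    response = math.exp(sum(diffs) / len(diffs))
    return token, subsentence, response


# Toy response: 100 tokens in 10 subsentences of 10 tokens each.
# One token (index 5) has ratio 10; every other token has ratio 1.
logp_old = [-1.0] * 100
logp_new = [-1.0] * 100
logp_new[5] = -1.0 + math.log(10.0)
segment_ids = [i // 10 for i in range(100)]

token, subsentence, response = importance_ratios(logp_new, logp_old, segment_ids)
print(round(token[5], 4), round(subsentence[5], 4), round(response, 4))
# prints: 10.0 1.2589 1.0233
```

With an illustrative clip range of [0.8, 1.2], the outlier is caught at the token and subsentence levels but diluted to about 1.02 at the response level, which is the dilution effect the paper attributes to GSPO.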

If this is right

  • Policy updates become more stable by preventing both outlier dominance and full-response retention.
  • Average scores on math reasoning benchmarks rise, with state-of-the-art results on four of five datasets for the 1.5B model.
  • Adaptive entropy clipping reduces the need for manual bound tuning across different training runs.
  • On the 7B model, SSPO also records the highest average score, ahead of five baseline methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subsentence unit might transfer to non-reasoning tasks if boundaries are chosen to match natural pause points in the output.
  • Entropy signals could be replaced or augmented by other uncertainty measures without changing the overall architecture.
  • Combining SSPO with length-based or format-based rewards might further reduce variance in long-form generation.
  • Smaller models trained this way could close more of the gap to larger models on verifiable-reward benchmarks.

Load-bearing premise

Subsentence boundaries can be defined consistently and the entropy signal reliably indicates when to loosen or tighten clipping without introducing new selection biases or requiring extensive hyperparameter tuning per dataset.

What would settle it

A dataset where subsentence segmentation is ambiguous and performance falls below GSPO or GRPO while clipping fractions remain near zero would falsify the central claim.
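
The clipping fraction referred to here can be measured with a few lines of code. The sketch below uses the standard PPO-CLIP condition for when the clipped branch is the active one; the function name `clip_fraction` and the default bounds of 0.2 are illustrative assumptions, not values taken from the paper.

```python
def clip_fraction(ratios, advantages, eps_low=0.2, eps_high=0.2):
    """Fraction of units for which the PPO-CLIP bound is the active term.

    In min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A), the clipped
    branch is selected when A > 0 and r > 1 + eps_high, or when
    A < 0 and r < 1 - eps_low.
    """
    if not ratios:
        return 0.0
    clipped = 0
    for r, a in zip(ratios, advantages):
        if (a > 0 and r > 1.0 + eps_high) or (a < 0 and r < 1.0 - eps_low):
            clipped += 1
    return clipped / len(ratios)


# Two of the four units hit a bound in the direction their advantage pushes.
print(clip_fraction([1.30, 1.05, 0.70, 0.95], [1.0, 1.0, -1.0, -1.0]))
# prints: 0.5
```

The units can be tokens, subsentences, or whole responses, so the same function can compare the three granularities on identical rollouts.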

Figures

Figures reproduced from arXiv: 2511.04256 by Jing Xiao, Kun Yang, Ning Cheng, Shaojun Wang, Yanmeng Wang, Zhigen Li, Zikang Chen.

Figure 1: The training entropy of SSPO, SSPO (w/o entropy clip), GSPO and GRPO.
Original abstract

As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues: GRPO (Group Relative Policy Optimization) often suffers from unstable policy updates, while GSPO (Group Sequence Policy Optimization) can retain high-variance tokens. In GRPO, the importance ratio is computed at the token level, which overemphasizes individual tokens and makes learning sensitive to outliers, potentially causing training collapse. GSPO instead computes a response-level importance ratio, mitigating variance and reducing the accumulation of token-level noise present in GRPO. Nevertheless, our experiments show that GSPO frequently yields a near-zero clipping fraction: extreme token-level ratios can be diluted by other tokens in the same response, causing the entire response to be retained and resulting in unstable updates. We propose SSPO, which computes importance ratios at the subsentence level, striking a balance between GRPO and GSPO. SSPO alleviates training collapse and excessive variance while avoiding the failure mode in which the clipping mechanism indiscriminately retains entire responses. Moreover, we incorporate subsentence-level entropy into PPO-CLIP to adaptively adjust the clipping bounds: we encourage exploration for high-entropy tokens while tightening the clipping range for low-entropy tokens. Empirically, SSPO achieves an average score of 46.72 across five datasets on Qwen2.5-1.5B-Math model, outperforming GRPO (43.01) and GSPO (44.42), and attains state-of-the-art results on four datasets. On Qwen2.5-7B-Math model, SSPO also achieves the highest averaged scores over five baseline methods. These results demonstrate SSPO's effectiveness in RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SSPO, a variant of policy optimization for RLVR in LLMs. It computes importance sampling ratios at the subsentence level (rather than token or full-response) and modulates the PPO clipping bounds using subsentence entropy to encourage exploration on high-entropy subsentences while tightening on low-entropy ones. The central empirical claim is that this yields an average score of 46.72 across five math datasets on Qwen2.5-1.5B-Math (vs. GRPO 43.01 and GSPO 44.42), attaining SOTA on four of them, with similar gains on the 7B variant.

Significance. If the reported gains prove robust to boundary definitions and hyperparameter choices, SSPO would offer a practical middle ground between the instability of token-level ratios and the over-retention problem of response-level ratios, plus a lightweight adaptive clipping mechanism. The work supplies concrete numbers on standard math reasoning benchmarks and demonstrates applicability to both 1.5B and 7B models, which is useful for the RLVR literature even if further controls are required.

major comments (3)
  1. [§4] §4 (Experiments): No ablation is reported that replaces the subsentence segmentation rule with fixed-length chunks or sentence-level splits while holding the entropy-based clipping and all other hyperparameters fixed. Without this control, the 2.3-point average improvement over GSPO (3.7 points over GRPO) cannot be confidently attributed to the subsentence importance ratio rather than to the particular boundary heuristic or per-dataset tuning.
  2. [Table 1 / §4.2] Table 1 / §4.2: The headline scores (46.72, 43.01, 44.42) are presented without error bars, standard deviations across seeds, or statistical significance tests. Given that RLVR runs are known to be sensitive to random seeds and learning-rate schedules, the absence of these statistics leaves the central claim of consistent outperformance only partially supported.
  3. [§3.2] §3.2 (Adaptive Clipping): The entropy signal is used to widen or tighten the clip range, yet the paper does not specify how subsentence entropy is normalized or thresholded, nor whether the same entropy-to-clip mapping is used across all five datasets. This detail is load-bearing for the claim that the method avoids both collapse and indiscriminate retention.
minor comments (2)
  1. [§3.1] The precise rule used to delineate subsentences (punctuation, length, or model-based) should be stated explicitly in §3.1 with pseudocode or a short example.
  2. [Figure 2] Figure 2 (clipping fraction curves) would benefit from a legend that distinguishes the three methods and from reporting the number of runs averaged.
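
The kind of short example requested in minor comment 1, together with the fixed-length control from major comment 1, could look like the following sketch. The delimiter set, the chunk size, and both function names are hypothetical; the paper's actual segmentation rule is not stated on this page.

```python
def segment_by_punctuation(tokens, delimiters=(",", ".", ";", ":", "?", "!", "\n")):
    """Assign a subsentence id to each token; a delimiter closes its segment.

    The delimiter set is an illustrative assumption, not the paper's rule.
    """
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok in delimiters or tok.strip() in delimiters:
            current += 1
    return ids


def segment_fixed_length(tokens, chunk_size=4):
    """Control condition: cut every chunk_size tokens, ignoring punctuation."""
    return [i // chunk_size for i in range(len(tokens))]


tokens = ["Let", " x", " =", " 2", ",", " then", " x", "^2", " =", " 4", "."]
print(segment_by_punctuation(tokens))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(segment_fixed_length(tokens))    # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

A rule this simple would also split on a decimal point if the tokenizer emits "." as its own token, which is one reason the exact rule matters for reproducibility.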

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): No ablation is reported that replaces the subsentence segmentation rule with fixed-length chunks or sentence-level splits while holding the entropy-based clipping and all other hyperparameters fixed. Without this control, the 2.3-point average improvement over GSPO (3.7 points over GRPO) cannot be confidently attributed to the subsentence importance ratio rather than to the particular boundary heuristic or per-dataset tuning.

    Authors: We agree that a controlled ablation isolating the segmentation strategy would strengthen attribution of the gains. Our subsentence boundaries are chosen to respect linguistic units (punctuation-delimited semantic segments) rather than arbitrary splits, which we argue better balances token-level noise and response-level dilution. In the revised manuscript we will add an ablation that replaces the subsentence rule with both fixed-length chunks and sentence-level splits while freezing the entropy-based clipping and all other hyperparameters, reporting the resulting performance on the same five datasets. revision: yes

  2. Referee: [Table 1 / §4.2] Table 1 / §4.2: The headline scores (46.72, 43.01, 44.42) are presented without error bars, standard deviations across seeds, or statistical significance tests. Given that RLVR runs are known to be sensitive to random seeds and learning-rate schedules, the absence of these statistics leaves the central claim of consistent outperformance only partially supported.

    Authors: The referee is correct that variability statistics are important for RLVR experiments. Our original runs used a single seed per configuration owing to compute limits. We have now executed three independent seeds for the primary comparisons on both model sizes and will update Table 1 to report mean scores together with standard deviations. We will also add a brief discussion of statistical significance testing in §4.2. revision: yes

  3. Referee: [§3.2] §3.2 (Adaptive Clipping): The entropy signal is used to widen or tighten the clip range, yet the paper does not specify how subsentence entropy is normalized or thresholded, nor whether the same entropy-to-clip mapping is used across all five datasets. This detail is load-bearing for the claim that the method avoids both collapse and indiscriminate retention.

    Authors: We thank the referee for highlighting the missing implementation details. Subsentence entropy is the mean token-level entropy within each subsentence, normalized to [0, 1] by division by log(|V|). The clipping interval is then scaled linearly according to normalized entropy e: lower bound = 1 − 0.2(1 − e), upper bound = 1 + 0.2e, with a fixed threshold of 0.4 used to decide whether to widen or tighten. This exact mapping is applied uniformly to all five datasets. We have inserted the precise formulas, normalization procedure, and threshold into §3.2 together with pseudocode in the revised manuscript. revision: yes
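
The mapping described in response 3 can be written out as a minimal sketch. It follows the formulas as quoted in the simulated rebuttal (mean token entropy divided by log(|V|), then linear bounds with a base width of 0.2); the function names are hypothetical, and the 0.4 threshold is left out because the text does not say how it interacts with the linear mapping.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


def normalized_subsentence_entropy(token_dists):
    """Mean token entropy within a subsentence, divided by log(|V|)."""
    vocab_size = len(token_dists[0])
    mean_h = sum(token_entropy(d) for d in token_dists) / len(token_dists)
    return mean_h / math.log(vocab_size)


def clip_bounds(e, base=0.2):
    """Bounds as quoted: lower = 1 - base * (1 - e), upper = 1 + base * e."""
    lower = 1.0 - base * (1.0 - e)
    upper = 1.0 + base * e
    return lower, upper


# Toy subsentence over a 4-word vocabulary: one uniform and one one-hot
# distribution, so the normalized entropy is about 0.5.
dists = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
e = normalized_subsentence_entropy(dists)
print(clip_bounds(e))  # approximately (0.9, 1.1)
```

Under the mapping as quoted, the interval width stays at 0.2 for every e: entropy shifts the interval from about [0.8, 1.0] at e = 0 to about [1.0, 1.2] at e = 1 rather than widening it.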

Circularity Check

0 steps flagged

No significant circularity; SSPO is an independent algorithmic proposal validated empirically

full rationale

The paper defines SSPO by explicitly describing its two core modifications—subsentence-level importance ratios and entropy-adaptive clipping bounds—then reports empirical scores on five math datasets against GRPO and GSPO baselines. No equation or claim reduces by construction to a fitted parameter, self-citation, or renamed input; the method is presented as a direct response to observed failure modes (token-level outliers in GRPO, near-zero clipping in GSPO) and is tested through new runs rather than derived from the same quantities used to measure success. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard policy gradient assumptions and the existence of a verifiable reward signal; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption: Policy gradient methods can be stabilized by adjusting the granularity at which importance ratios are computed.
    Central premise of moving from token or response level to subsentence level.




Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

    cs.LG 2026-05 unverdicted novelty 6.0

    Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.

  2. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  3. Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

    cs.LG 2026-04 unverdicted novelty 5.0

    Intra-group objectives in sparse-reward RL must maintain token gradient exchangeability to enable cancellation on weak-credit tokens and stabilize training.
