
arxiv: 2606.09871 · v1 · pith:HRZGXEDUnew · submitted 2026-06-02 · 💻 cs.CV · cs.AI · cs.LG

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Pith reviewed 2026-06-28 10:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords segment decomposition · GRPO · vision-language generation · policy optimization · long-form responses · credit assignment · multimodal models · reinforcement learning

The pith

SD-GRPO replaces GRPO's single scalar advantage with a vector of per-segment advantages from z-normalized verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group Relative Policy Optimization applies one advantage value to an entire output, which limits its fit to long vision-language responses that contain multiple grounded parts. The paper introduces Segment-Decomposed GRPO, which splits outputs into segments, computes independent verifiable rewards for each, and z-normalizes those rewards across rollouts to produce a vector of advantages instead of one scalar. On tasks where segments are independent, this change produces consistent gains that grow with the number of segments. When segments share context, combining the per-segment vector with a holistic reward further improves results over either approach alone.

Core claim

SD-GRPO z-normalizes verifiable per-segment rewards across the rollout group to yield a vector of per-segment advantages in place of a single scalar advantage, and this change consistently outperforms the GRPO baseline on long-form VL generation tasks, with blending of holistic and per-segment signals required when segments share context.

What carries the argument

The per-segment advantage vector obtained by z-normalizing verifiable rewards across the rollout group, which replaces the scalar advantage used in standard GRPO.
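
As a rough illustration of the mechanism described above, the following NumPy sketch contrasts a scalar GRPO advantage with a per-segment advantage vector. It is a minimal sketch, not the paper's implementation: the function names, the `eps` stabilizer, the use of the population standard deviation, and the toy rewards are all assumptions made for this example.

```python
import numpy as np

def grpo_scalar_advantages(rollout_rewards, eps=1e-6):
    """Scalar GRPO-style advantage: one z-normalized value per rollout.

    rollout_rewards: shape (G,), one reward per rollout in the group.
    """
    r = np.asarray(rollout_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def segment_advantages(segment_rewards, eps=1e-6):
    """Per-segment z-normalization across the rollout group.

    segment_rewards: shape (G, K), reward of segment k in rollout i.
    Returns shape (G, K); each column (segment) is normalized on its own.
    """
    r = np.asarray(segment_rewards, dtype=np.float64)
    mean = r.mean(axis=0, keepdims=True)
    std = r.std(axis=0, keepdims=True)
    return (r - mean) / (std + eps)

# Hypothetical group of G = 4 rollouts with K = 3 segments each.
rewards = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
])

# Scalar baseline: here the rollout reward is assumed to be the segment sum.
scalar_adv = grpo_scalar_advantages(rewards.sum(axis=1))
vector_adv = segment_advantages(rewards)
```

In this toy group, rollout 0 scores above the group average overall, so its scalar advantage is positive and would reinforce all of its tokens, including segment 1, which it got wrong. The per-segment vector assigns that segment a negative advantage instead.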

If this is right

  • Gains from SD-GRPO increase with the number of segments on tasks where segments are semantically independent.
  • Rollout-level rewards produce cross-segment credit misattribution that grows with output length.
  • Blending holistic and per-segment rewards improves results when segments share context across the image.
  • SD-GRPO can be added to any existing GRPO framework with only small code changes.
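
The blending idea in the list above could be sketched as follows. This summary does not state the paper's blending rule, nor whether blending is applied to rewards or to advantages, so everything here is an assumption: the convex weight `alpha`, the choice to blend normalized advantages, and the function names are hypothetical.

```python
import numpy as np

def znorm(x, axis=0, eps=1e-6):
    """Z-normalize along the rollout axis."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    return (x - mean) / (std + eps)

def blended_advantages(segment_rewards, holistic_rewards, alpha=0.5):
    """Convex blend of per-segment and holistic advantages (illustrative only).

    segment_rewards:  shape (G, K), one reward per segment per rollout.
    holistic_rewards: shape (G,), one whole-output reward per rollout.
    alpha: hypothetical weight in [0, 1]; 1 keeps only the per-segment
           signal, 0 recovers a scalar advantage shared by all segments.
    """
    seg_adv = znorm(segment_rewards, axis=0)              # (G, K)
    hol_adv = znorm(holistic_rewards, axis=0)[:, None]    # (G, 1), broadcast over K
    return alpha * seg_adv + (1.0 - alpha) * hol_adv
```

With `alpha=0.0` every segment of a rollout receives the same value, which is the scalar GRPO behaviour; intermediate values let a holistic score carry the shared context that independent segment rewards cannot see.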

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automatic methods to discover segments could extend the approach to tasks that lack explicit divisions.
  • The same per-segment normalization may reduce credit misattribution in other long-sequence reinforcement learning settings.
  • Reward models for vision-language agents might benefit from explicit decomposition when outputs contain multiple verifiable parts.

Load-bearing premise

Long-form vision-language outputs possess a natural segmentation into parts for which independent verifiable rewards can be computed without substantial new labeling or modeling error.

What would settle it

An experiment on a long-form VL benchmark where per-segment rewards cannot be assigned independently would show no performance gain for SD-GRPO over GRPO if the central claim is false.

Figures

Figures reproduced from arXiv: 2606.09871 by Hannah Yun, Hyunwoong Kim, Jonggwon Park, Junhyun Park, Seongeun Lee.

Figure 1: Illustration of SD-GRPO. (a) Vision-language generation from semantically rich images
Figure 2: Reasoning trajectories from our SD-GRPO-trained model on multi-chart long-form VQA.
Figure 3: Visualization of the composite scientific figure captioning experiment. Segment rewards
Original abstract

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Segment-Decomposed GRPO (SD-GRPO), which extends standard GRPO by z-normalizing verifiable per-segment rewards across a rollout group to produce a vector of per-segment advantages instead of a single scalar advantage. It evaluates the approach on three long-form VL tasks of increasing semantic entanglement: a controlled multi-panel captioning task from DOCCI (independent segments), a multi-chart VQA task from MultiChartQA (credit misattribution analyzed theoretically and empirically), and scientific figure captioning on MMSci (entangled subfigures requiring blending of holistic and per-segment rewards). The central claim is that SD-GRPO consistently outperforms GRPO baselines, with larger gains at higher segment counts on independent cases, and that the method integrates into existing GRPO frameworks such as Dr. GRPO with minimal overhead.

Significance. If the empirical gains and theoretical analysis of credit misattribution hold under scrutiny, the work offers a targeted improvement to credit assignment for long-form VL generation by exploiting natural output segmentation, a signal discarded by scalar-advantage methods. The low-overhead integration claim and the controlled-to-real-world progression of experiments are strengths that could make the technique broadly useful if the per-segment reward assumption generalizes.
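
To make the low-overhead integration claim more concrete, here is one way a per-segment advantage vector might be mapped onto tokens before it enters an existing token-level policy-gradient loss. This is a sketch under stated assumptions, not the paper's code: the `segment_ids` convention, the `-1` marker, and the choice to give zero advantage to tokens outside any segment are all hypothetical.

```python
import numpy as np

def token_advantages(seg_adv, segment_ids):
    """Broadcast a per-segment advantage vector onto the tokens of one rollout.

    seg_adv:     shape (K,), advantages of the K segments of this rollout.
    segment_ids: shape (T,), segment index of each token, or -1 for tokens
                 that belong to no segment (assumed to get zero advantage).
    Returns shape (T,), one advantage per token.
    """
    seg_adv = np.asarray(seg_adv, dtype=np.float64)
    segment_ids = np.asarray(segment_ids)
    in_segment = segment_ids >= 0
    out = np.zeros(segment_ids.shape, dtype=np.float64)
    out[in_segment] = seg_adv[segment_ids[in_segment]]
    return out

# Hypothetical rollout: 6 tokens, 2 segments, first and last token unsegmented.
tok_adv = token_advantages([1.0, -0.5], [-1, 0, 0, 1, 1, -1])
```

Under these assumptions the only change to a scalar-advantage training loop is that the per-token advantage tensor is filled from a lookup instead of a single repeated value, which is consistent with the "minimal implementation overhead" described in the abstract.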

major comments (2)
  1. [Abstract / MMSci evaluation] Abstract and MMSci evaluation: the statement that 'per-segment normalization alone is insufficient when segments are semantically entangled' and that blending with holistic rewards is required directly qualifies the central claim; if computing the per-segment rewards in such cases necessarily incorporates cross-segment context or additional modeling error, the resulting advantage vector no longer isolates the intended credit assignment, undermining consistent outperformance across the three settings of increasing entanglement.
  2. [MultiChartQA evaluation] MultiChartQA section: the theoretical claim that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length is load-bearing for motivating the vector advantage; without the explicit derivation or equations showing how z-normalization removes this scaling (as opposed to merely reweighting it), the empirical outperformance cannot be attributed to the proposed mechanism rather than other implementation choices.
minor comments (2)
  1. [Abstract / Methods] The abstract refers to 'verifiable per-segment rewards' and 'semantically independent' segments without defining the reward functions or segmentation procedure; these should be specified with pseudocode or equations in the methods section to allow reproduction.
  2. [DOCCI evaluation] No mention of statistical significance, number of runs, or variance for the 'consistently outperforms' and 'larger gains at higher segment counts' claims on DOCCI; tables or figures reporting these quantities are needed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and note planned revisions.

Point-by-point responses
  1. Referee: [Abstract / MMSci evaluation] Abstract and MMSci evaluation: the statement that 'per-segment normalization alone is insufficient when segments are semantically entangled' and that blending with holistic rewards is required directly qualifies the central claim; if computing the per-segment rewards in such cases necessarily incorporates cross-segment context or additional modeling error, the resulting advantage vector no longer isolates the intended credit assignment, undermining consistent outperformance across the three settings of increasing entanglement.

    Authors: We disagree that the statement undermines the central claim. The manuscript structures its evaluation explicitly around increasing entanglement levels, showing SD-GRPO gains on independent segments and further gains from blending on entangled ones; this progression demonstrates the method's flexibility rather than a limitation. Per-segment rewards are derived from independent verifiable signals without cross-segment context in the reward computation, and blending is applied afterward as an optional adjustment. We will revise the abstract and MMSci discussion to explicitly state that the core per-segment advantage vector remains isolated even under blending.
    Revision: partial

  2. Referee: [MultiChartQA evaluation] MultiChartQA section: the theoretical claim that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length is load-bearing for motivating the vector advantage; without the explicit derivation or equations showing how z-normalization removes this scaling (as opposed to merely reweighting it), the empirical outperformance cannot be attributed to the proposed mechanism rather than other implementation choices.

    Authors: We agree that the theoretical section would benefit from greater explicitness. The current analysis derives the scaling of misattribution under scalar advantages and shows how per-segment z-normalization removes cross-segment terms, but we will expand the MultiChartQA section with the complete set of equations for the advantage computation and scaling behavior in the revision.
    Revision: yes
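
For intuition about the misattribution claim debated above, a simple worked case may help. It is not the paper's derivation. It assumes, purely for illustration, that the rollout-level reward is the sum of the segment rewards, that segment rewards are uncorrelated across segments within the group, and that they share a common variance $\sigma^2$.

```latex
% Assumption: rollout reward is additive over K segments.
R_i = \sum_{k=1}^{K} r_{i,k}, \qquad
\bar{R} = \frac{1}{G}\sum_{i=1}^{G} R_i = \sum_{k=1}^{K} \bar{r}_k, \qquad
\delta_{i,k} = r_{i,k} - \bar{r}_k .

% The centered scalar reward applied to every token of segment k splits into
% an on-segment term and a cross-segment term:
R_i - \bar{R}
  = \underbrace{\delta_{i,k}}_{\text{segment } k}
  + \underbrace{\sum_{j \neq k} \delta_{i,j}}_{\text{other segments}} .

% Assumption: uncorrelated segments with common variance \sigma^2.
\operatorname{Var}\Big(\sum_{j \neq k} \delta_{i,j}\Big) = (K-1)\,\sigma^2,
\qquad
\frac{\operatorname{Var}(\delta_{i,k})}{\operatorname{Var}(R_i - \bar{R})} = \frac{1}{K} .
```

Under these assumptions only a $1/K$ share of the variance in the scalar signal reflects the segment being updated, so the cross-segment share grows with the number of segments. A per-segment advantage built from $\delta_{i,k}$ alone contains no cross-segment term.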

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces SD-GRPO as an empirical algorithmic modification to GRPO: it defines per-segment advantages directly via z-normalization of verifiable rewards within rollout groups. No equations derive a result that reduces to the inputs by construction, no parameters are fitted on a subset and then renamed as predictions, and no self-citation chain or uniqueness theorem is invoked to force the method. Performance claims rest on direct comparisons against GRPO baselines across three external datasets (DOCCI, MultiChartQA, MMSci), with the segment-independence premise stated as an operating assumption rather than a derived necessity. The approach is therefore self-contained as a heuristic extension without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that natural, verifiable segment boundaries exist in long-form VL outputs and that cross-segment credit misattribution is the dominant failure mode of scalar GRPO.

axioms (1)
  • Domain assumption: long-form VL outputs admit segmentation into parts with independent verifiable rewards.
    Invoked when the paper distinguishes 'semantically independent' vs 'entangled' segments and claims per-segment rewards are available.


