SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Hannah Yun; Hyunwoong Kim; Jonggwon Park; Junhyun Park; Seongeun Lee

arxiv: 2606.09871 · v1 · pith:HRZGXEDUnew · submitted 2026-06-02 · 💻 cs.CV · cs.AI· cs.LG

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Hyunwoong Kim , Seongeun Lee , Hannah Yun , Junhyun Park , Jonggwon Park This is my paper

Pith reviewed 2026-06-28 10:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords segment decompositionGRPOvision-language generationpolicy optimizationlong-form responsescredit assignmentmultimodal modelsreinforcement learning

0 comments

The pith

SD-GRPO replaces GRPO's single scalar advantage with a vector of per-segment advantages from z-normalized verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Group Relative Policy Optimization applies one advantage value to an entire output, which limits its fit to long vision-language responses that contain multiple grounded parts. The paper introduces Segment-Decomposed GRPO, which splits outputs into segments, computes independent verifiable rewards for each, and z-normalizes those rewards across rollouts to produce a vector of advantages instead of one scalar. On tasks where segments are independent, this change produces consistent gains that grow with the number of segments. When segments share context, combining the per-segment vector with a holistic reward further improves results over either approach alone.

Core claim

SD-GRPO z-normalizes verifiable per-segment rewards across the rollout group to yield a vector of per-segment advantages in place of a single scalar advantage, and this change consistently outperforms the GRPO baseline on long-form VL generation tasks, with blending of holistic and per-segment signals required when segments share context.

What carries the argument

The per-segment advantage vector obtained by z-normalizing verifiable rewards across the rollout group, which replaces the scalar advantage used in standard GRPO.

If this is right

Gains from SD-GRPO increase with the number of segments on tasks where segments are semantically independent.
Rollout-level rewards produce cross-segment credit misattribution that grows with output length.
Blending holistic and per-segment rewards improves results when segments share context across the image.
SD-GRPO can be added to any existing GRPO framework with only small code changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automatic methods to discover segments could extend the approach to tasks that lack explicit divisions.
The same per-segment normalization may reduce credit misattribution in other long-sequence reinforcement learning settings.
Reward models for vision-language agents might benefit from explicit decomposition when outputs contain multiple verifiable parts.

Load-bearing premise

Long-form vision-language outputs possess a natural segmentation into parts for which independent verifiable rewards can be computed without substantial new labeling or modeling error.

What would settle it

An experiment on a long-form VL benchmark where per-segment rewards cannot be assigned independently would show no performance gain for SD-GRPO over GRPO if the central claim is false.

Figures

Figures reproduced from arXiv: 2606.09871 by Hannah Yun, Hyunwoong Kim, Jonggwon Park, Junhyun Park, Seongeun Lee.

**Figure 2.** Figure 2: Reasoning trajectories from our SD-GRPO-trained model on multi-chart long-form VQA. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of the composite scientific figure captioning experiment. Segment rewards [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

read the original abstract

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SD-GRPO replaces GRPO's scalar advantage with per-segment z-normalized vectors to fix credit assignment in long VL outputs, with clearest gains on independent segments but needing blends when context is shared.

read the letter

The main thing here is that SD-GRPO takes the GRPO setup and decomposes the advantage into a vector by z-normalizing per-segment rewards within each rollout group. This targets the coarse credit assignment that hurts long-form VL tasks.

The paper tests this across three settings ordered by entanglement. On the DOCCI multi-panel task with independent segments, SD-GRPO beats the baseline and the gap widens with more segments. On MultiChartQA they lay out both a theoretical argument and empirical results showing how rollout-level rewards misattribute credit as length grows. On MMSci, where subfigure captions share context, they report that pure per-segment normalization is not enough and blending with holistic rewards improves results further. They also note the change plugs into Dr. GRPO with little extra work.

This is a direct, practical modification rather than a wholesale new algorithm. The cross-setting comparison is useful because it shows where the idea helps and where it needs adjustment.

The soft spot is the premise that segments admit independent, verifiable rewards without adding substantial new error or cross-segment leakage. The MMSci case already shows that pure per-segment normalization falls short when segments are entangled, so the vector advantage may not stay clean in those regimes. Without the full details on how segments are identified and how the rewards are actually computed and verified, it is hard to judge how much of the reported lift comes from the decomposition versus other implementation choices.

This is for groups already running GRPO-style RL on multimodal models and looking for incremental fixes on long outputs. It has enough structure and comparative results to deserve peer review, even if the experiments will need closer checking on segment construction and reward sources.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Segment-Decomposed GRPO (SD-GRPO), which extends standard GRPO by z-normalizing verifiable per-segment rewards across a rollout group to produce a vector of per-segment advantages instead of a single scalar advantage. It evaluates the approach on three long-form VL tasks of increasing semantic entanglement: a controlled multi-panel captioning task from DOCCI (independent segments), a multi-chart VQA task from MultiChartQA (credit misattribution analyzed theoretically and empirically), and scientific figure captioning on MMSci (entangled subfigures requiring blending of holistic and per-segment rewards). The central claim is that SD-GRPO consistently outperforms GRPO baselines, with larger gains at higher segment counts on independent cases, and that the method integrates into existing GRPO frameworks such as Dr. GRPO with minimal overhead.

Significance. If the empirical gains and theoretical analysis of credit misattribution hold under scrutiny, the work offers a targeted improvement to credit assignment for long-form VL generation by exploiting natural output segmentation, a signal discarded by scalar-advantage methods. The low-overhead integration claim and the controlled-to-real-world progression of experiments are strengths that could make the technique broadly useful if the per-segment reward assumption generalizes.

major comments (2)

[Abstract / MMSci evaluation] Abstract and MMSci evaluation: the statement that 'per-segment normalization alone is insufficient when segments are semantically entangled' and that blending with holistic rewards is required directly qualifies the central claim; if computing the per-segment rewards in such cases necessarily incorporates cross-segment context or additional modeling error, the resulting advantage vector no longer isolates the intended credit assignment, undermining consistent outperformance across the three settings of increasing entanglement.
[MultiChartQA evaluation] MultiChartQA section: the theoretical claim that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length is load-bearing for motivating the vector advantage; without the explicit derivation or equations showing how z-normalization removes this scaling (as opposed to merely reweighting it), the empirical outperformance cannot be attributed to the proposed mechanism rather than other implementation choices.

minor comments (2)

[Abstract / Methods] The abstract refers to 'verifiable per-segment rewards' and 'semantically independent' segments without defining the reward functions or segmentation procedure; these should be specified with pseudocode or equations in the methods section to allow reproduction.
[DOCCI evaluation] No mention of statistical significance, number of runs, or variance for the 'consistently outperforms' and 'larger gains at higher segment counts' claims on DOCCI; tables or figures reporting these quantities are needed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and note planned revisions.

read point-by-point responses

Referee: [Abstract / MMSci evaluation] Abstract and MMSci evaluation: the statement that 'per-segment normalization alone is insufficient when segments are semantically entangled' and that blending with holistic rewards is required directly qualifies the central claim; if computing the per-segment rewards in such cases necessarily incorporates cross-segment context or additional modeling error, the resulting advantage vector no longer isolates the intended credit assignment, undermining consistent outperformance across the three settings of increasing entanglement.

Authors: We disagree that the statement undermines the central claim. The manuscript structures its evaluation explicitly around increasing entanglement levels, showing SD-GRPO gains on independent segments and further gains from blending on entangled ones; this progression demonstrates the method's flexibility rather than a limitation. Per-segment rewards are derived from independent verifiable signals without cross-segment context in the reward computation, and blending is applied afterward as an optional adjustment. We will revise the abstract and MMSci discussion to explicitly state that the core per-segment advantage vector remains isolated even under blending. revision: partial
Referee: [MultiChartQA evaluation] MultiChartQA section: the theoretical claim that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length is load-bearing for motivating the vector advantage; without the explicit derivation or equations showing how z-normalization removes this scaling (as opposed to merely reweighting it), the empirical outperformance cannot be attributed to the proposed mechanism rather than other implementation choices.

Authors: We agree that the theoretical section would benefit from greater explicitness. The current analysis derives the scaling of misattribution under scalar advantages and shows how per-segment z-normalization removes cross-segment terms, but we will expand the MultiChartQA section with the complete set of equations for the advantage computation and scaling behavior in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces SD-GRPO as an empirical algorithmic modification to GRPO: it defines per-segment advantages directly via z-normalization of verifiable rewards within rollout groups. No equations derive a result that reduces to the inputs by construction, no parameters are fitted on a subset and then renamed as predictions, and no self-citation chain or uniqueness theorem is invoked to force the method. Performance claims rest on direct comparisons against GRPO baselines across three external datasets (DOCCI, MultiChartQA, MMSci), with the segment-independence premise stated as an operating assumption rather than a derived necessity. The approach is therefore self-contained as a heuristic extension without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that natural, verifiable segment boundaries exist in long-form VL outputs and that cross-segment credit misattribution is the dominant failure mode of scalar GRPO.

axioms (1)

domain assumption Long-form VL outputs admit segmentation into parts with independent verifiable rewards
Invoked when the paper distinguishes 'semantically independent' vs 'entangled' segments and claims per-segment rewards are available.

pith-pipeline@v0.9.1-grok · 5855 in / 1295 out tokens · 25132 ms · 2026-06-28T10:29:41.678320+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 16 canonical work pages · 11 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005

2005
[3]

SciBERT: A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3615–3620, 2019

2019
[4]

R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep-Agent/R1-V, 2025

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep-Agent/R1-V, 2025

2025
[5]

Self-guided process reward optimization with redefined step-wise advantage for process reinforcement learning.arXiv preprint arXiv:2507.01551, 2025

Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, and Xiansheng Hua. Self-guided process reward optimization with redefined step-wise advantage for process reinforcement learning.arXiv preprint arXiv:2507.01551, 2025

work page arXiv 2025
[6]

Group-in-group policy optimization for LLM agent training.Advances in Neural Information Processing Systems, 2025

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training.Advances in Neural Information Processing Systems, 2025

2025
[7]

Segmental advantage estimation: Enhancing ppo for long-context llm training

Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, et al. Segmental advantage estimation: Enhancing ppo for long-context llm training. arXiv preprint arXiv:2601.07320, 2026

work page arXiv 2026
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Segment policy optimization: Effective segment- level credit assignment in RL for large language models.Advances in Neural Information Processing Systems, 2025

Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment- level credit assignment in RL for large language models.Advances in Neural Information Processing Systems, 2025

2025
[10]

mTORC2 signalling regulates M2 macrophage differentiation in response to helminth infection and adaptive thermogenesis.Nature communications, 8(1):14208, 2017

RW Hallowell, SL Collins, JM Craig, Y Zhang, M Oh, PB Illei, Y Chan-Li, CL Vigeland, W Mitzner, AL Scott, et al. mTORC2 signalling regulates M2 macrophage differentiation in response to helminth infection and adaptive thermogenesis.Nature communications, 8(1):14208, 2017

2017
[11]

SciCap: Generating captions for scientific figures

Ting-Yao Hsu, C Lee Giles, and Ting-Hao Huang. SciCap: Generating captions for scientific figures. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3258–3264, 2021

2021
[12]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaoshen Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.ArXiv, abs/2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

2019
[15]

Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding.Advances in Neural Information Processing Systems, 37:3229– 3242, 2024

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding.Advances in Neural Information Processing Systems, 37:3229– 3242, 2024

2024
[16]

Hybrid retrieval-generation reinforced agent for medical image report generation.Advances in neural information processing systems, 31, 2018

Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Hybrid retrieval-generation reinforced agent for medical image report generation.Advances in neural information processing systems, 31, 2018

2018
[17]

Mmsci: A dataset for graduate-level multi-discipline multimodal scientific understanding.arXiv preprint arXiv:2407.04903, 2024

Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, et al. Mmsci: A dataset for graduate-level multi-discipline multimodal scientific understanding.arXiv preprint arXiv:2407.04903, 2024

work page arXiv 2024
[18]

Photolatently modulable hydrogels using unilamellar titania nanosheets as photocatalytic crosslinkers.Nature Communications, 4 (1):2029, 2013

Mingjie Liu, Yasuhiro Ishida, Yasuo Ebina, Takayoshi Sasaki, and Takuzo Aida. Photolatently modulable hydrogels using unilamellar titania nanosheets as photocatalytic crosslinkers.Nature Communications, 4 (1):2029, 2013. 10

2029
[19]

Improved image captioning via policy gradient optimization of spider

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of spider. InProceedings of the IEEE international conference on computer vision, pages 873–881, 2017

2017
[20]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

DOCCI: Descriptions of Connected and Contrasting Images

Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, et al. DOCCI: Descriptions of Connected and Contrasting Images. InEuropean Conference on Computer Vision, pages 291–309, 2024

2024
[23]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

2002
[24]

Blockwise advantage estimation for multi-objective rl with verifiable rewards.arXiv preprint arXiv:2602.10231, 2026

Kirill Pavlenko, Alexander Golubev, Simon Karasik, and Boris Yangel. Blockwise advantage estimation for multi-objective rl with verifiable rewards.arXiv preprint arXiv:2602.10231, 2026

work page arXiv 2026
[25]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yi Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Self-critical sequence training for image captioning

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017

2017
[27]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero- Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700– 26709, 2024

2024
[30]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

2015
[31]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Bartscore: Evaluating generated text as text generation

Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in neural information processing systems, 34:27263–27277, 2021

2021
[33]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[34]

Describe each panel

Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, and Meng Jiang. MultiChartQA: Benchmarking vision-language models on multi-chart problems. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11341–11359, 2025. 11 A Implemen...

2025
[35]

Accuracy – does each subfigure caption correctly describe what the reference describes (apparatus, conditions, observed effect, statistical claims)? Penalize hallucinated or contradicted claims
[36]

Completeness – does the candidate cover the salient information the reference covers for each subfigure?
[37]

accuracy

Fluency – is the candidate’s language scientific, concise, and 18 grammatical? Anchor points (use0.5steps between): -5.0near-publication-quality; matches the reference faithfully on this axis -4.0substantively correct, minor omissions or stylistic differences -3.0partially correct; salient details missing or slightly off -2.0multiple errors or substantial...
[38]

Accuracy – which candidate more correctly describes each subfigure as the reference does (apparatus, conditions, observed effect, statistical claims)? Penalize hallucinated or contradicted claims
[39]

Completeness – which candidate covers more of the salient information present in the reference for each subfigure?
[40]

A", "B",

Fluency – which candidate’s language is more scientific, concise, and grammatical? Also give an overall preference: which candidate, all axes considered, would you prefer as the published caption? Return ONLY a JSON object with these exact keys (each value one of "A", "B", "tie"): {"accuracy": "A|B|tie", "completeness": "A|B|tie", "fluency": "A|B|tie", "o...

work page arXiv

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005

2005

[3] [3]

SciBERT: A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3615–3620, 2019

2019

[4] [4]

R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep-Agent/R1-V, 2025

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep-Agent/R1-V, 2025

2025

[5] [5]

Self-guided process reward optimization with redefined step-wise advantage for process reinforcement learning.arXiv preprint arXiv:2507.01551, 2025

Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, and Xiansheng Hua. Self-guided process reward optimization with redefined step-wise advantage for process reinforcement learning.arXiv preprint arXiv:2507.01551, 2025

work page arXiv 2025

[6] [6]

Group-in-group policy optimization for LLM agent training.Advances in Neural Information Processing Systems, 2025

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training.Advances in Neural Information Processing Systems, 2025

2025

[7] [7]

Segmental advantage estimation: Enhancing ppo for long-context llm training

Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, et al. Segmental advantage estimation: Enhancing ppo for long-context llm training. arXiv preprint arXiv:2601.07320, 2026

work page arXiv 2026

[8] [8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Segment policy optimization: Effective segment- level credit assignment in RL for large language models.Advances in Neural Information Processing Systems, 2025

Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment- level credit assignment in RL for large language models.Advances in Neural Information Processing Systems, 2025

2025

[10] [10]

mTORC2 signalling regulates M2 macrophage differentiation in response to helminth infection and adaptive thermogenesis.Nature communications, 8(1):14208, 2017

RW Hallowell, SL Collins, JM Craig, Y Zhang, M Oh, PB Illei, Y Chan-Li, CL Vigeland, W Mitzner, AL Scott, et al. mTORC2 signalling regulates M2 macrophage differentiation in response to helminth infection and adaptive thermogenesis.Nature communications, 8(1):14208, 2017

2017

[11] [11]

SciCap: Generating captions for scientific figures

Ting-Yao Hsu, C Lee Giles, and Ting-Hao Huang. SciCap: Generating captions for scientific figures. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3258–3264, 2021

2021

[12] [12]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaoshen Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.ArXiv, abs/2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

2019

[15] [15]

Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding.Advances in Neural Information Processing Systems, 37:3229– 3242, 2024

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding.Advances in Neural Information Processing Systems, 37:3229– 3242, 2024

2024

[16] [16]

Hybrid retrieval-generation reinforced agent for medical image report generation.Advances in neural information processing systems, 31, 2018

Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Hybrid retrieval-generation reinforced agent for medical image report generation.Advances in neural information processing systems, 31, 2018

2018

[17] [17]

Mmsci: A dataset for graduate-level multi-discipline multimodal scientific understanding.arXiv preprint arXiv:2407.04903, 2024

Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, et al. Mmsci: A dataset for graduate-level multi-discipline multimodal scientific understanding.arXiv preprint arXiv:2407.04903, 2024

work page arXiv 2024

[18] [18]

Photolatently modulable hydrogels using unilamellar titania nanosheets as photocatalytic crosslinkers.Nature Communications, 4 (1):2029, 2013

Mingjie Liu, Yasuhiro Ishida, Yasuo Ebina, Takayoshi Sasaki, and Takuzo Aida. Photolatently modulable hydrogels using unilamellar titania nanosheets as photocatalytic crosslinkers.Nature Communications, 4 (1):2029, 2013. 10

2029

[19] [19]

Improved image captioning via policy gradient optimization of spider

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Improved image captioning via policy gradient optimization of spider. InProceedings of the IEEE international conference on computer vision, pages 873–881, 2017

2017

[20] [20]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

DOCCI: Descriptions of Connected and Contrasting Images

Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, et al. DOCCI: Descriptions of Connected and Contrasting Images. InEuropean Conference on Computer Vision, pages 291–309, 2024

2024

[23] [23]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

2002

[24] [24]

Blockwise advantage estimation for multi-objective rl with verifiable rewards.arXiv preprint arXiv:2602.10231, 2026

Kirill Pavlenko, Alexander Golubev, Simon Karasik, and Boris Yangel. Blockwise advantage estimation for multi-objective rl with verifiable rewards.arXiv preprint arXiv:2602.10231, 2026

work page arXiv 2026

[25] [25]

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Yi Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Self-critical sequence training for image captioning

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017

2017

[27] [27]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions

Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero- Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26700– 26709, 2024

2024

[30] [30]

Cider: Consensus-based image description evaluation

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015

2015

[31] [31]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Bartscore: Evaluating generated text as text generation

Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in neural information processing systems, 34:27263–27277, 2021

2021

[33] [33]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[34] [34]

Describe each panel

Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, and Meng Jiang. MultiChartQA: Benchmarking vision-language models on multi-chart problems. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11341–11359, 2025. 11 A Implemen...

2025

[35] [35]

Accuracy – does each subfigure caption correctly describe what the reference describes (apparatus, conditions, observed effect, statistical claims)? Penalize hallucinated or contradicted claims

[36] [36]

Completeness – does the candidate cover the salient information the reference covers for each subfigure?

[37] [37]

accuracy

Fluency – is the candidate’s language scientific, concise, and 18 grammatical? Anchor points (use0.5steps between): -5.0near-publication-quality; matches the reference faithfully on this axis -4.0substantively correct, minor omissions or stylistic differences -3.0partially correct; salient details missing or slightly off -2.0multiple errors or substantial...

[38] [38]

Accuracy – which candidate more correctly describes each subfigure as the reference does (apparatus, conditions, observed effect, statistical claims)? Penalize hallucinated or contradicted claims

[39] [39]

Completeness – which candidate covers more of the salient information present in the reference for each subfigure?

[40] [40]

A", "B",

Fluency – which candidate’s language is more scientific, concise, and grammatical? Also give an overall preference: which candidate, all axes considered, would you prefer as the published caption? Return ONLY a JSON object with these exact keys (each value one of "A", "B", "tie"): {"accuracy": "A|B|tie", "completeness": "A|B|tie", "fluency": "A|B|tie", "o...

work page arXiv