ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Haochen Wang; Hongyang Tang; Jiacheng Chen; Rongxin Guo; Shaoxiang Chen; Tianle Li; Xuyang Shen; Yan Ma; Yu Cheng; Yucong Zhou

arxiv: 2605.20278 · v1 · pith:RLZYC2GBnew · submitted 2026-05-19 · cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-21 08:32 UTC · model grok-4.3

keywords image captioning · reinforcement learning · hallucination · claim verification · fine-grained rewards · vision-language models · multimodal evaluation · factuality

The pith

ClaimDiff-RL uses verified differences between individual visual claims as the reward unit in reinforcement learning for image captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-form image captions are hard to train because standard rewards judge whole sequences at once and hide whether errors come from invented details or omitted facts. The paper introduces a method that compares an actor caption to a reference, breaks the differences into atomic claims, and has a multimodal judge check each one against the actual image while labeling its error type and severity. This produces separate, tunable signals for hallucination and missing content instead of a single compressed score. Experiments on diagnostic sets and standard benchmarks show the approach reaches operating points that reduce hallucinations without increasing omissions as much as holistic rewards do. The same training preserves overall model ability and improves results on specific tasks like counting objects and describing spatial relations.

Core claim

ClaimDiff-RL replaces sequence-level scalar rewards with reference-conditioned atomic claim differences. A multimodal judge enumerates visually grounded differences between the generated caption and a reference caption, verifies each difference directly against the image, assigns open-vocabulary error types and severity levels, and supplies per-difference statistics that are composed into the reinforcement learning reward. This decomposition makes hallucinated claims and omitted salient facts separately measurable and adjustable, exposing the faithfulness-coverage tradeoff that holistic rewards obscure and enabling more balanced captioning models.
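
As a rough illustration of the decomposition described above, the sketch below composes per-difference judgments into a scalar reward with separate weights for hallucination and omission. The `VerifiedDifference` fields, the severity names and weights, the default `w_halluc` and `w_missing`, and the linear penalty are all assumptions made for illustration; this summary does not give the paper's actual reward formula.

```python
from dataclasses import dataclass

# Hypothetical severity scale: the source says only that the judge assigns
# "severity levels", so these names and weights are placeholders.
SEVERITY_WEIGHT = {"minor": 1.0, "major": 2.0, "critical": 3.0}


@dataclass
class VerifiedDifference:
    kind: str        # "hallucination" or "missing_fact" (assumed labels)
    error_type: str  # open-vocabulary error type, e.g. "count"
    severity: str    # key into SEVERITY_WEIGHT


def compose_reward(diffs, w_halluc=1.0, w_missing=0.5):
    """Return (reward, hallucination_penalty, missing_penalty).

    Assumed form: a negative weighted sum of severity-weighted errors,
    with the two error families kept separate so each can be tuned.
    """
    halluc = sum(SEVERITY_WEIGHT[d.severity] for d in diffs if d.kind == "hallucination")
    missing = sum(SEVERITY_WEIGHT[d.severity] for d in diffs if d.kind == "missing_fact")
    reward = -(w_halluc * halluc + w_missing * missing)
    return reward, halluc, missing


diffs = [
    VerifiedDifference("hallucination", "count", "major"),
    VerifiedDifference("missing_fact", "attribute", "minor"),
]
reward, halluc, missing = compose_reward(diffs)
print(reward, halluc, missing)  # -2.5 2.0 1.0
```

Keeping the two penalties as separate return values is what makes the tradeoff inspectable: raising `w_missing` relative to `w_halluc` shifts the operating point toward coverage without changing the judge.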

What carries the argument

The multimodal judge that enumerates visually grounded differences between actor and reference captions, verifies each against the image, and assigns open-vocabulary error types plus severity levels to generate per-difference statistics for reward composition.

If this is right

Holistic scalar rewards can reduce hallucination at the cost of more missing facts, while claim-difference rewards allow training to reach better-balanced points on both dimensions.
On the 160-image human-labeled diagnostic benchmark, the method improves the measured hallucination–missing-fact balance compared with scalar-reward baselines.
General captioning and VQA performance on public benchmarks is preserved rather than degraded.
The resulting models surpass Gemini-3-Pro-Preview on several fine-grained Capability dimensions, including object counting, spatial relations, and scene recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same claim-difference machinery could be applied to other generation domains where local factual errors matter, such as video description or medical report writing.
If the judge remains reliable, training logs could expose which specific error types the model is still making, guiding targeted data collection.
Captioning models trained this way may develop stronger internal verification habits that transfer to new images without references.
Future systems might embed similar claim verification steps at inference time to self-correct before final output.

Load-bearing premise

The multimodal judge can reliably enumerate, verify against the image, and correctly type and score the differences without introducing its own systematic errors or biases.

What would settle it

A side-by-side human evaluation on held-out caption pairs showing that the judge's error-type assignments and verification decisions disagree with humans at rates high enough to reverse the reported balance improvements on the 160-image diagnostic benchmark.

Figures

Figures reproduced from arXiv: 2605.20278 by Haochen Wang, Hongyang Tang, Jiacheng Chen, Rongxin Guo, Shaoxiang Chen, Tianle Li, Xuyang Shen, Yan Ma, Yu Cheng, Yucong Zhou.

**Figure 1.** Overview of CLAIMDIFF-RL. Unlike direct scalar judging, CLAIMDIFF-RL verifies actor–reference visual differences against the image and composes typed side-specific errors into scalar rewards, making the hallucination–coverage tradeoff explicit.

**Figure 2.** Overview of CLAIMDIFF-RL. Actor–reference differences are verified against the image to produce side-specific typed errors, which are composed into relative or actor-only scalar rewards for group-normalized RL optimization. Each difference d_i contains a visual aspect, the actor-side claim, the reference-side claim, an image-grounded judgment, and side-specific error descriptions.

**Figure 3.** Hallucination and missing-fact trends across RL training steps.

**Figure 4.** Training dynamics of reward, response length, and reference-side weighted errors.
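
The Figure 2 caption lists the fields of a difference record and says the scalar rewards feed group-normalized RL optimization. A minimal sketch of both follows. The field names in `ClaimDifference` are paraphrased from the caption, and the normalization shown (subtract the group mean, divide by the group standard deviation) is the common GRPO-style recipe, assumed here rather than confirmed by the source.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ClaimDifference:
    # Field names are illustrative; the caption names the contents only.
    aspect: str           # visual aspect under comparison
    actor_claim: str      # what the actor caption says
    reference_claim: str  # what the reference caption says
    judgment: str         # image-grounded judgment
    actor_error: str      # actor-side error description ("" if none)
    reference_error: str  # reference-side error description ("" if none)


def group_normalize(rewards, eps=1e-6):
    """Normalize rewards of captions sampled for the same image.

    Assumed GRPO-style form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_normalize([-3.0, -1.0, -2.0, -2.0])
print(advantages)
```

With this form, a group in which every caption receives the same reward yields all-zero advantages, so such groups contribute no gradient signal.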

Abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness–coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination–missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ClaimDiff-RL breaks caption rewards into per-claim differences to expose the hallucination-missing-fact tradeoff, but the whole approach rests on an uncalibrated multimodal judge.

The main takeaway is that ClaimDiff-RL uses differences in individual visual claims between a generated caption and a reference as the reward signal for RL. This gives finer control over hallucinations versus missing details than whole-caption rewards do. The paper shows how standard scalar rewards often fix one problem by worsening the other, while the claim-based approach finds more balanced results. The authors report gains on a small diagnostic set and some public benchmarks, including better performance on counting and spatial relations than a strong baseline, Gemini-3-Pro-Preview.

What stands out is the explicit separation of error types in the reward composition. That framing is distinct from prior work on preference optimization or scalar factuality scores.

The weak point is the multimodal judge that generates these claims and labels. The description does not include any human validation of the judge's accuracy or consistency on the diagnostic images. Without that, it is hard to know whether the improvements come from better RL or from the judge's particular biases in what it flags as differences and how it scores severity. The abstract also gives no details on how the reward weights are set and reports no statistical tests.

This work is for people building RL systems for vision-language generation who care about diagnosable rewards. It is worth sending to peer review because the granularity issue is real and the proposed unit is a concrete step forward, though the judge's reliability needs more evidence to make the claims stick.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ClaimDiff-RL, a reinforcement learning framework for long-form image captioning that addresses reward granularity by using reference-conditioned atomic claim differences as the reward unit. A multimodal judge enumerates visually grounded differences between an actor caption and a reference caption, verifies each against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This enables separate measurement and tuning of hallucination and missing-fact errors. Experiments on a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA tasks show improved hallucination-missing-fact balance, preserved general capability, and outperformance of Gemini-3-Pro-Preview on fine-grained dimensions such as object counting, spatial relations, and scene recognition.

Significance. If the multimodal judge proves reliable, the framework could meaningfully advance fine-grained RL for captioning by replacing holistic scalar or preference-based rewards with typed, verifiable claim-level signals. The explicit exposure and balancing of the faithfulness-coverage tradeoff, along with targeted gains on specific visual capabilities, would represent a useful contribution to multimodal generation. The approach's diagnosability is a strength, but its significance depends on demonstrating that improvements stem from the method rather than judge artifacts.

major comments (2)

[Abstract] The reported improvements on the 160-image diagnostic benchmark and public sets provide no quantitative details on judge accuracy, reward composition weights, or statistical significance. Without these, it is difficult to determine whether the claimed gains in hallucination–missing-fact balance are robust or attributable to the proposed RL objective.
[Method overview (multimodal judge component)] The reward signal is constructed directly from the judge's per-difference outputs (enumeration, image verification, error typing, severity). No calibration against human annotations on the diagnostic set, inter-annotator agreement metrics, or ablation studies on judge model choice are described. If the judge systematically over- or under-counts certain relations or objects, the RL process would amplify those biases rather than optimize true visual claims.
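
The calibration this comment asks for could be checked with a chance-corrected agreement statistic between judge labels and human labels. The sketch below computes Cohen's kappa in plain Python; the label strings and the toy data are hypothetical, and the source does not say which agreement metric, if any, the authors use.

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Cohen's kappa between two label sequences of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(judge_labels)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement under independence, from the marginal frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data: "H" = hallucination, "S" = supported (hypothetical labels).
judge = ["H", "H", "S", "S"]
human = ["H", "S", "S", "S"]
print(cohen_kappa(judge, human))  # 0.5
```

Reporting kappa per error type, rather than one pooled value, would show whether the judge is weaker on specific categories such as counts or spatial relations.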

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete numerical result (e.g., delta in hallucination rate or missing-fact rate) to support the balance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications based on the manuscript and committing to revisions that strengthen the presentation of our results and method without altering the core claims.

Referee: [Abstract] The reported improvements on the 160-image diagnostic benchmark and public sets provide no quantitative details on judge accuracy, reward composition weights, or statistical significance. Without these, it is difficult to determine whether the claimed gains in hallucination–missing-fact balance are robust or attributable to the proposed RL objective.

Authors: We agree that the abstract would benefit from these quantitative details to better contextualize the improvements. The 160-image diagnostic benchmark is human-labeled precisely to support such evaluation, and the main text reports reward composition and performance metrics. In the revision we will expand the abstract to include judge accuracy figures (e.g., agreement with human labels), the exact reward weights used, and statistical significance results (e.g., paired t-test p-values) for the hallucination–missing-fact balance improvements.
revision: yes

Referee: [Method overview (multimodal judge component)] The reward signal is constructed directly from the judge's per-difference outputs (enumeration, image verification, error typing, severity). No calibration against human annotations on the diagnostic set, inter-annotator agreement metrics, or ablation studies on judge model choice are described. If the judge systematically over- or under-counts certain relations or objects, the RL process would amplify those biases rather than optimize true visual claims.

Authors: The diagnostic benchmark was collected with human annotations specifically to enable calibration of the judge outputs. We will add a new subsection that reports calibration results against these human labels, inter-annotator agreement statistics, and an ablation on judge model choice (comparing at least two VLMs). We will also expand the discussion to address potential systematic biases, noting that every difference is image-verified and that the typed, per-claim nature of the reward allows post-hoc inspection. These additions will directly respond to the concern about bias amplification.
revision: yes
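
The rebuttal mentions paired t-tests on the diagnostic benchmark. A minimal sketch of the paired t statistic follows, using only the standard library. The per-image error counts are made-up toy numbers, and turning the statistic into a p-value would need a Student-t CDF (for example from a statistics package), which is deliberately left out here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for paired per-image measurements."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Toy per-image hallucination counts (hypothetical numbers).
scalar_reward_model = [3, 4, 5, 6]
claimdiff_model = [2, 2, 4, 4]
t_stat, dof = paired_t_statistic(scalar_reward_model, claimdiff_model)
print(round(t_stat, 4), dof)  # -5.1962 3
```

Pairing by image matters because per-image difficulty varies widely; an unpaired test on the same numbers would be less sensitive.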

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces ClaimDiff-RL as a methodological framework that decomposes caption rewards into per-claim differences produced by a multimodal judge. This is presented as an engineering choice for granularity rather than a mathematical derivation or prediction that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains that bear the central load are visible in the abstract or described text. The reported improvements rest on empirical comparisons against external benchmarks and a human-labeled diagnostic set, keeping the approach self-contained against outside evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on the existence and reliability of a multimodal judge that can perform open-vocabulary claim verification; no free parameters or invented entities are explicitly quantified in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1050 out tokens · 36146 ms · 2026-05-21T08:32:46.131828+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

CLAIMDIFF-RL improves the hallucination–missing-fact balance... on object counting, spatial relations, and scene recognition

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 10 internal anchors

[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. SPICE: Semantic propositional image caption evaluation. ArXiv, abs/1607.08822, 2016.
[2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL Technical Report. arXiv, 2025.
[3] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005.
[4] David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John F. Canny. CLAIR: Evaluating image captions with large language models. ArXiv, abs/2310.12971, 2023.
[5] Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Wei ye Zhou, Yun Lu, Tongliang Li, Wenhao Huang, and Zhoujun Li. SimpleVQA: Multimodal factuality evaluation for multimodal large language models. 2025 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4637–4646, 2025.
[6] Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, and Haoyuan Guo. Benchmarking and improving detail image caption. ArXiv, abs/2405.19092, 2024.
[7] Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxing Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv, 2024.
[8] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. ArXiv, abs/2404.12390, 2024.
[9] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani S. Marathe, Stephen O. Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, ... DataComp: In search of the next generation of multimodal datasets. arXiv, 2023.
[10] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recog...
[11] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. ArXiv, abs/2104.08718, 2021.
[12] Max W.F. Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation. ArXiv, abs/2312.14867, 2023.
[13] Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. Prometheus-Vision: Vision-language model as a judge for fine-grained evaluation. In Annual Meeting of the Association for Computational Linguistics, 2024.
[14] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing, 2023.
[15] Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, and Yin Cui. Describe anything: Detailed localized image and video captioning. ArXiv, abs/2504.16072, 2025.
[16] Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, and Hongtao Xie. Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness, 2025.
[17] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with human feedback. arXiv, 2022.
[18] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002.
[19] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. ArXiv, abs/2305.18290, 2023.
[20] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning, 2017.
[21] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Conference on Empirical Methods in Natural Language Processing, 2018.
[22] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
[23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, Mingchuan Zhang, Y. K. Li, Yu Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. ArXiv, abs/2402.03300, 2024.
[24] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale... 2026.
[25] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. ArXiv, abs/2309.14525, 2023.
[26] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Ba... 2025.
[27] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2014.
[28] Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, and Zhaoxiang Zhang. Grasp any region: Towards precise, contextual pixel understanding for multimodal LLMs. ArXiv, abs/2510.18876, 2025.
[29] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. ArXiv, abs/2408.15556, 2024.
[30] Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, and Lijuan Wang. ViCrit: A verifiable reinforcement learning proxy task for visual perception in VLMs, 2025.
[31] xAI. RealWorldQA. https://huggingface.co/datasets/xai-org/RealworldQA, 2024. Hugging Face dataset.
[32] Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, and Dahua Lin. CapRL: Stimulating dense image caption capabilities via reinforcement learning, 2025.
[33] Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, and Chenfeng Xu. CaptionQA: Is your caption as useful as the image itself? ArXiv, abs/2511.21025, 2025.
[34] Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. SC-Captioner: Improving image captioning with self-correction by reinforcement learning, 2025.
[35] Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, and Leo Schwinn. FOCUS: Internal MLLM representations for efficient fine-grained visual question answering, 2025.

A Limitations
Dependence on strong multimodal judges. CLAIMDIFF-RL relies on a strong multimodal judge to identify actor–reference difference...
[40] Clarity and specificity without unnecessary ambiguity or repetition. Important rules:
- A correct detail in the actor caption should not be penalized merely because it is absent from the reference.
- A detail in the reference should not be rewarded unless it is supported by the image.
- Penalize hallucination more than omission.
- Penalize strategic hedgi...
[41] Visual factual correctness
[42] Coverage of salient image content
[43] Correct attributes, counts, spatial relations, OCR/text, and identities
[44] Avoidance of hallucinated objects, attributes, or relations
[45] Clarity and specificity without unnecessary ambiguity or repetition. Important rules:
- Penalize hallucination more than omission.
- Penalize strategic hedging when the image evidence is clear.
- Do not reward length by itself.
- Do not reward flowery style by itself.
Actor caption: {actor_caption}
Return exactly this format: SCORE: <integer from 0 to 10>...
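
The holistic scalar judge prompt excerpted above asks for a line of the form `SCORE: <integer from 0 to 10>`. A small, defensive parser for that line might look like the sketch below. The function name and the choice to return `None` on malformed output are assumptions, and the rest of the output format is truncated in the source, so only the `SCORE` line is handled.

```python
import re

# Matches a line such as "SCORE: 7"; anything after the SCORE line is ignored.
SCORE_RE = re.compile(r"^SCORE:\s*(\d+)\s*$", re.MULTILINE)


def parse_score(judge_output):
    """Return the integer score in [0, 10], or None if it cannot be parsed."""
    match = SCORE_RE.search(judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 10 else None


print(parse_score("SCORE: 7"))   # 7
print(parse_score("SCORE: 11"))  # None
```

Returning `None` rather than a default score lets the training loop decide how to treat unparseable judge outputs, for example by resampling or dropping the rollout.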
[46] Difference detection. The judge compares R and C and emits each detected difference as one of three types: contradiction, extra_info, or missing_fact. A contradiction is a candidate claim that conflicts with the reference. An extra_info item is a candidate claim not mentioned in the reference. A missing_fact is a fact in the reference that is absent from t...
[47]

missing" and is_hallucination=false Strict mapping rules for is_hallucination: - type=

Image verification.For each contradiction or extra_info item, the judge verifies the candidate-side claim against the image and assigns one of three verification labels: verified, false, or ambiguous. The label verified means the image supports the candidate claim, false means the image contradicts it, and ambiguous means the image is insufficiently infor...

[48]

Only judge visually grounded content

[49]

Do not penalize missing minor background details

[50]

Do not hallucinate facts in the checklist

[51]

For hallucination, judge only claims made by the caption

[52]

For missing facts, judge only important image facts

[53]

If visual evidence is ambiguous, mark UNCERTAIN

[54]

Keep all claims atomic.

Model Caption: {pred_caption}

Return your answer in valid JSON format with this structure:
{ "claims": [ { "claim": "<atomic claim from caption>", "aspect": "<object | attribute | count | spatial | action | text_ocr | identity | scene | style | other>", "judgment": "<SUPPORTED | HALLUCINATION | UNCERTAIN>", "evidence": "<brief visu...
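A reply that follows this structure can be summarized per caption. The following is a minimal sketch, assuming the reply is valid JSON with a top-level `claims` list whose items carry a `judgment` field as shown above; the function name `summarize_claims` and the definition of `hallucination_rate` as the fraction of all claims judged HALLUCINATION are assumptions for illustration.

```python
import json
from collections import Counter
from typing import Dict

JUDGMENTS = ("SUPPORTED", "HALLUCINATION", "UNCERTAIN")


def summarize_claims(reply: str) -> Dict[str, float]:
    """Count per-claim judgments in a JSON judge reply and derive a hallucination rate."""
    data = json.loads(reply)
    counts = Counter(claim["judgment"] for claim in data["claims"])
    unknown = set(counts) - set(JUDGMENTS)
    if unknown:
        raise ValueError(f"unexpected judgment values: {sorted(unknown)}")
    total = sum(counts.values())
    return {
        "num_claims": total,
        "num_supported": counts["SUPPORTED"],
        "num_hallucination": counts["HALLUCINATION"],
        "num_uncertain": counts["UNCERTAIN"],
        # Assumed definition: share of all claims judged HALLUCINATION.
        "hallucination_rate": counts["HALLUCINATION"] / total if total else 0.0,
    }
```

An empty claim list yields a rate of 0.0 rather than a division error, so very short captions do not need special handling by the caller.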


[1]

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. ArXiv, abs/1607.08822, 2016

[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

[3]

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005

[4]

David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John F. Canny. Clair: Evaluating image captions with large language models. ArXiv, abs/2310.12971, 2023

[5]

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Wei ye Zhou, Yun Lu, Tongliang Li, Wenhao Huang, and Zhoujun Li. Simplevqa: Multimodal factuality evaluation for multimodal large language models. 2025 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4637–4646, 2025

[6]

Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, and Haoyuan Guo. Benchmarking and improving detail image caption. ArXiv, abs/2405.19092, 2024

[7]

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxing Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vi...

[8]

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. ArXiv, abs/2404.12390, 2024

[9]

Datacomp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani S. Marathe, Stephen O. Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, ...

[10]

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusion-bench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recog...

[11]

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. ArXiv, abs/2104.08718, 2021

[12]

Max W.F. Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. ArXiv, abs/2312.14867, 2023

[13]

Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In Annual Meeting of the Association for Computational Linguistics, 2024

[14]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing, 2023

[15]

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, and Yin Cui. Describe anything: Detailed localized image and video captioning. ArXiv, abs/2504.16072, 2025

[16]

Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, and Hongtao Xie. Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness, 2025

[17]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with h...

[18]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002

[19]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. ArXiv, abs/2305.18290, 2023

[20]

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning, 2017

[21]

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Conference on Empirical Methods in Natural Language Processing, 2018

[22]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

[23]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, Mingchuan Zhang, Y. K. Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ArXiv, abs/2402.03300, 2024

[24]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

[25]

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented rlhf. ArXiv, abs/2309.14525, 2023

[26]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Ba...

[27]

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2014

[28]

Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, and Zhaoxiang Zhang. Grasp any region: Towards precise, contextual pixel understanding for multimodal llms. ArXiv, abs/2510.18876, 2025

[29]

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. ArXiv, abs/2408.15556, 2024

[30]

Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, and Lijuan Wang. Vicrit: A verifiable reinforcement learning proxy task for visual perception in vlms, 2025

[31]

xAI. Realworldqa. https://huggingface.co/datasets/xai-org/RealworldQA, 2024. Hugging Face dataset

[32]

Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, and Dahua Lin. Caprl: Stimulating dense image caption capabilities via reinforcement learning, 2025

[33]

Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, and Chenfeng Xu. Captionqa: Is your caption as useful as the image itself? ArXiv, abs/2511.21025, 2025

[34]

Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025

[35]

Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, and Leo Schwinn. Focus: Internal mllm representations for efficient fine-grained visual question answering, 2025.

A Limitations

Dependence on strong multimodal judges. CLAIMDIFF-RL relies on a strong multimodal judge to identify actor–reference difference...
