ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
Pith reviewed 2026-05-21 08:32 UTC · model grok-4.3
The pith
ClaimDiff-RL uses verified differences between individual visual claims as the reward unit in reinforcement learning for image captions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ClaimDiff-RL replaces sequence-level scalar rewards with reference-conditioned atomic claim differences. A multimodal judge enumerates visually grounded differences between the generated caption and a reference caption, verifies each difference directly against the image, assigns open-vocabulary error types and severity levels, and supplies per-difference statistics that are composed into the reinforcement learning reward. This decomposition makes hallucinated claims and omitted salient facts separately measurable and adjustable, exposing the faithfulness-coverage tradeoff that holistic rewards obscure and enabling more balanced captioning models.
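The composition step described above can be pictured with a minimal Python sketch. It is not the paper's reward function. The difference types and verification labels follow the judge description quoted further down this page, but the integer severity scale, the two weights, the rule that only image-verified missing facts are penalized, and the names `Difference`, `is_hallucination`, and `compose_reward` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difference:
    kind: str          # "contradiction", "extra_info", or "missing_fact"
    verification: str  # "verified", "false", or "ambiguous"
    severity: int      # hypothetical scale: 1 (minor) to 3 (severe)


def is_hallucination(d: Difference) -> bool:
    # Assumption: a candidate-side claim the image contradicts is a hallucination.
    return d.kind in ("contradiction", "extra_info") and d.verification == "false"


def compose_reward(diffs, w_halluc=1.0, w_missing=0.5):
    # Severity-weighted penalties, kept separate per error family so that
    # the two weights can be tuned independently (hypothetical weights).
    halluc = sum(d.severity for d in diffs if is_hallucination(d))
    # Assumption: a missing fact only counts when the image supports it.
    missing = sum(
        d.severity
        for d in diffs
        if d.kind == "missing_fact" and d.verification == "verified"
    )
    return -(w_halluc * halluc + w_missing * missing)


diffs = [
    Difference("extra_info", "false", 2),      # hallucinated detail
    Difference("extra_info", "verified", 1),   # correct extra detail, no penalty
    Difference("missing_fact", "verified", 3), # salient omission
]
print(compose_reward(diffs))
```

Raising `w_missing` relative to `w_halluc` shifts the penalty toward omissions, which is the kind of separate adjustment the core claim refers to.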
What carries the argument
The multimodal judge that enumerates visually grounded differences between actor and reference captions, verifies each against the image, and assigns open-vocabulary error types plus severity levels to generate per-difference statistics for reward composition.
If this is right
- Holistic scalar rewards can reduce hallucination at the cost of more missing facts, while claim-difference rewards let training reach better-balanced points on both dimensions.
- On the 160-image human-labeled diagnostic benchmark, the method improves the measured hallucination-missing-fact balance compared with scalar-reward baselines.
- General captioning and VQA performance on public benchmarks is preserved rather than degraded.
- The resulting models surpass Gemini-3-Pro-Preview on several fine-grained Capability dimensions, such as object counting, spatial relations, and scene recognition.
Where Pith is reading between the lines
- The same claim-difference machinery could be applied to other generation domains where local factual errors matter, such as video description or medical report writing.
- If the judge remains reliable, training logs could expose which specific error types the model is still making, guiding targeted data collection.
- Captioning models trained this way may develop stronger internal verification habits that transfer to new images without references.
- Future systems might embed similar claim verification steps at inference time to self-correct before final output.
Load-bearing premise
The multimodal judge can reliably enumerate, verify against the image, and correctly type and score the differences without introducing its own systematic errors or biases.
What would settle it
A side-by-side human evaluation on held-out caption pairs showing that the judge's error-type assignments and verification decisions disagree with humans at rates high enough to reverse the reported balance improvements on the 160-image diagnostic benchmark.
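One way such a comparison could be scored is with raw disagreement plus a chance-corrected agreement statistic. The sketch below computes Cohen's kappa in plain Python. The label lists are made-up illustrative values, not data from the paper, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative verification labels only (not from the paper).
judge = ["verified", "false", "false", "verified", "ambiguous", "false"]
human = ["verified", "false", "verified", "verified", "ambiguous", "false"]

disagreement = sum(j != h for j, h in zip(judge, human)) / len(judge)
print(disagreement, cohen_kappa(judge, human))
```

The same function applies to error-type assignments, as long as the open-vocabulary types are first mapped to a shared label set.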
Original abstract
Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness and coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination–missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ClaimDiff-RL, a reinforcement learning framework for long-form image captioning that addresses reward granularity by using reference-conditioned atomic claim differences as the reward unit. A multimodal judge enumerates visually grounded differences between an actor caption and a reference caption, verifies each against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This enables separate measurement and tuning of hallucination and missing-fact errors. Experiments on a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA tasks show improved hallucination-missing-fact balance, preserved general capability, and outperformance of Gemini-3-Pro-Preview on fine-grained dimensions such as object counting, spatial relations, and scene recognition.
Significance. If the multimodal judge proves reliable, the framework could meaningfully advance fine-grained RL for captioning by replacing holistic scalar or preference-based rewards with typed, verifiable claim-level signals. The explicit exposure and balancing of the faithfulness-coverage tradeoff, along with targeted gains on specific visual capabilities, would represent a useful contribution to multimodal generation. The approach's diagnosability is a strength, but its significance depends on demonstrating that improvements stem from the method rather than judge artifacts.
major comments (2)
- [Abstract] The reported improvements on the 160-image diagnostic benchmark and public sets provide no quantitative details on judge accuracy, reward composition weights, or statistical significance. Without these, it is difficult to determine whether the claimed gains in hallucination-missing-fact balance are robust or attributable to the proposed RL objective.
- [Method overview (multimodal judge component)] The reward signal is constructed directly from the judge's per-difference outputs (enumeration, image verification, error typing, severity). No calibration against human annotations on the diagnostic set, inter-annotator agreement metrics, or ablation studies on judge model choice are described. If the judge systematically over- or under-counts certain relations or objects, the RL process would amplify those biases rather than optimize true visual claims.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one concrete numerical result (e.g., delta in hallucination rate or missing-fact rate) to support the balance claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications based on the manuscript and committing to revisions that strengthen the presentation of our results and method without altering the core claims.
Point-by-point responses
- Referee: [Abstract] The reported improvements on the 160-image diagnostic benchmark and public sets provide no quantitative details on judge accuracy, reward composition weights, or statistical significance. Without these, it is difficult to determine whether the claimed gains in hallucination-missing-fact balance are robust or attributable to the proposed RL objective.
  Authors: We agree that the abstract would benefit from these quantitative details to better contextualize the improvements. The 160-image diagnostic benchmark is human-labeled precisely to support such evaluation, and the main text reports reward composition and performance metrics. In the revision we will expand the abstract to include judge accuracy figures (e.g., agreement with human labels), the exact reward weights used, and statistical significance results (e.g., paired t-test p-values) for the hallucination-missing-fact balance improvements.
  Revision: yes
- Referee: [Method overview (multimodal judge component)] The reward signal is constructed directly from the judge's per-difference outputs (enumeration, image verification, error typing, severity). No calibration against human annotations on the diagnostic set, inter-annotator agreement metrics, or ablation studies on judge model choice are described. If the judge systematically over- or under-counts certain relations or objects, the RL process would amplify those biases rather than optimize true visual claims.
  Authors: The diagnostic benchmark was collected with human annotations specifically to enable calibration of the judge outputs. We will add a new subsection that reports calibration results against these human labels, inter-annotator agreement statistics, and an ablation on judge model choice (comparing at least two VLMs). We will also expand the discussion to address potential systematic biases, noting that every difference is image-verified and that the typed, per-claim nature of the reward allows post-hoc inspection. These additions will directly respond to the concern about bias amplification.
  Revision: yes
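The rebuttal mentions paired significance testing on the diagnostic benchmark. As a rough illustration of what a paired test on per-image error counts looks like, here is a small Python sketch. It deliberately swaps in an exact sign-flip permutation test instead of the paired t-test named in the rebuttal, so that it runs on the standard library alone. The counts are invented for illustration and are not results from the paper; `paired_sign_flip_pvalue` is a hypothetical helper name.

```python
from itertools import product


def paired_sign_flip_pvalue(a, b, max_pairs=20):
    """Two-sided exact sign-flip permutation test on paired samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal length")
    if len(a) > max_pairs:
        raise ValueError("too many pairs for exact enumeration; subsample or use Monte Carlo")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total


# Invented per-image hallucination counts (illustrative only).
baseline = [3, 2, 4, 1, 3, 2, 5, 2]
claim_level = [2, 1, 3, 1, 2, 1, 3, 1]
print(paired_sign_flip_pvalue(baseline, claim_level))
```

Exact enumeration costs 2^n evaluations, so for all 160 diagnostic images a Monte Carlo variant or a parametric paired test would be needed.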
Circularity Check
No circularity in derivation chain
The paper introduces ClaimDiff-RL as a methodological framework that decomposes caption rewards into per-claim differences produced by a multimodal judge. This is presented as an engineering choice for granularity rather than a mathematical derivation or prediction that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains that bear the central load are visible in the abstract or described text. The reported improvements rest on empirical comparisons against external benchmarks and a human-labeled diagnostic set, keeping the approach self-contained against outside evaluation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "CLAIMDIFF-RL improves the hallucination–missing-fact balance... on object counting, spatial relations, and scene recognition"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
SPICE: Semantic Propositional Image Caption Evaluation
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. ArXiv, abs/1607.08822, 2016
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
-
[3]
Meteor: An automatic metric for mt evaluation with improved correlation with human judgments
Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In IEEvaluation@ACL, 2005
- [4]
-
[5]
Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Wei ye Zhou, Yun Lu, Tongliang Li, Wenhao Huang, and Zhoujun Li. Simplevqa: Multimodal factuality evaluation for multimodal large language models. 2025 IEEE/CVF International Conference on...
-
[6]
Benchmarking and improving detail image caption
Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, and Haoyuan Guo. Benchmarking and improving detail image caption. ArXiv, abs/2405.19092, 2024
-
[7]
Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxing Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vi...
-
[8]
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. ArXiv, abs/2404.12390, 2024
-
[9]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani S. Marathe, Stephen O. Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, ...
-
[10]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recog...
-
[11]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. ArXiv, abs/2104.08718, 2021
-
[12]
Viescore: Towards explainable metrics for conditional image synthesis evaluation
Max W.F. Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. ArXiv, abs/2312.14867, 2023
-
[13]
Prometheus-vision: Vision-language model as a judge for fine-grained evaluation
Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, and Minjoon Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In Annual Meeting of the Association for Computational Linguistics, 2024
-
[14]
Evaluating object hallucination in large vision-language models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing, 2023
-
[15]
Describe anything: Detailed localized image and video captioning
Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, and Yin Cui. Describe anything: Detailed localized image and video captioning. ArXiv, abs/2504.16072, 2025
-
[16]
Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, and Hongtao Xie. Capability: A comprehensive visual caption benchmark for evaluating both correctness and thoroughness, 2025
-
[17]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with h...
-
[18]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, 2002
-
[19]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. ArXiv, abs/2305.18290, 2023
-
[20]
Self-critical sequence training for image captioning
Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning, 2017
-
[21]
Object hallucination in image captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Conference on Empirical Methods in Natural Language Processing, 2018
-
[22]
Laion-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022
-
[23]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, Mingchuan Zhang, Y. K. Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ArXiv, abs/2402.03300, 2024
-
[24]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...
-
[25]
Aligning Large Multimodal Models with Factually Augmented RLHF
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented rlhf. ArXiv, abs/2309.14525, 2023
-
[26]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Ba...
-
[27]
Cider: Consensus-based image description evaluation
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2014
-
[28]
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, and Zhaoxiang Zhang. Grasp any region: Towards precise, contextual pixel understanding for multimodal llms. ArXiv, abs/2510.18876, 2025
-
[29]
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. ArXiv, abs/2408.15556, 2024
-
[30]
Vicrit: A verifiable reinforcement learning proxy task for visual perception in vlms
Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, and Lijuan Wang. Vicrit: A verifiable reinforcement learning proxy task for visual perception in vlms, 2025
-
[31]
xAI. Realworldqa. https://huggingface.co/datasets/xai-org/RealworldQA, 2024. Hugging Face dataset
-
[32]
Caprl: Stimulating dense image caption capabilities via reinforcement learning
Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, and Dahua Lin. Caprl: Stimulating dense image caption capabilities via reinforcement learning, 2025
-
[33]
CaptionQA: Is Your Caption as Useful as the Image Itself?
Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, and Chenfeng Xu. Captionqa: Is your caption as useful as the image itself? ArXiv, abs/2511.21025, 2025
-
[34]
Sc-captioner: Improving image captioning with self-correction by reinforcement learning
Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025
-
[35]
Focus: Internal mllm representations for efficient fine-grained visual question answering
Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, and Leo Schwinn. Focus: Internal mllm representations for efficient fine-grained visual question answering, 2025.
A Limitations
Dependence on strong multimodal judges. CLAIMDIFF-RL relies on a strong multimodal judge to identify actor–reference difference...
-
[40]
Clarity and specificity without unnecessary ambiguity or repetition.
Important rules:
  - A correct detail in the actor caption should not be penalized merely because it is absent from the reference.
  - A detail in the reference should not be rewarded unless it is supported by the image.
  - Penalize hallucination more than omission.
  - Penalize strategic hedgi...
-
[41]
Visual factual correctness
-
[42]
Coverage of salient image content
-
[43]
Correct attributes, counts, spatial relations, OCR/text, and identities
-
[44]
Avoidance of hallucinated objects, attributes, or relations
-
[45]
Important rules: - Penalize hallucination more than omission
Clarity and specificity without unnecessary ambiguity or repetition.
Important rules:
  - Penalize hallucination more than omission.
  - Penalize strategic hedging when the image evidence is clear.
  - Do not reward length by itself.
  - Do not reward flowery style by itself.
Actor caption: {actor_caption}
Return exactly this format: SCORE: <integer from 0 to 10>...
-
[46]
A contradiction is a candidate claim that conflicts with the reference
Difference detection. The judge compares R and C and emits each detected difference as one of three types: contradiction, extra_info, or missing_fact. A contradiction is a candidate claim that conflicts with the reference. An extra_info item is a candidate claim not mentioned in the reference. A missing_fact is a fact in the reference that is absent from t...
-
[47]
missing" and is_hallucination=false Strict mapping rules for is_hallucination: - type=
Image verification. For each contradiction or extra_info item, the judge verifies the candidate-side claim against the image and assigns one of three verification labels: verified, false, or ambiguous. The label verified means the image supports the candidate claim, false means the image contradicts it, and ambiguous means the image is insufficiently infor...
-
[48]
Only judge visually grounded content
-
[49]
Do not penalize missing minor background details
-
[50]
Do not hallucinate facts in the checklist
-
[51]
For hallucination, judge only claims made by the caption
-
[52]
For missing facts, judge only important image facts
-
[53]
If visual evidence is ambiguous, mark UNCERTAIN
-
[54]
Keep all claims atomic.
Model Caption: {pred_caption}
Return your answer in valid JSON format with this structure:
{ "claims": [ { "claim": "<atomic claim from caption>", "aspect": "<object | attribute | count | spatial | action | text_ocr | identity | scene | style | other>", "judgment": "<SUPPORTED | HALLUCINATION | UNCERTAIN>", "evidence": "<brief visu...
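To show how a response in the JSON structure requested by the prompt above might be consumed, here is a minimal Python sketch. It uses only the fields visible before the prompt is cut off ("claims", "claim", "aspect", "judgment"). The function name `summarize_claims`, the example payload, and the choice to exclude UNCERTAIN claims from the rate's denominator are assumptions for illustration, not details taken from the paper.

```python
import json
from collections import Counter


def summarize_claims(raw_json: str) -> dict:
    """Aggregate per-claim judgments from a judge response."""
    claims = json.loads(raw_json).get("claims", [])
    judgments = Counter(c["judgment"] for c in claims)
    hallucinated_aspects = Counter(
        c["aspect"] for c in claims if c["judgment"] == "HALLUCINATION"
    )
    # Assumption: UNCERTAIN claims are left out of the denominator.
    decided = judgments["SUPPORTED"] + judgments["HALLUCINATION"]
    rate = judgments["HALLUCINATION"] / decided if decided else 0.0
    return {
        "judgments": dict(judgments),
        "hallucination_rate": rate,
        "hallucinated_aspects": dict(hallucinated_aspects),
    }


# Hypothetical judge response, written by hand for this example.
example = json.dumps({
    "claims": [
        {"claim": "A dog sits on a sofa", "aspect": "object", "judgment": "SUPPORTED"},
        {"claim": "The sofa is red", "aspect": "attribute", "judgment": "SUPPORTED"},
        {"claim": "There are three cushions", "aspect": "count", "judgment": "HALLUCINATION"},
        {"claim": "A lamp is behind the sofa", "aspect": "spatial", "judgment": "UNCERTAIN"},
    ]
})
summary = summarize_claims(example)
print(summary["hallucination_rate"], summary["hallucinated_aspects"])
```

Grouping hallucinations by aspect is what makes per-dimension diagnosis (counts, spatial relations, and so on) possible from the same judge output.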