ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Haochen Wang; Hongyang Tang; Jiacheng Chen; Rongxin Guo; Shaoxiang Chen; Tianle Li; Xuyang Shen; Yan Ma; Yu Cheng; Yucong Zhou

arXiv: 2605.20278 · v2 · pith:RLZYC2GBnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-21 08:32 UTC · model grok-4.3

keywords: image captioning, reinforcement learning, hallucination, claim verification, fine-grained rewards, vision-language models, multimodal evaluation, factuality

The pith

ClaimDiff-RL uses verified differences between individual visual claims as the reward unit in reinforcement learning for image captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-form image captions are hard to train because standard rewards judge whole sequences at once and hide whether errors come from invented details or omitted facts. The paper introduces a method that compares an actor caption to a reference, breaks the differences into atomic claims, and has a multimodal judge check each one against the actual image while labeling its error type and severity. This produces separate, tunable signals for hallucination and missing content instead of a single compressed score. Experiments on diagnostic sets and standard benchmarks show the approach reaches operating points that reduce hallucinations without increasing omissions as much as holistic rewards do. The same training preserves overall model ability and improves results on specific tasks like counting objects and describing spatial relations.

Core claim

ClaimDiff-RL replaces sequence-level scalar rewards with reference-conditioned atomic claim differences. A multimodal judge enumerates visually grounded differences between the generated caption and a reference caption, verifies each difference directly against the image, assigns open-vocabulary error types and severity levels, and supplies per-difference statistics that are composed into the reinforcement learning reward. This decomposition makes hallucinated claims and omitted salient facts separately measurable and adjustable, exposing the faithfulness-coverage tradeoff that holistic rewards obscure and enabling more balanced captioning models.
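
To make the decomposition concrete, here is a minimal Python sketch of how typed, per-difference judgments might be composed into a scalar reward. The `ClaimDiff` record, the severity levels and their weights, the two error labels, and the `w_halluc` / `w_missing` knobs are all illustrative assumptions; this summary does not give the paper's actual schema or composition rule.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed severity weights; the paper's real values are not stated in this summary.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0}


@dataclass
class ClaimDiff:
    """One verified actor-vs-reference difference (hypothetical schema)."""
    aspect: str                 # visual aspect, e.g. "object count"
    actor_claim: str            # what the actor caption says ("" if it says nothing)
    reference_claim: str        # what the reference caption says
    actor_error: Optional[str]  # "hallucination", "missing", or None if the actor is right
    severity: str               # key into SEVERITY_WEIGHT


def weighted_errors(diffs: List[ClaimDiff], kind: str) -> float:
    """Sum severity weights over differences whose actor-side error is `kind`."""
    return sum(SEVERITY_WEIGHT[d.severity] for d in diffs if d.actor_error == kind)


def compose_reward(diffs: List[ClaimDiff],
                   w_halluc: float = 1.0,
                   w_missing: float = 1.0) -> float:
    """Negative weighted error count; the two weights tune the tradeoff separately."""
    h = weighted_errors(diffs, "hallucination")
    m = weighted_errors(diffs, "missing")
    return -(w_halluc * h + w_missing * m)


# Made-up example judgments for a single caption.
diffs = [
    ClaimDiff("object count", "three dogs", "two dogs", "hallucination", "high"),
    ClaimDiff("background", "", "a red barn behind the fence", "missing", "low"),
    ClaimDiff("color", "blue car", "green car", None, "medium"),
]
print(compose_reward(diffs))                # equal weights
print(compose_reward(diffs, w_halluc=2.0))  # penalize hallucination more heavily
```

Keeping the two error totals separate until the final line is the point of the sketch: a holistic scalar judge would return only the last number, whereas here each side can be logged and reweighted on its own.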

What carries the argument

The multimodal judge that enumerates visually grounded differences between actor and reference captions, verifies each against the image, and assigns open-vocabulary error types plus severity levels to generate per-difference statistics for reward composition.

If this is right

Holistic scalar rewards can reduce hallucination by increasing missing facts, whereas claim-difference rewards let training reach better-balanced operating points on both dimensions.
On the 160-image human-labeled diagnostic benchmark the method improves the measured hallucination-missing-fact balance compared with scalar-reward baselines.
General captioning and VQA performance on public benchmarks remains intact rather than degrading.
The resulting models surpass Gemini-3-Pro-Preview on fine-grained dimensions including object counting, spatial relations, and scene recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same claim-difference machinery could be applied to other generation domains where local factual errors matter, such as video description or medical report writing.
If the judge remains reliable, training logs could expose which specific error types the model is still making, guiding targeted data collection.
Captioning models trained this way may develop stronger internal verification habits that transfer to new images without references.
Future systems might embed similar claim verification steps at inference time to self-correct before final output.

Load-bearing premise

The multimodal judge can reliably enumerate, verify against the image, and correctly type and score the differences without introducing its own systematic errors or biases.

What would settle it

A side-by-side human evaluation on held-out caption pairs showing that the judge's error-type assignments and verification decisions disagree with humans at rates high enough to reverse the reported balance improvements on the 160-image diagnostic benchmark.
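
One way such a comparison could be summarized is with raw agreement plus a chance-corrected statistic. The sketch below computes Cohen's kappa between judge labels and human labels over the same set of differences; the label names and the example data are invented for illustration and do not come from the paper.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(judge: Sequence[str], human: Sequence[str]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two aligned label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    # Agreement expected by chance if both raters labeled independently.
    expected = sum(jc[c] * hc[c] for c in set(jc) | set(hc)) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


# Made-up labels for six differences.
judge = ["hallucination", "missing", "none", "hallucination", "missing", "none"]
human = ["hallucination", "missing", "none", "missing", "missing", "none"]
obs, kappa = agreement_and_kappa(judge, human)
print(f"observed agreement = {obs:.3f}, kappa = {kappa:.3f}")
```

Kappa is used here rather than raw agreement because error-type labels are often imbalanced, and a judge that mostly emits the majority label can look accurate by chance.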

Figures

Figures reproduced from arXiv: 2605.20278 by Haochen Wang, Hongyang Tang, Jiacheng Chen, Rongxin Guo, Shaoxiang Chen, Tianle Li, Xuyang Shen, Yan Ma, Yu Cheng, Yucong Zhou.

**Figure 1.** Overview of CLAIMDIFF-RL. Unlike direct scalar judging, CLAIMDIFF-RL verifies actor–reference visual differences against the image and composes typed side-specific errors into scalar rewards, making the hallucination–coverage tradeoff explicit.

**Figure 2.** Overview of CLAIMDIFF-RL. Actor–reference differences are verified against the image to produce side-specific typed errors, which are composed into relative or actor-only scalar rewards for group-normalized RL optimization. Each difference d_i contains a visual aspect, the actor-side claim, the reference-side claim, an image-grounded judgment, and side-specific error descriptions.

**Figure 3.** Hallucination and missing-fact trends across RL training steps.

**Figure 4.** Training dynamics of reward, response length, and reference-side weighted errors.
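
The Figure 2 caption mentions group-normalized RL optimization. As a rough illustration, the sketch below standardizes the scalar rewards of several captions sampled for the same image, in the style of GRPO-like methods. That this matches the paper's exact normalization is an assumption, and `eps` is a hypothetical stabilizer.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of sampled captions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four captions sampled for the same image.
rewards = [-4.0, -1.0, -7.0, -2.0]
adv = group_normalized_advantages(rewards)
for r, a in zip(rewards, adv):
    print(f"reward {r:+.1f} -> advantage {a:+.3f}")
```

Because only within-group differences survive normalization, the absolute scale of the composed reward matters less than how it ranks captions for the same image.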

Original abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness–coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination–missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClaimDiff-RL breaks caption rewards into per-claim differences to expose the hallucination-missing-fact tradeoff, but the whole approach rests on an uncalibrated multimodal judge.

The main takeaway is that ClaimDiff-RL uses differences in individual visual claims between a generated caption and a reference as the reward signal for RL. This gives finer control over hallucinations versus missing details than whole-caption rewards do. The paper shows that standard scalar rewards often fix one problem by worsening the other, and that the claim-based approach reaches more balanced results. The authors report gains on a small diagnostic set and some public benchmarks, including better performance on counting and spatial relations than a strong baseline like Gemini.

What stands out is the explicit separation of error types in the reward composition. That framing is distinct from prior work on preference optimization or scalar factuality scores.

The weak point is the multimodal judge that generates these claims and labels. The description does not include any human validation of the judge's accuracy or consistency on the diagnostic images. Without that, it is hard to know whether the improvements come from better RL or from the judge's particular biases in what it flags as differences and how it scores severity. The abstract also gives no detail on how the reward weights are set and reports no statistical tests.

This work is for people building RL systems for vision-language generation who care about diagnosable rewards. It is worth sending to peer review because the granularity issue is real and the proposed unit is a concrete step forward, though the judge's reliability needs more evidence before the claims stick.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ClaimDiff-RL, a reinforcement learning framework for long-form image captioning that addresses reward granularity by using reference-conditioned atomic claim differences as the reward unit. A multimodal judge enumerates visually grounded differences between an actor caption and a reference caption, verifies each against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This enables separate measurement and tuning of hallucination and missing-fact errors. Experiments on a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA tasks show improved hallucination-missing-fact balance, preserved general capability, and outperformance of Gemini-3-Pro-Preview on fine-grained dimensions such as object counting, spatial relations, and scene recognition.

Significance. If the multimodal judge proves reliable, the framework could meaningfully advance fine-grained RL for captioning by replacing holistic scalar or preference-based rewards with typed, verifiable claim-level signals. The explicit exposure and balancing of the faithfulness-coverage tradeoff, along with targeted gains on specific visual capabilities, would represent a useful contribution to multimodal generation. The approach's diagnosability is a strength, but its significance depends on demonstrating that improvements stem from the method rather than judge artifacts.

major comments (2)

[Abstract] The reported improvements on the 160-image diagnostic benchmark and public sets provide no quantitative details on judge accuracy, reward composition weights, or statistical significance. Without these, it is difficult to determine whether the claimed gains in hallucination-missing-fact balance are robust or attributable to the proposed RL objective.
[Method overview (multimodal judge component)] The reward signal is constructed directly from the judge's per-difference outputs (enumeration, image verification, error typing, severity). No calibration against human annotations on the diagnostic set, inter-annotator agreement metrics, or ablation studies on judge model choice are described. If the judge systematically over- or under-counts certain relations or objects, the RL process would amplify those biases rather than optimize true visual claims.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete numerical result (e.g., delta in hallucination rate or missing-fact rate) to support the balance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications based on the manuscript and committing to revisions that strengthen the presentation of our results and method without altering the core claims.

Referee: [Abstract] The reported improvements on the 160-image diagnostic benchmark and public sets provide no quantitative details on judge accuracy, reward composition weights, or statistical significance. Without these, it is difficult to determine whether the claimed gains in hallucination-missing-fact balance are robust or attributable to the proposed RL objective.

Authors: We agree that the abstract would benefit from these quantitative details to better contextualize the improvements. The 160-image diagnostic benchmark is human-labeled precisely to support such evaluation, and the main text reports reward composition and performance metrics. In the revision we will expand the abstract to include judge accuracy figures (e.g., agreement with human labels), the exact reward weights used, and statistical significance results (e.g., paired t-test p-values) for the hallucination-missing-fact balance improvements. revision: yes

Referee: [Method overview (multimodal judge component)] The reward signal is constructed directly from the judge's per-difference outputs (enumeration, image verification, error typing, severity). No calibration against human annotations on the diagnostic set, inter-annotator agreement metrics, or ablation studies on judge model choice are described. If the judge systematically over- or under-counts certain relations or objects, the RL process would amplify those biases rather than optimize true visual claims.

Authors: The diagnostic benchmark was collected with human annotations specifically to enable calibration of the judge outputs. We will add a new subsection that reports calibration results against these human labels, inter-annotator agreement statistics, and an ablation on judge model choice (comparing at least two VLMs). We will also expand the discussion to address potential systematic biases, noting that every difference is image-verified and that the typed, per-claim nature of the reward allows post-hoc inspection. These additions will directly respond to the concern about bias amplification. revision: yes
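
The significance testing promised in the first response could take several forms. The sketch below uses an exact paired sign-flip permutation test rather than the paired t-test the rebuttal mentions, because it needs only the Python standard library and makes no normality assumption. The per-image error counts are invented for illustration and are not results from the paper.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact p-value for paired samples under random sign flips of the differences."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every sign assignment: 2**n cases, so only feasible for small n.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Made-up per-image hallucination counts for eight images.
baseline = [3, 2, 4, 1, 3, 2, 5, 2]
claimdiff = [1, 1, 2, 1, 2, 0, 3, 1]
p = paired_sign_flip_pvalue(baseline, claimdiff)
print(f"p = {p:.6f}")
```

Exact enumeration grows as 2^n, so on a set the size of the 160-image benchmark one would sample random sign flips instead of enumerating them all.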

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces ClaimDiff-RL as a methodological framework that decomposes caption rewards into per-claim differences produced by a multimodal judge. This is presented as an engineering choice for granularity rather than a mathematical derivation or prediction that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains that bear the central load are visible in the abstract or described text. The reported improvements rest on empirical comparisons against external benchmarks and a human-labeled diagnostic set, keeping the approach self-contained against outside evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on the existence and reliability of a multimodal judge that can perform open-vocabulary claim verification; no free parameters or invented entities are explicitly quantified in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1050 out tokens · 36146 ms · 2026-05-21T08:32:46.131828+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "CLAIMDIFF-RL improves the hallucination–missing-fact balance... on object counting, spatial relations, and scene recognition"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.