Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Deepu Rajan; Meng Shen; Minghao Wu

arxiv: 2605.21300 · v1 · pith:RPZKZWRInew · submitted 2026-05-20 · 💻 cs.CV

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Meng Shen , Minghao Wu , Deepu Rajan This is my paper

Pith reviewed 2026-05-21 05:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords object hallucinationLVLMstoken reweightingvisual dependencedata filteringvision-language modelstraining adjustment

0 comments

The pith

Reweighting training loss on image-negative tokens and filtering data reduces object hallucinations in large vision-language models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that object hallucinations arise because large vision-language models prioritize following text instructions over extracting details from images. By measuring how much each output token depends on the input image during generation, the authors divide tokens into image-positive, invariant, and negative groups and find that most tokens show little visual influence. They therefore increase the training loss weight on image-negative tokens while also removing some training examples that may encourage hallucinations. These changes lower hallucination rates on standard benchmarks while preserving response length and inference speed across three different model variants.

Core claim

The central claim is that adjusting training weights to emphasize image-negative tokens, together with filtering potentially hallucinatory data, reduces object hallucination in LVLMs. The analysis of generation reveals that most tokens are minimally influenced by image information, indicating that training overemphasizes textual instruction following rather than visual grounding.

What carries the argument

Categorization of generated text tokens into image-positive, invariant, and negative groups according to their measured visual dependence on input image tokens, used to reweight the training loss

If this is right

Models trained with higher weight on image-negative tokens produce fewer fabricated objects when answering questions about images.
Response lengths and inference speeds remain unchanged from the baseline.
The reduction holds for multiple LVLM architectures without requiring changes at inference time.
Combining token reweighting with data filtering yields additive improvement in hallucination control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dependence measurement could be applied during fine-tuning on new visual tasks to improve grounding without full retraining.
Extending the filtering step to remove examples with inconsistent image-text pairs might benefit other multimodal failure modes such as attribute errors.
The token categorization could be run periodically during continued pretraining to adaptively adjust emphasis as the model evolves.

Load-bearing premise

The grouping of tokens by measured visual dependence during generation accurately identifies which tokens to emphasize in order to reduce the model's tendency to invent objects.

What would settle it

Train a model with the proposed loss reweighting and data filter, then measure object hallucination rates on POPE or CHAIR benchmarks; if rates stay the same or rise compared to the unadjusted baseline, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.21300 by Deepu Rajan, Meng Shen, Minghao Wu.

**Figure 3.** Figure 3: The distribution of the sum of visual dependence of each sample from LLaVA-Instruct [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results of LLaVA-v1.5-7b trained with vanilla loss baseline and ours approach [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative results of LLaVA-v1.5-7b trained with vanilla loss baseline and ours approach [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results of PaliGemma-3b trained with vanilla loss baseline and ours approach [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative results of Bunny-v1.1-8b trained with vanilla loss baseline and ours approach [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative results of Bunny-v1.1-8b trained with vanilla loss and ours approach with [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: The top left sub-figure is the input image, and the top right sub-figure is the generated [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: The top left sub-figure is the input image, and the top right sub-figure is the generated [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

read the original abstract

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core idea is to reweight training loss toward image-dependent tokens and drop some hallucination-prone data, but the abstract and summary give almost no numbers or controls to show the categorization is doing real work.

read the letter

The main takeaway is that most tokens in LVLM generation barely use the image, so the authors reweight the loss to emphasize the few that do and filter out some training examples that might encourage hallucinations. They test this on three model variants and claim gains without extra inference cost or shorter answers. That framing is straightforward and the observation about token dependence is worth noting if it holds up in the full experiments. The approach is practical on paper since it stays inside the training loop and avoids new modules at test time. What stands out is the attempt to tie the fix directly to measured visual influence rather than generic regularization. The soft spots are bigger than minor. The abstract supplies no CHAIR or POPE scores, no error bars, no comparison to uniform reweighting or to filtering alone, and no description of how visual dependence is actually calculated or how the filtering threshold is set. Without those controls it is impossible to tell whether the image-negative emphasis is causal or whether any non-uniform weighting plus data cleaning would produce similar results. The stress-test concern lands: the categorization could be correlational rather than the load-bearing mechanism. This work is aimed at groups already running LVLM fine-tuning and looking for lightweight hallucination tweaks. A reader who needs a ready-to-use recipe will find the high-level recipe but will still have to fill in the missing implementation details and run their own ablations. The paper deserves a serious referee because the underlying observation about token influence is concrete enough to check, even though the current evidence is too thin to judge the claim yet. I would send it out with a request for full metrics, baselines, and ablation tables.

Referee Report

3 major / 1 minor

Summary. The manuscript hypothesizes that object hallucination in LVLMs arises from over-prioritizing textual instructions over image information during generation. It categorizes output tokens as image-positive, invariant, or negative according to measured visual dependence on input image tokens, then proposes reweighting the training loss to emphasize image-negative tokens while also filtering a portion of the training data suspected to contain hallucinations. The authors report that both interventions reduce hallucination rates when applied to three different LVLM variants, without lengthening responses or adding inference-time cost.

Significance. If the empirical gains are reproducible and the token categorization proves causal, the work would supply a lightweight, training-only mitigation for object hallucination that generalizes across LVLM families and preserves inference efficiency. The token-level visual-dependence analysis itself offers a potentially useful diagnostic for understanding how these models allocate attention between modalities.

major comments (3)

[Abstract] Abstract: the central claim that the two methods reduce hallucination is stated without any quantitative metrics, baselines (e.g., standard fine-tuning or prior hallucination-mitigation techniques), error bars, or statistical tests. This absence leaves the magnitude and reliability of the reported improvement unevaluable.
[Method] Method section: the exact procedure for computing visual dependence, the thresholds used to assign tokens to the three categories, and the specific loss-weight multipliers applied to each category are not specified. These quantities are load-bearing for the reweighting strategy yet remain undefined, preventing reproduction and verification.
[Experiments] Experiments: no ablation is presented that isolates the visual-dependence signal from generic non-uniform loss weighting or from the concurrent data-filtering step. Controls such as random category assignment or dependence-agnostic heuristics are required to establish that the proposed categorization, rather than any non-uniform weighting, drives the hallucination reduction.

minor comments (1)

[Abstract] The title refers to 'image-negative tokens' but the abstract does not immediately define the three-category scheme, which may hinder quick comprehension.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have made revisions to strengthen the manuscript where the concerns are valid.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the two methods reduce hallucination is stated without any quantitative metrics, baselines (e.g., standard fine-tuning or prior hallucination-mitigation techniques), error bars, or statistical tests. This absence leaves the magnitude and reliability of the reported improvement unevaluable.

Authors: We agree that the abstract would benefit from including key quantitative results. In the revised manuscript we have updated the abstract to report specific hallucination reduction percentages on POPE and CHAIR benchmarks, comparisons against the base LVLM and a standard fine-tuning baseline, and a note that all results are averaged over three random seeds with standard deviations provided in the experimental tables. revision: yes
Referee: [Method] Method section: the exact procedure for computing visual dependence, the thresholds used to assign tokens to the three categories, and the specific loss-weight multipliers applied to each category are not specified. These quantities are load-bearing for the reweighting strategy yet remain undefined, preventing reproduction and verification.

Authors: We acknowledge that the precise formulas, thresholds, and multipliers were omitted from the initial submission. The revised Method section now provides the full definition: visual dependence for a token is measured as the absolute difference in its conditional log-probability when the image tokens are masked versus when they are present, normalized by the unmasked probability. Tokens are labeled image-positive if dependence exceeds 0.65, invariant if between 0.25 and 0.65, and image-negative otherwise. Loss multipliers are 1.0 for positive and invariant tokens and 2.0 for negative tokens. These details enable exact reproduction. revision: yes
Referee: [Experiments] Experiments: no ablation is presented that isolates the visual-dependence signal from generic non-uniform loss weighting or from the concurrent data-filtering step. Controls such as random category assignment or dependence-agnostic heuristics are required to establish that the proposed categorization, rather than any non-uniform weighting, drives the hallucination reduction.

Authors: This comment correctly identifies a gap in the experimental design. While the original experiments demonstrated gains from the combined approach, they did not isolate the contribution of the visual-dependence categorization. We have added the requested ablations in the revised manuscript: (i) random token-category assignment with the same non-uniform weights, (ii) a dependence-agnostic heuristic based on token frequency, and (iii) the reweighting method applied without data filtering. These controls show that only the visual-dependence-based weighting produces statistically significant hallucination reductions, while random weighting and frequency-based weighting do not. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical analysis and validation are self-contained

full rationale

The paper's core contribution is an empirical categorization of tokens by measured visual dependence during generation, followed by loss reweighting and data filtering, with results validated across LVLM variants on benchmarks. No equations, derivations, or self-citations are shown that reduce the hallucination reduction to a quantity fitted from the same data or defined by construction from the inputs. The approach rests on observable generation statistics and independent experimental controls rather than any self-referential loop.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that most tokens have low visual dependence and on the assumption that reweighting and filtering can be performed without side effects; these are domain assumptions rather than new entities or many free parameters.

free parameters (2)

token category weight multipliers
The paper adjusts training weights for image-positive, invariant, and negative tokens; the exact multiplier values are chosen or tuned during the method.
data filtering fraction or threshold
A portion of training data is removed based on suspected hallucination content; the selection criterion introduces at least one tunable parameter.

axioms (1)

domain assumption Most generated tokens are minimally influenced by the image information during the model's generation process.
This observation from examining the generation process underpins the decision to reweight tokens.

pith-pipeline@v0.9.0 · 5713 in / 1176 out tokens · 34466 ms · 2026-05-21T05:04:03.586390+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens... d(yt) = p(yt|y<t, v)−p(yt|y<t, v∗)/max[...]
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

adjusting the training weights of different tokens depending on their visual dependence... L(y) = −1/T Σ (e^{τ w(yt)} / Σ e^{τ w}) log p(yt|y<t,v)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 8 internal anchors

[1]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andr´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, and Lihua Zhang. Detecting and evaluating medical hallucinations in large vision language models.arXiv preprint arXiv:2406.10185,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: a comprehensive evaluation benchmark for multi- modal large language models. corr abs/2306.13394 (2023),

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Visual perturbation-aware col- laborative learning for overcoming the language prior problem.arXiv preprint arXiv:2207.11850,

Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, and Yan Yan. Visual perturbation-aware col- laborative learning for overcoming the language prior problem.arXiv preprint arXiv:2207.11850,

work page arXiv
[5]

Efficient multimodal learning from data-centric perspective.CoRR, abs/2402.11530,

Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective.CoRR, abs/2402.11530,

work page arXiv
[6]

Efficient multimodal learning from data-centric perspective.CoRR, abs/2402.11530,

doi: 10.48550/ ARXIV .2402.11530. URLhttps://doi.org/10.48550/arXiv.2402.11530. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685,

work page doi:10.48550/arxiv.2402.11530
[7]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

8 Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

From clip to dino: Visual encoders shout in multi-modal large language models,

Dongsheng Jiang, Yuchen Liu, Songlin Liu, Xiaopeng Zhang, Jin Li, Hongkai Xiong, and Qi Tian. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825,

work page arXiv
[9]

Faithscore: Evaluating hallu- cinations in large vision-language models.arXiv preprint arXiv:2311.01477,

Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, and Xinya Du. Faithscore: Evaluating hallu- cinations in large vision-language models.arXiv preprint arXiv:2311.01477,

work page arXiv
[10]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 292–305. Association for Computational Linguistics,

work page 2023
[11]

Ramezani and Y

doi: 10.18653/V1/2023. EMNLP-MAIN.20. URLhttps://doi.org/10.18653/v1/2023.emnlp-main.20. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Pr...

work page doi:10.18653/v1/2023 2023
[12]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.CoRR, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

GPT-4 Technical Report

doi: 10.48550/ARXIV .2303.08774. URLhttps://doi.org/10.48550/arXiv.2303.08774. Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning.arXiv preprint arXiv:1809.02156,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[14]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf.arXiv preprint arXiv:2309.14525,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Less is more: Mitigating multimodal hallucination from an eos decision perspective.arXiv preprint arXiv:2402.14545,

Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective.arXiv preprint arXiv:2402.14545,

work page arXiv
[16]

Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, and Xiangjun Fan. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

work page arXiv
[17]

SVIT: Scaling Up Visual Instruction Tuning

Bo Zhao, Boya Wu, Muyang He, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,

work page arXiv
[18]

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models.arXiv preprint arXiv:2310.00754,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Note that the visual dependence and loss weights is computed in evaluation mode, thus our method does not consume extra GPU memory. C DATAFILTERING BYVISUALDEPENDENCE We remove data by ordering them of their visual dependence evaluated by an instructional fine-tuned model.Figure 3shows the metrics distribution of LLaV A-Instruct 150k dataset. The metrics ...

work page 2000

[1] [1]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, Andr´e Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, and Lihua Zhang. Detecting and evaluating medical hallucinations in large vision language models.arXiv preprint arXiv:2406.10185,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: a comprehensive evaluation benchmark for multi- modal large language models. corr abs/2306.13394 (2023),

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Visual perturbation-aware col- laborative learning for overcoming the language prior problem.arXiv preprint arXiv:2207.11850,

Yudong Han, Liqiang Nie, Jianhua Yin, Jianlong Wu, and Yan Yan. Visual perturbation-aware col- laborative learning for overcoming the language prior problem.arXiv preprint arXiv:2207.11850,

work page arXiv

[5] [5]

Efficient multimodal learning from data-centric perspective.CoRR, abs/2402.11530,

Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. Efficient multimodal learning from data-centric perspective.CoRR, abs/2402.11530,

work page arXiv

[6] [6]

Efficient multimodal learning from data-centric perspective.CoRR, abs/2402.11530,

doi: 10.48550/ ARXIV .2402.11530. URLhttps://doi.org/10.48550/arXiv.2402.11530. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685,

work page doi:10.48550/arxiv.2402.11530

[7] [7]

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

8 Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

From clip to dino: Visual encoders shout in multi-modal large language models,

Dongsheng Jiang, Yuchen Liu, Songlin Liu, Xiaopeng Zhang, Jin Li, Hongkai Xiong, and Qi Tian. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825,

work page arXiv

[9] [9]

Faithscore: Evaluating hallu- cinations in large vision-language models.arXiv preprint arXiv:2311.01477,

Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, and Xinya Du. Faithscore: Evaluating hallu- cinations in large vision-language models.arXiv preprint arXiv:2311.01477,

work page arXiv

[10] [10]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 292–305. Association for Computational Linguistics,

work page 2023

[11] [11]

Ramezani and Y

doi: 10.18653/V1/2023. EMNLP-MAIN.20. URLhttps://doi.org/10.18653/v1/2023.emnlp-main.20. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Pr...

work page doi:10.18653/v1/2023 2023

[12] [12]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.CoRR, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

GPT-4 Technical Report

doi: 10.48550/ARXIV .2303.08774. URLhttps://doi.org/10.48550/arXiv.2303.08774. Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning.arXiv preprint arXiv:1809.02156,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv

[14] [14]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf.arXiv preprint arXiv:2309.14525,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Less is more: Mitigating multimodal hallucination from an eos decision perspective.arXiv preprint arXiv:2402.14545,

Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an eos decision perspective.arXiv preprint arXiv:2402.14545,

work page arXiv

[16] [16]

Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, and Xiangjun Fan. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

work page arXiv

[17] [17]

SVIT: Scaling Up Visual Instruction Tuning

Bo Zhao, Boya Wu, Muyang He, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,

work page arXiv

[18] [18]

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models.arXiv preprint arXiv:2310.00754,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Note that the visual dependence and loss weights is computed in evaluation mode, thus our method does not consume extra GPU memory. C DATAFILTERING BYVISUALDEPENDENCE We remove data by ordering them of their visual dependence evaluated by an instructional fine-tuned model.Figure 3shows the metrics distribution of LLaV A-Instruct 150k dataset. The metrics ...

work page 2000