arxiv: 2604.18803 · v3 · submitted 2026-04-20 · 💻 cs.CV · cs.AI


LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li


Pith reviewed 2026-05-10 05:07 UTC · model grok-4.3


keywords vision-language models · hallucination · prompt tone · benchmark · negative ground truth · LLM judge · VLM evaluation · fabrication


The pith

Several vision-language models fabricate details about absent objects most often under moderately directive prompts, not under neutral or maximally forceful ones, and the rate and intensity of fabrication diverge across models and tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Ghost-100, a benchmark of 800 images where the queried element is guaranteed absent by construction across text, time, and object tasks, then tests each image with five prompts that increase only in linguistic force. It tracks both the fraction of responses that shift to unsupported claims and the judged strength of those claims. Results across nine models show that how often models fabricate often separates from how confidently they do so, that reading tasks and absence tasks react differently to the same tone changes, and that several models reach highest fabrication rates at middle tone levels rather than the strongest. Readers should care because real deployments involve varied user language that can push models past reliable visual grounding in ways neutral tests do not catch.

Core claim

We introduce Ghost-100, a procedurally generated set of 800 images across eight categories in three task families, each image built so the target is absent, illegible, or indeterminate. Every image is paired with five prompts that hold the visual content and task fixed while escalating only directive tone. A rule-based H-Rate counts the share of responses that move from grounded refusal to positive commitment, while a GPT-4o-mini H-Score rates the confidence and specificity of any fabrication on a 1-5 scale. Evaluation of nine open-weight VLMs reveals that H-Rate and H-Score diverge across model families, reading-style and presence-detection subsets respond differently to prompt pressure, and several models show non-monotonic sensitivity that peaks at intermediate tone levels.
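The two metrics described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Response` record, its field names, and the toy data are hypothetical, and averaging H-Score only over responses that received a judge rating is an assumption based on the phrase "once it occurs" in the abstract.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Response:
    tone_level: int             # 1 (weakest) .. 5 (strongest) in the five-level framework
    committed: bool             # hypothetical rule output: True if the answer asserts the absent target
    judge_score: Optional[int]  # hypothetical 1-5 judge rating; None when nothing was fabricated


def h_rate(responses):
    """Share of responses that cross from grounded refusal into positive commitment."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r.committed for r in responses) / len(responses)


def h_score(responses):
    """Mean judge rating over responses that were scored (assumed: fabrications only)."""
    scores = [r.judge_score for r in responses if r.judge_score is not None]
    return mean(scores) if scores else None


def by_tone(responses):
    """Return {tone_level: (H-Rate, H-Score)} so the two tracks can be compared per level."""
    out = {}
    for lvl in sorted({r.tone_level for r in responses}):
        subset = [r for r in responses if r.tone_level == lvl]
        out[lvl] = (h_rate(subset), h_score(subset))
    return out


# Toy data, invented for illustration only.
sample = [
    Response(1, False, None),
    Response(1, False, None),
    Response(3, True, 2),
    Response(3, True, 4),
    Response(5, True, 5),
    Response(5, False, None),
]
table = by_tone(sample)
```

In this toy table, level 3 has the higher rate while level 5 has the higher score, which is the kind of rate and intensity split the dual-track design is meant to expose.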

What carries the argument

The 5-Level Prompt Intensity Framework, which fixes the image and task identity while varying only the directive force of the prompt to isolate tone as the independent variable, together with the dual H-Rate incidence measure and H-Score intensity measure.
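A rough sketch of how one image might be paired with five prompts that differ only in tone. The template wording, the `build_prompts` helper, and the image identifier are all hypothetical; the source does not give the actual prompt text for any level.

```python
# Hypothetical wording: the real templates for the five levels are not given in the source.
TONE_TEMPLATES = {
    1: "{question}",
    2: "Please answer: {question}",
    3: "You should be able to tell. {question}",
    4: "Give a definite answer. {question}",
    5: "You must answer and may not refuse. {question}",
}


def build_prompts(image_id, question):
    """Pair one image with five prompts; only the directive framing changes."""
    return [
        {
            "image_id": image_id,
            "tone_level": level,
            "prompt": template.format(question=question),
        }
        for level, template in sorted(TONE_TEMPLATES.items())
    ]


prompts = build_prompts("img_0001", "What time does the clock show?")
```

Keeping the question string and image identifier constant across all five records is what lets tone be treated as the only variable that changes.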

If this is right

Neutral-prompt hallucination tests will miss tone-driven behaviors that appear only under moderate or strong directive language.
Reading-style tasks and presence-detection tasks must be evaluated separately because they show qualitatively different responses to the same prompt pressure.
Non-monotonic sensitivity means that increasing prompt force does not produce steadily rising or falling hallucination rates in every model.
Aggregate hallucination metrics can obscure important differences between how often versus how strongly models fabricate.
Model families require distinct safeguards because their rate and intensity profiles do not align.
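The non-monotonic pattern mentioned in this list can be checked mechanically from a model's five per-level rates. The sketch below is illustrative only: the function names are hypothetical and the example rates are invented, not taken from the paper.

```python
def peak_level(rates):
    """1-based tone level with the highest rate; ties go to the lowest level."""
    return max(range(len(rates)), key=lambda i: rates[i]) + 1


def is_monotonic(rates):
    """True if the rates never decrease or never increase across tone levels."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def peaks_at_intermediate(rates):
    """True if the maximum sits strictly between the first and last tone level."""
    return 1 < peak_level(rates) < len(rates)


# Invented example curves, one value per tone level 1..5.
humped = [0.10, 0.25, 0.40, 0.30, 0.20]
rising = [0.10, 0.20, 0.30, 0.40, 0.50]
```

A curve like `humped` is the case a neutral-only or strongest-only test would miss, since neither endpoint shows the worst behavior.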

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety evaluations for deployed VLMs should routinely include graded tone tests instead of relying solely on neutral prompts.
Training methods could add exposure to coercive phrasing to build resistance to tone-induced fabrication.
The negative-ground-truth construction approach could be extended to test tone effects in other modalities or closed models.
Applications that need high reliability might filter user prompts to reduce unnecessary directive force.

Load-bearing premise

That the GPT-4o-mini judge scores fabrication confidence and specificity without its own systematic biases, and that the generated images stay strictly negative-ground-truth for the queried target after the prompt variations.

What would settle it

Re-scoring the same model responses with human raters or a different judge model and obtaining large systematic shifts in the H-Scores, or manual inspection revealing that some images actually contain the queried element.
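One simple way to quantify a "systematic shift" between two sets of scores for the same responses is the mean signed and mean absolute difference, broken out by tone level. This is a sketch under stated assumptions: the record layout and the numbers are hypothetical, and the source does not say which statistic would be used.

```python
from collections import defaultdict
from statistics import mean


def score_shift(scores_a, scores_b):
    """Return (mean signed shift, mean absolute shift) of rater B relative to rater A."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return mean(diffs), mean([abs(d) for d in diffs])


def shift_by_tone(records):
    """records: iterable of (tone_level, score_a, score_b). Returns {tone_level: (signed, absolute)}."""
    grouped = defaultdict(lambda: ([], []))
    for tone, a, b in records:
        grouped[tone][0].append(a)
        grouped[tone][1].append(b)
    return {tone: score_shift(a, b) for tone, (a, b) in sorted(grouped.items())}


# Invented scores on the 1-5 scale for the same five responses.
original_judge = [2, 3, 4, 5, 1]
second_rater = [3, 3, 5, 5, 2]
signed, absolute = score_shift(original_judge, second_rater)

per_tone = shift_by_tone([(1, 2, 2), (1, 3, 3), (5, 4, 2), (5, 5, 3)])
```

A signed shift that grows with tone level, as in the toy `per_tone` result, would suggest the judge itself reacts to prompt pressure, which is the failure mode the premise above worries about.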

Figures

Figures reproduced from arXiv: 2604.18803 by Bingyu Shen, Boyang Li, Daehan Kwak, Kuan Huang, Meng Xu, Miles Q. Li, Tejaswi Dhandu, Umamaheswara Rao Tida, Weihao Hong, Xinlei Guan, Zhiyuan Jiang.

**Figure 1.** Overview of our framework. The Ghost-100 benchmark comprises 800 synthetically generated images across 8 categories and …

**Figure 2.** Core workflow of the framework. Stage 1 performs multi-dimensional quality screening to filter generated images. Stage 2 …

**Figure 3.** Illustration of the five prompt tone levels for a text-illegibility example, showing how increasing directive pressure can shift …

**Figure 4.** Prompt-tone effects on the two evaluation tracks. Left: tone-wise hallucination rate (H-Rate). Right: tone-wise hallucination …

**Figure 5.** Stage 2 pass rates for Design Compliance Verification by …

Original abstract

Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families: text-illegibility, time-reading, and object-absence, each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels: patterns that aggregate metrics obscure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Solid controlled benchmark for tone effects on VLM hallucinations, but the GPT-4o-mini judge lacks calibration and could be driving the reported dissociations and non-monotonic patterns.


The paper's real contribution is the Ghost-100 setup: 800 synthetic images with guaranteed negative ground truth, paired with five fixed prompts that ramp up directive force while holding everything else constant. That isolates tone as the variable in a way most hallucination benchmarks do not. They also split into rule-based H-Rate (simple commitment count) and LLM-judged H-Score (fabrication confidence and specificity), and they report that these two metrics move differently across models, that reading-style and presence-detection tasks react differently to pressure, and that some models peak in hallucination at intermediate tone levels rather than the strongest prompts. The automated validation hitting 717/800 is a practical plus and shows they took image quality seriously. The 5-Level Prompt Intensity Framework itself looks reusable for other work.

The soft spot is exactly where the stress-test flagged: H-Score depends on GPT-4o-mini with no human calibration, no inter-judge stats, and no ablation against other judges. If the judge itself is sensitive to the same tone variations or has its own fabrication biases, the dissociation and non-monotonic claims become hard to trust. H-Rate is more robust because it is rule-based, but the qualitative differences and peak-at-intermediate findings rest on the scored track. The abstract does not mention any of those checks, so the empirical patterns need that verification before they can be taken as settled VLM behavior.

This is the kind of paper that belongs in a reading group focused on evaluation methods. It gives people a concrete way to test prompt sensitivity that prior binary hallucination suites miss. I would not cite the specific model rankings or non-monotonic results yet, but the benchmark construction and dual-metric idea are worth following. A serious editor should send it to review; the framework is useful and the data collection is transparent enough that referees can ask for the missing judge validation directly.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Ghost-100, a procedurally generated benchmark of 800 synthetic images spanning text-illegibility, time-reading, and object-absence tasks, each with guaranteed negative ground truth. It pairs every image with five prompts from a 5-Level Prompt Intensity Framework that varies only directive force while holding image and task fixed. A dual-track protocol is used: rule-based H-Rate (proportion of responses crossing into unsupported positive commitment) and GPT-4o-mini-judged H-Score (1-5 scale of fabrication confidence and specificity). Evaluation of nine open-weight VLMs shows substantial dissociation between the two metrics, qualitatively different tone responses in reading-style versus presence-detection subsets, and non-monotonic sensitivity peaking at intermediate tone levels in several models.

Significance. If the judge calibration holds, the work demonstrates that tone-induced changes in hallucination incidence and intensity can be dissociated and that aggregate metrics obscure family-specific and task-specific patterns, which has direct relevance for high-stakes VLM deployment. The procedural negative-ground-truth construction, the automated three-stage validation workflow (717/800 compliant), and the explicit separation of rule-based and judged metrics are concrete strengths that could be reused by the community.

major comments (3)

[Evaluation Protocol] Evaluation Protocol section: The headline claims of metric dissociation, subset-specific qualitative differences, and non-monotonic tone sensitivity are diagnosed via the GPT-4o-mini H-Score. No human calibration, inter-judge agreement statistics, or ablation with alternative judges (e.g., GPT-4o or Claude) are reported. Because the judge itself may exhibit tone sensitivity or fabrication biases correlated with the 5-level prompt intensity, the observed patterns could be judge artifacts rather than VLM behavior; this is load-bearing for all H-Score-dependent conclusions.
[Benchmark Validation] Benchmark Validation and Results sections: The automated workflow confirms 717/800 images as compliant, yet the manuscript does not report whether the 83 excluded images alter the reported H-Rate/H-Score trends or whether exclusion criteria correlate with particular task families or tone levels. This leaves open the possibility of selection bias in the final 717-image results.
[§2] §2 (Task Families and Subsets): The distinction between 'reading-style' and 'presence-detection' subsets is central to the claim of qualitatively different responses to prompt pressure, but the mapping from the three task families (text-illegibility, time-reading, object-absence) onto these subsets is not explicitly tabulated or justified with example prompts and expected model outputs.
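The selection-bias question in the second major comment comes down to two checks: whether exclusions cluster in particular task families, and whether a metric changes when the excluded images are dropped. The sketch below shows both on invented data; the record layouts and function names are hypothetical and nothing here reproduces the paper's numbers.

```python
from collections import Counter


def exclusion_share_by_family(images):
    """images: list of (task_family, is_compliant). Returns {family: fraction excluded}."""
    total = Counter(family for family, _ in images)
    excluded = Counter(family for family, ok in images if not ok)
    return {family: excluded[family] / total[family] for family in total}


def full_vs_compliant(records):
    """records: list of (is_compliant, committed). Returns (rate on all, rate on compliant only)."""
    everything = [committed for _, committed in records]
    kept = [committed for ok, committed in records if ok]
    if not everything or not kept:
        raise ValueError("need at least one record in each set")
    return sum(everything) / len(everything), sum(kept) / len(kept)


# Invented toy data.
images = [
    ("text-illegibility", True), ("text-illegibility", False),
    ("time-reading", True), ("time-reading", True),
    ("object-absence", True), ("object-absence", True),
]
shares = exclusion_share_by_family(images)

records = [(True, True), (True, False), (False, True), (True, False)]
rate_all, rate_kept = full_vs_compliant(records)
```

If the shares are uneven across families, or the two rates differ by more than run-to-run noise, the filtered results would need to be reported alongside the unfiltered ones.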

minor comments (2)

[Abstract] The abstract states that 'several models exhibit non-monotonic sensitivity' but does not specify which models or provide per-model H-Score curves; a supplementary figure or table would make the claim verifiable.
[Prompt Framework] The prompt templates for the five intensity levels are described at a high level; releasing the exact wording of each level (or at least one full example per task family) would allow replication and independent judge-prompt auditing.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments identify important areas for clarification and strengthening, particularly around the evaluation protocol, benchmark validation, and task subset definitions. We address each major comment point by point below. Revisions have been incorporated where they directly improve the manuscript without altering its core contributions.


Referee: [Evaluation Protocol] Evaluation Protocol section: The headline claims of metric dissociation, subset-specific qualitative differences, and non-monotonic tone sensitivity are diagnosed via the GPT-4o-mini H-Score. No human calibration, inter-judge agreement statistics, or ablation with alternative judges (e.g., GPT-4o or Claude) are reported. Because the judge itself may exhibit tone sensitivity or fabrication biases correlated with the 5-level prompt intensity, the observed patterns could be judge artifacts rather than VLM behavior; this is load-bearing for all H-Score-dependent conclusions.

Authors: We agree that the lack of reported human calibration and inter-judge agreement for the GPT-4o-mini H-Score represents a genuine limitation for claims relying on H-Score. The rule-based H-Rate metric is independent of the judge and already supports the dissociation findings across models. For the H-Score-dependent patterns, we have added to the revised Evaluation Protocol section: (1) inter-judge agreement statistics between GPT-4o-mini and GPT-4o on a random sample of 200 responses (Cohen's kappa = 0.78), and (2) a brief ablation showing that the non-monotonic and subset-specific patterns persist under the alternative judge. A full human calibration study was outside the scope of the current work due to resource constraints; we now explicitly flag this as a limitation and recommend it for follow-up research. This constitutes a partial revision that mitigates but does not fully eliminate the concern. revision: partial
Referee: [Benchmark Validation] Benchmark Validation and Results sections: The automated workflow confirms 717/800 images as compliant, yet the manuscript does not report whether the 83 excluded images alter the reported H-Rate/H-Score trends or whether exclusion criteria correlate with particular task families or tone levels. This leaves open the possibility of selection bias in the final 717-image results.

Authors: We have conducted a post-hoc analysis comparing results on the full set of 800 images versus the 717 compliant images. The core patterns (metric dissociation, qualitative differences between subsets, and non-monotonic tone sensitivity) remain qualitatively unchanged, with only small quantitative differences in absolute rates. We have added this comparison, along with a breakdown showing the distribution of the 83 excluded images across the three task families and five tone levels, to the revised Benchmark Validation section. The exclusion criteria (failure to meet negative-ground-truth guarantees) show no systematic correlation with tone intensity. This analysis is now reported explicitly. revision: yes
Referee: [§2] §2 (Task Families and Subsets): The distinction between 'reading-style' and 'presence-detection' subsets is central to the claim of qualitatively different responses to prompt pressure, but the mapping from the three task families (text-illegibility, time-reading, object-absence) onto these subsets is not explicitly tabulated or justified with example prompts and expected model outputs.

Authors: We appreciate this observation, as the subset distinction is indeed central to our findings. In the revised §2, we have inserted a new table that explicitly maps each task family to its subset category (reading-style: text-illegibility and time-reading; presence-detection: object-absence), provides one representative prompt per intensity level for each family, and states the expected model output under the negative-ground-truth principle. This addition clarifies the grouping rationale and supports the reported qualitative differences without changing any experimental results. revision: yes
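The first response cites Cohen's kappa as its inter-judge agreement statistic. For readers unfamiliar with it, here is a minimal unweighted implementation. Whether the simulated rebuttal's figure is unweighted or weighted is not stated, so the unweighted form is an assumption, and the example labels are invented and do not reproduce the 0.78 value.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labelled independently with their own marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented 1-5 style ratings from two judges on the same six responses.
judge_a = [1, 1, 2, 2, 3, 3]
judge_b = [1, 1, 2, 3, 3, 3]
kappa = cohens_kappa(judge_a, judge_b)
```

Because H-Score is an ordinal 1-5 scale, a weighted kappa that penalizes a 1-versus-5 disagreement more than a 3-versus-4 one may be the more informative choice; the unweighted form treats all disagreements alike.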

standing simulated objections not resolved

A comprehensive human-subject calibration of the GPT-4o-mini judge across all nine models, all tone levels, and all task families would require substantial new annotation effort beyond what is feasible in a revision; we have addressed the concern through model-based ablations and explicit limitation statements instead.

Circularity Check

0 steps flagged

No circularity: benchmark definitions and evaluation protocol are independent of reported outcomes


The paper constructs Ghost-100 as a procedurally generated benchmark with negative-ground-truth images and a 5-level prompt intensity framework, then defines H-Rate (rule-based proportion of unsupported commitments) and H-Score (GPT-4o-mini judged 1-5 fabrication scale) explicitly as evaluation metrics. No equations, fitted parameters, or derivations are present that reduce any claimed result to its own inputs by construction. The reported dissociations, subset differences, and non-monotonic patterns are empirical observations from applying these independently defined metrics to nine VLMs; the metric definitions do not presuppose or encode those patterns. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The automated validation workflow (717/800 compliant) further separates construction from measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The work rests on the domain assumption that synthetic negative-ground-truth images can be generated reliably and that an LLM judge can serve as a proxy for human assessment of hallucination intensity; no free parameters or invented physical entities are introduced.

axioms (2)

domain assumption: GPT-4o-mini can accurately rate hallucination confidence and specificity on a 1-5 scale. Invoked in the dual-track evaluation protocol description.
domain assumption: Procedural image generation guarantees negative ground truth for the queried targets. Stated in the benchmark construction paragraph.

invented entities (3)

Ghost-100 benchmark (no independent evidence). Purpose: controlled test set isolating prompt tone effects on VLM hallucinations. Newly constructed dataset of 800 images across eight categories.
H-Rate metric (no independent evidence). Purpose: quantify proportion of responses crossing into unsupported positive commitment. Rule-based measure defined for the evaluation.
H-Score metric (no independent evidence). Purpose: characterize confidence and specificity of fabrication via LLM judge. 1-5 scale judged by GPT-4o-mini.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

PolarVLM integrates polarimetric physical parameters into VLMs via dual-stream architecture and progressive training, outperforming RGB baselines by 25.4% on a new 75K-pair polarization-aware VQA benchmark.
