ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models

Brian Jalaian; Chung-En Johnny Yu; Nathaniel D. Bastian

arXiv: 2509.15435 · v3 · pith:3Y4HIQKFnew · submitted 2025-09-18 · 💻 cs.CV · cs.AI · cs.MA

Pith reviewed 2026-05-21 21:20 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.MA

keywords: hallucination mitigation · adversarial robustness · vision language models · agentic reasoning · inference time improvement · object hallucination · multimodal reliability · cross model validation

The pith

ORCA's Observe-Reason-Critique-Act loop with small vision models cuts hallucinations and adds adversarial robustness to large vision-language models at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ORCA to address hallucinations and adversarial vulnerabilities in pretrained large vision-language models. It does this by running an iterative reasoning process that consults several small vision models to check and correct the large model's outputs on factual questions about images. A sympathetic reader would care because current LVLMs often produce unreliable answers in practice, and this method improves performance on standard benchmarks without needing to change or retrain the underlying models. If the approach holds, it suggests a practical way to make multimodal AI systems more dependable for applications where errors matter.

Core claim

ORCA's agentic framework improves standalone LVLM performance on the POPE hallucination benchmark by +3.64% to +40.67%, depending on the subset. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. The framework uses an Observe-Reason-Critique-Act loop with small vision models to validate cross-model inconsistencies without accessing model internals or retraining.

What carries the argument

The Observe-Reason-Critique-Act loop that queries multiple small vision models with evidential questions to detect and resolve inconsistencies in the large model's responses.
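
To make the mechanism concrete, here is a minimal Python sketch of a loop of this shape. It is not the authors' implementation: the mapping of code steps to Observe, Reason, Critique, and Act, the binary yes/no answers, the strict-majority flip rule, and every name (`orca_like_loop`, `Tool`, `Trace`, `max_rounds`) are assumptions made for illustration. The models are stand-in callables, not real vision models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: (image, question) -> "yes" or "no".
Tool = Callable[[str, str], str]


@dataclass
class Trace:
    """Intermediate steps kept so a decision can be audited later."""
    steps: List[dict] = field(default_factory=list)


def orca_like_loop(image: str, question: str, lvlm: Tool,
                   tools: Dict[str, Tool], max_rounds: int = 3) -> Tuple[str, Trace]:
    trace = Trace()
    # Observe: take the large model's initial answer.
    answer = lvlm(image, question)
    for round_idx in range(max_rounds):
        # Reason: put the same evidential question to every small model.
        votes = {name: tool(image, question) for name, tool in tools.items()}
        # Critique: collect the tools that contradict the current answer.
        dissent = [name for name, vote in votes.items() if vote != answer]
        trace.steps.append({"round": round_idx, "answer": answer,
                            "votes": votes, "dissent": dissent})
        # Act: keep the answer unless a strict majority of tools disagree
        # (assumed rule; the real trigger is not specified in this review).
        if len(dissent) * 2 <= len(votes):
            break
        answer = "no" if answer == "yes" else "yes"
    return answer, trace


if __name__ == "__main__":
    stub_lvlm = lambda image, question: "yes"
    stub_tools = {
        "detector": lambda image, question: "no",
        "captioner": lambda image, question: "no",
        "vqa": lambda image, question: "yes",
    }
    final, audit = orca_like_loop("img.png", "Is there a dog?", stub_lvlm, stub_tools)
    print(final, len(audit.steps))
```

In this toy run two of three stub tools contradict the large model, so the answer is flipped in the first round and confirmed in the second. A real system would ask different evidential questions per round rather than repeating one.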

If this is right

LVLMs show measurable accuracy gains on hallucination detection tasks like POPE without any model changes.
The same process yields an average accuracy gain of +20.11% on POPE when images are adversarially perturbed.
Combining ORCA with existing defense methods yields additional improvements on perturbed images from the AMBER benchmark.
Intermediate reasoning traces are stored, enabling auditable and explainable decisions.
The method applies to multiple different large vision-language models across tested settings.
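
As an illustration of what a stored reasoning trace might look like, the sketch below serializes loop steps as JSON Lines so that each decision can be replayed later. The record fields (`round_idx`, `question`, `tool_answers`, `decision`) and the helper `dump_trace` are hypothetical; the trace format the paper actually uses is not described on this page.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class TraceStep:
    """One audited step of the loop (all field names are assumptions)."""
    round_idx: int
    question: str
    tool_answers: Dict[str, str]
    decision: str


def dump_trace(steps: List[TraceStep]) -> str:
    """Serialize the steps as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(step), sort_keys=True) for step in steps)


if __name__ == "__main__":
    steps = [
        TraceStep(0, "Is there a dog?", {"detector": "no", "vqa": "no"}, "revise"),
        TraceStep(1, "Is there a dog?", {"detector": "no", "vqa": "no"}, "accept"),
    ]
    print(dump_trace(steps))
```

An append-only, line-per-step format keeps the log easy to diff and to filter by round, which is one reasonable way to support the auditability claim.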

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar agentic loops could be tested on other types of hallucinations, such as those involving relations or attributes rather than just objects.
Deploying small models for validation might allow lighter overall systems if they reliably correct larger ones.
Extending the framework to video or other sequential data could address temporal hallucinations in dynamic scenes.
Users could experiment with different suites of small vision models to optimize for specific domains like medical imaging.

Load-bearing premise

The reported performance gains come specifically from the structured Observe-Reason-Critique-Act loop and inconsistency checks with small vision models, and not from other unmentioned factors like prompt engineering or benchmark tuning.

What would settle it

Run the large models with simple repeated prompting or random small-model queries on the same POPE and AMBER adversarial sets. If accuracy does not differ significantly from the full ORCA loop, the gains cannot be credited to the loop itself.
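
One simple control of this kind is a single-pass majority vote over the same models, with no iteration and no critique step. The sketch below shows such a baseline and an accuracy helper. The function names, the yes/no answer format, and the tie-breaking rule are assumptions made for illustration, not a procedure taken from the paper.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical model signature: (image, question) -> "yes" or "no".
Answerer = Callable[[str, str], str]


def majority_vote_control(image: str, question: str,
                          models: Sequence[Answerer]) -> str:
    """Non-iterative control: query each model once and take the majority."""
    votes = Counter(model(image, question) for model in models)
    # Ties resolve to "no"; an arbitrary choice for this sketch.
    return "yes" if votes["yes"] > votes["no"] else "no"


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


if __name__ == "__main__":
    say_yes = lambda image, question: "yes"
    say_no = lambda image, question: "no"
    print(majority_vote_control("img.png", "Is there a cat?", [say_yes, say_yes, say_no]))
    print(accuracy(["yes", "no"], ["yes", "yes"]))
```

Scoring this baseline and the full loop with the same `accuracy` function on the same questions would show how much of the gain survives once iteration is removed.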

Figures

Figures reproduced from arXiv: 2509.15435 by Brian Jalaian, Chung-En Johnny Yu, Nathaniel D. Bastian.

**Figure 1.** Adversarial perturbations can cause LVLMs to assert nonexistent objects injected by an attacker, and LVLMs …

**Figure 2.** Overview of the ORCA framework. ORCA operates via an Observe–Reason–Critique–Act loop over a …

**Figure 3.** ORCA corrects false predictions from standalone LVLMs by querying diverse vision models and resolving …

**Figure 4.** Average accuracy across the three subsets of POPE, comparing standalone LVLMs and ORCA-augmented …

**Figure 5.** Performance comparison under two attack settings. Each vertex represents one LVLM, and the score reflects …

Original abstract

Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through inference-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLMs performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ORCA adds an inference-time Observe-Reason-Critique-Act loop with small vision models to cut LVLM hallucinations and gain some adversarial robustness, with benchmark gains that still need ablations to confirm the loop itself drives them.

ORCA is an inference-time framework that wraps pretrained vision-language models with a four-step loop using small vision models under 3B parameters. The loop queries images for evidence, checks inconsistencies across the small models, and refines the answer without any retraining or access to the large model's internals. It also keeps the intermediate steps for later audit. The paper reports accuracy gains on POPE of +3.64% to +40.67% on clean data and an average gain of +20.11% under adversarial perturbations, with further gains on defended AMBER images. The emergence of robustness without targeted adversarial training is a useful side effect, and the auditable traces are a practical feature for deployment settings where explanations matter.

Referee Report

2 major / 2 minor

Summary. The paper proposes ORCA, an agentic reasoning framework for Large Vision-Language Models (LVLMs) that employs an Observe-Reason-Critique-Act loop with small vision models (<3B parameters) to mitigate object-level hallucinations and improve adversarial robustness at inference time, without retraining or access to model internals. It reports empirical gains on the POPE benchmark (+3.64% to +40.67% on clean images, average +20.11% under adversarial perturbations) and further improvements on adversarially perturbed AMBER images when combined with defenses (+1.20% to +48.00% across metrics), while storing reasoning traces for auditability.

Significance. If the performance gains can be attributed specifically to the structured agentic loop and cross-model inconsistency checks, ORCA would represent a practical inference-time method for enhancing reliability of existing LVLMs. The emergent adversarial robustness without dedicated training and the emphasis on auditable traces are notable features that could support broader adoption in safety-critical multimodal applications.

major comments (2)

[§5] §5 (Experiments and Results): The central claims of accuracy gains on POPE and AMBER are presented without ablation studies isolating the contribution of the full Observe-Reason-Critique-Act loop and inconsistency validation from simpler controls, such as equivalent numbers of independent queries to the small vision models or non-iterative multi-prompt baselines. This is load-bearing for the claim that the agentic mechanism itself drives the reported improvements (+3.64% to +40.67%, +20.11% under attack).
[§3] §3 (ORCA Framework): The description of cross-model inconsistency validation lacks a precise definition of the inconsistency metric, the threshold for triggering the Act step, and the exact suite of small vision models employed. Without these, it is not possible to verify that the gains arise from the proposed structure rather than unstated implementation choices or benchmark-specific tuning.

minor comments (2)

[Abstract and §4] The abstract and §4 should explicitly name the small vision models used and provide pseudocode or a clear algorithmic outline for the iterative loop to improve reproducibility.
[§5] No statistical significance tests, standard deviations, or details on the number of runs are reported for the percentage gains; adding these would strengthen the empirical presentation.
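
To illustrate the kind of uncertainty estimate the last minor comment asks for, the sketch below computes a paired bootstrap interval for an accuracy gain from per-question correctness. It assumes 0/1 correctness lists for the standalone model and the ORCA-augmented model on the same questions; that data layout, the function name `paired_bootstrap_gain`, and the resampling settings are hypothetical and not taken from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gain(base_correct: List[int], orca_correct: List[int],
                          n_resamples: int = 2000, seed: int = 0
                          ) -> Tuple[float, float, float]:
    """Return (mean gain, lower bound, upper bound) for the accuracy difference.

    Inputs are 0/1 correctness flags for the same questions in the same order.
    Bounds are the approximate 2.5th and 97.5th percentiles of resampled gains.
    """
    if len(base_correct) != len(orca_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(base_correct)
    # Paired design: resample per-question differences, not each arm separately.
    diffs = [o - b for b, o in zip(base_correct, orca_correct)]
    gains = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gains.append(sum(sample) / n)
    gains.sort()
    lower = gains[int(0.025 * n_resamples)]
    upper = gains[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    base = [1, 0, 1, 0, 0, 1, 0, 1]   # toy data, not results from the paper
    orca = [1, 1, 1, 0, 1, 1, 0, 1]
    print(paired_bootstrap_gain(base, orca))
```

An interval that excludes zero would indicate that the gain is unlikely to be resampling noise on that question set. Variation across repeated runs would still need to be reported separately.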

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [§5] §5 (Experiments and Results): The central claims of accuracy gains on POPE and AMBER are presented without ablation studies isolating the contribution of the full Observe-Reason-Critique-Act loop and inconsistency validation from simpler controls, such as equivalent numbers of independent queries to the small vision models or non-iterative multi-prompt baselines. This is load-bearing for the claim that the agentic mechanism itself drives the reported improvements (+3.64% to +40.67%, +20.11% under attack).

Authors: We agree that dedicated ablation studies are necessary to rigorously isolate the contribution of the structured Observe-Reason-Critique-Act loop and cross-model inconsistency checks. While the current experiments include comparisons of ORCA against standalone LVLMs across clean, adversarially perturbed, and defended settings, these do not explicitly control for equivalent numbers of independent queries or non-iterative multi-prompt baselines. We will add these ablations in the revised manuscript, including direct comparisons to simpler multi-query and non-iterative prompting strategies using the same small vision models. This will provide stronger evidence that the agentic mechanism drives the observed gains.
revision: yes
Referee: [§3] §3 (ORCA Framework): The description of cross-model inconsistency validation lacks a precise definition of the inconsistency metric, the threshold for triggering the Act step, and the exact suite of small vision models employed. Without these, it is not possible to verify that the gains arise from the proposed structure rather than unstated implementation choices or benchmark-specific tuning.

Authors: We thank the referee for identifying this gap in reproducibility. Section 3 describes the Observe-Reason-Critique-Act loop and the use of cross-model inconsistency validation, but we acknowledge that the inconsistency metric, triggering threshold, and specific model suite require more precise specification. In the revised version, we will explicitly define the inconsistency metric (as the rate of output disagreement across models on the same evidential query), state the threshold value used to initiate the Act step, and enumerate the exact small vision models employed (all under 3B parameters). These additions will clarify that the reported improvements stem from the proposed framework rather than hidden implementation details.
revision: yes
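
The rebuttal's proposed metric, the rate of output disagreement across models on the same evidential query, admits more than one reading. The sketch below takes one of them: the fraction of model pairs whose outputs differ. That pairwise reading, the default threshold of 0.5, and the names `disagreement_rate` and `should_act` are assumptions for illustration; the value actually used is not stated on this page.

```python
from itertools import combinations
from typing import Sequence


def disagreement_rate(outputs: Sequence[str]) -> float:
    """Fraction of model pairs whose outputs differ on one evidential query."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        # Fewer than two models: no pair can disagree.
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def should_act(outputs: Sequence[str], threshold: float = 0.5) -> bool:
    """Hypothetical trigger for the Act step: disagreement above the threshold."""
    return disagreement_rate(outputs) > threshold


if __name__ == "__main__":
    answers = ["yes", "yes", "no"]    # toy outputs from three small models
    print(disagreement_rate(answers), should_act(answers))
```

With three models and one dissenter, two of the three pairs differ, so the rate is 2/3. An alternative reading, the share of models that depart from the majority answer, would give 1/3 for the same outputs, which is why the paper needs to state its definition precisely.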

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

The paper introduces the ORCA agentic framework (Observe-Reason-Critique-Act loop with small vision models for inconsistency validation) and reports accuracy improvements on the standard external POPE and AMBER benchmarks under clean and adversarial settings. No equations, fitted parameters, or self-referential definitions appear in the provided text; performance deltas (+3.64% to +40.67% on POPE, etc.) are presented as direct experimental outcomes rather than quantities derived by construction from the method itself. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are invoked to justify core claims. The evaluation relies on independently verifiable benchmarks outside the paper's own definitions, satisfying the criteria for a self-contained empirical result with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework depends on the assumption that small vision models can supply useful evidential signals for cross-validation and that iterative critique improves outputs; these are domain assumptions rather than derived results.

axioms (1)

domain assumption: Small vision models under 3B parameters can generate reliable answers to targeted evidential questions about images that help detect and correct inconsistencies in LVLM outputs.
Invoked in the Observe and Critique steps of the loop as the basis for validation without LVLM internals.

invented entities (1)

ORCA Observe-Reason-Critique-Act loop (no independent evidence)
purpose: To structure inference-time reasoning for hallucination mitigation and robustness
Newly introduced procedural entity whose effectiveness is demonstrated only through the paper's own experiments.

pith-pipeline@v0.9.0 · 5841 in / 1564 out tokens · 96334 ms · 2026-05-21T21:20:44.363566+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "ORCA operates via an Observe–Reason–Critique–Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "gains ranging from +3.64% to +40.67% across different subsets on the POPE hallucination benchmark"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[2] [2]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

work page 2022

[3] [3]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucina- tion of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Adversarial illusions in {Multi-Modal} embeddings

Eugene Bagdasaryan, Rishi Jha, Vitaly Shmatikov, and Tingwei Zhang. Adversarial illusions in {Multi-Modal} embeddings. In33rd USENIX Security Symposium (USENIX Security 24), pages 3009–3025, 2024

work page 2024

[5] [5]

On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models.Advances in Neural Information Processing Systems, 36:54111–54138, 2023

work page 2023

[6] [6]

Doubly-universal adversarial perturbations: Deceiving vision-language models across both images and text with a single perturbation.arXiv preprint arXiv:2412.08108, 2024

Hee-Seon Kim, Minbeom Kim, and Changick Kim. Doubly-universal adversarial perturbations: Deceiving vision-language models across both images and text with a single perturbation.arXiv preprint arXiv:2412.08108, 2024

work page arXiv 2024

[7] [7]

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models.arXiv preprint arXiv:2402.00253, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

A survey of attacks on large vision- language models: Resources, advances, and future trends

Daizong Liu, Mingyu Yang, Xiaoye Qu, Pan Zhou, Yu Cheng, and Wei Hu. A survey of attacks on large vision-language models: Resources, advances, and future trends.arXiv preprint arXiv:2407.07403, 2024

work page arXiv 2024

[9]

Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023

[10]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[11]

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024

[12]

Junfei Wu, Qiang Liu, Ding Wang, Jinghao Zhang, Shu Wu, Liang Wang, and Tieniu Tan. Logical closed loop: Uncovering object hallucinations in large vision-language models. arXiv preprint arXiv:2402.11622, 2024

[13]

Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, and Tat-Seng Chua. Combating multimodal LLM hallucination via bottom-up holistic reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8460–8468, 2025

[14]

Pete Janowczyk, Linda Laurier, Ave Giulietta, Arlo Octavia, and Meade Cleti. Seeing is deceiving: Exploitation of visual pathways in multi-modal language models. arXiv preprint arXiv:2411.05056, 2024

[15]

Wanqi Zhou, Shuanghao Bai, Danilo P Mandic, Qibin Zhao, and Badong Chen. Revisiting the adversarial robustness of vision language models: a multimodal perspective. arXiv preprint arXiv:2404.19287, 2024

[16]

Lin Li, Haoyan Guan, Jianing Qiu, and Michael Spratling. One prompt word is enough to boost adversarial robustness for pre-trained vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24408–24419, 2024

[17]

Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel M Roy. A study of the effect of JPG compression on adversarial images. arXiv preprint arXiv:1608.00853, 2016

[18]

Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017

[19]

Qi Guo, Shanmin Pang, Xiaojun Jia, Yang Liu, and Qing Guo. Efficient generation of targeted and transferable adversarial examples for vision-language models via diffusion models. IEEE Transactions on Information Forensics and Security, 2024

[20]

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

[21]

Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, et al. Critic-V: VLM critics help catch VLM errors in multimodal reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9050–9061, 2025

[22]

Xiaoye Qu, Qiyuan Chen, Wei Wei, Jiashuo Sun, Daizong Liu, and Jianfeng Dong. Alleviating hallucination in large vision-language models with active retrieval augmentation. ACM Transactions on Multimedia Computing, Communications and Applications, 2024

[23]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

[24]

Zijing Liang, Yanjie Xu, Yifan Hong, Penghui Shang, Qi Wang, Qiang Fu, and Ke Liu. A survey of multimodel large language models. In Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering, pages 405–409, 2024

[25]

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023

[26]

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023

[27]

Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023

[28]

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

[29]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

[30]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

[31]

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020

[32]

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024

[33]

Meta AI. Llama 3.2-11B Vision. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision, 2025. Accessed: 2025-03-07

[34]

Christian Schlarmann, Naman Deep Singh, Francesco Croce, and Matthias Hein. Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. arXiv preprint arXiv:2402.12336, 2024

A Additional Results

Table 7 an...