pith. machine review for the scientific record.

arXiv: 2509.15435 · v2 · submitted 2025-09-18 · cs.CV · cs.AI · cs.MA

ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models

Pith reviewed 2026-05-18 15:19 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.MA
keywords hallucination mitigation · adversarial robustness · vision language models · agentic reasoning · inference time methods · multimodal models · object detection verification

The pith

ORCA uses an Observe-Reason-Critique-Act loop with small vision models to reduce object hallucinations and boost adversarial robustness in large vision-language models at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ORCA as a framework that enhances pretrained large vision-language models by querying multiple small vision models with evidential questions and validating inconsistencies through an iterative loop. This method corrects hallucinations and provides robustness against adversarial attacks without needing internal access or retraining. Results on the POPE benchmark show accuracy improvements ranging from 3.64% to 40.67% on clean images and an average gain of 20.11% under perturbations. Readers should care because it suggests a lightweight way to make multimodal AI more dependable for practical applications by leveraging external tools for verification.

Core claim

ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. It also stores intermediate reasoning traces to support auditable decision-making and shows gains on hallucination benchmarks as well as under adversarial conditions.
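
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Tool` interface (a callable that takes a question and returns yes/no), the question templates, the unanimity rule used in the critique step, the `max_rounds` default, and the mapping of code steps to the four stage names are all hypothetical. The paper's actual thresholds and templates are not given in this review.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool interface: an evidential question in, a yes/no answer out.
Tool = Callable[[str], bool]


@dataclass
class Trace:
    """Stores intermediate steps so a decision can be audited afterwards."""
    steps: List[dict] = field(default_factory=list)

    def log(self, stage: str, **data) -> None:
        self.steps.append({"stage": stage, **data})


def orca_loop(obj: str, lvlm_answer: bool, tools: Dict[str, Tool],
              max_rounds: int = 2) -> Tuple[bool, Trace]:
    """Refine an LVLM's object-presence answer using small vision tools."""
    trace = Trace()
    answer = lvlm_answer
    trace.log("init", answer=answer)
    if not tools:
        return answer, trace  # nothing to check against

    # Assumed question templates, one per round.
    templates = ["Is there a {obj} in the image?",
                 "Can you locate a {obj} in the image?"]

    for round_idx in range(max_rounds):
        question = templates[round_idx % len(templates)].format(obj=obj)

        # Observe: query every tool with the evidential question.
        votes = {name: bool(tool(question)) for name, tool in tools.items()}
        trace.log("observe", round=round_idx, question=question, votes=votes)

        # Reason: tally the evidence.
        yes = sum(votes.values())
        no = len(votes) - yes

        # Critique: check whether the tools contradict each other.
        consistent = yes == 0 or no == 0
        trace.log("critique", round=round_idx, yes=yes, no=no,
                  consistent=consistent)

        if consistent:
            # Act: adopt the unanimous tool answer and stop.
            answer = yes > 0
            trace.log("act", round=round_idx, answer=answer)
            break
        # Otherwise loop again with the next question template.

    # If the tools never agree, the original LVLM answer is kept.
    return answer, trace


# Toy usage with stand-in tools that always answer "no".
final, audit = orca_loop("dog", True,
                         {"detector": lambda q: False, "vqa": lambda q: False})
```

In this sketch the trace is a plain list of dictionaries, so it can be dumped to JSON for later inspection.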

What carries the argument

The Observe-Reason-Critique-Act loop coordinating evidential queries and cross-model inconsistency validation across small vision models.

If this is right

  • ORCA raises standalone LVLM performance by 3.64% to 40.67% on POPE hallucination subsets.
  • Under adversarial perturbations on POPE, it delivers an average accuracy gain of 20.11% across LVLMs.
  • Combined with defense techniques on perturbed AMBER images, it yields further gains from 1.20% to 48.00% across metrics.
  • Intermediate reasoning traces enable auditable and interpretable decisions in multimodal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agentic loops could be adapted for other types of errors in vision-language models beyond object-level hallucinations.
  • The approach might reduce the need for large-scale adversarial training by providing inference-time corrections.
  • Storing traces could facilitate regulatory compliance or debugging in high-stakes deployments.

Load-bearing premise

That inconsistencies between answers from multiple small vision models queried with evidential questions can reliably detect and correct hallucinations in large vision-language models.
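
One way to see what this premise requires: a consistency check can only flag cases where some model disagrees. The sketch below uses a hypothetical `consistency_report` helper with made-up tool names and answers. It shows that a disagreement is flagged, while an answer shared by every model is never flagged, whether or not that shared answer is true.

```python
from typing import Dict


def consistency_report(lvlm_answer: bool, tool_answers: Dict[str, bool]) -> dict:
    """Summarize agreement between an LVLM answer and small-model answers."""
    if not tool_answers:
        raise ValueError("need at least one tool answer")
    yes = sum(tool_answers.values())
    # Fraction of tools that sit on the majority side.
    agreement = max(yes, len(tool_answers) - yes) / len(tool_answers)
    conflicts = [name for name, ans in tool_answers.items()
                 if ans != lvlm_answer]
    return {
        "agreement": agreement,
        "tools_unanimous": agreement == 1.0,
        "conflicts_with_lvlm": conflicts,
        "flagged": bool(conflicts),
    }


# Case 1: the LVLM says "yes" and two of three tools say "no".
r1 = consistency_report(True, {"detector": False, "captioner": False, "vqa": True})

# Case 2: every model gives the same answer, so nothing is flagged.
# If that shared answer is wrong, this check cannot detect it.
r2 = consistency_report(True, {"detector": True, "captioner": True, "vqa": True})
```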

What would settle it

Demonstrating that ORCA provides no improvement or even decreases accuracy on a held-out hallucination dataset or a novel adversarial attack strategy targeting non-object elements.

Figures

Figures reproduced from arXiv: 2509.15435 by Brian Jalaian, Chung-En Johnny Yu, Nathaniel D. Bastian.

Figure 1: Adversarial perturbations can cause LVLMs to assert nonexistent objects injected by an attacker, and LVLMs …
Figure 2: Overview of the ORCA framework. ORCA operates via an Observe–Reason–Critique–Act loop over a …
Figure 3: ORCA corrects false predictions from standalone LVLMs by querying diverse vision models and resolving …
Figure 4: Average accuracy across the three subsets of POPE, comparing standalone LVLMs and ORCA-augmented …
Figure 5: Performance comparison under two attack settings. Each vertex represents one LVLM, and the score reflects …
Original abstract

Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through inference-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLMs performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ORCA, an agentic reasoning framework that applies an Observe-Reason-Critique-Act loop at inference time to mitigate object-level hallucinations and improve adversarial robustness in pretrained Large Vision-Language Models (LVLMs). The approach queries multiple small vision models (<3B parameters) with evidential questions, validates cross-model inconsistencies, and iteratively refines outputs without retraining or internal access; it also stores reasoning traces for auditability. Reported results include accuracy gains of +3.64% to +40.67% on POPE subsets for clean images, an average +20.11% under adversarial perturbations on POPE, and further gains of +1.20% to +48.00% when combined with defenses on perturbed AMBER images.

Significance. If the gains are attributable to the structured agentic loop rather than auxiliary model capabilities alone, the work would offer a practical training-free method for enhancing LVLM reliability. Strengths include the inference-time design, emergent robustness without adversarial training, and support for auditable traces. These elements address real deployment needs in multimodal systems.

major comments (2)
  1. [Evaluation on POPE] POPE evaluation results: The central claim attributes +3.64% to +40.67% gains (and +20.11% adversarial) to the Observe-Reason-Critique-Act loop with cross-model inconsistency validation. However, the manuscript provides no ablations comparing the full ORCA procedure against direct aggregation or majority voting over the same small vision models. Since POPE directly tests object presence detection, which the auxiliary models perform natively, the reported improvements may not demonstrate the contribution of the agentic reasoning components.
  2. [Methods / Experimental Setup] Methods and experimental setup: The abstract and results report concrete accuracy numbers but omit implementation details such as exact inconsistency thresholds, evidential question templates, number of iterations, and error bars or statistical significance for the gains. These omissions make it difficult to verify reproducibility or isolate the framework's effect.
minor comments (2)
  1. [Abstract] Abstract: Consider specifying the exact LVLMs and small vision models used in the experiments for clarity.
  2. [Results] Notation and figures: Ensure all tables reporting accuracy include standard deviations or confidence intervals to support the claimed gains.
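
The ablation requested in major comment 1 could take roughly the following form: a majority-vote baseline over the same small vision models, scored with the same accuracy function as the full pipeline. The function names, the tie-breaking rule (ties fall back to the LVLM answer), and the toy data are assumptions for illustration, not results from the paper.

```python
from typing import Sequence


def majority_vote(tool_answers: Sequence[bool], fallback: bool) -> bool:
    """Return the majority answer; on a tie, keep the fallback answer."""
    yes = sum(tool_answers)
    no = len(tool_answers) - yes
    if yes == no:
        return fallback
    return yes > no


def accuracy(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up toy data: four yes/no object-presence questions.
labels = [True, False, False, True]
lvlm_preds = [True, True, True, True]   # a model that always says "yes"
tool_votes = [
    [True, True, False],
    [False, False, True],
    [False, True, False],
    [True, False, False],
]

vote_preds = [majority_vote(votes, fallback)
              for votes, fallback in zip(tool_votes, lvlm_preds)]

baseline_acc = accuracy(lvlm_preds, labels)
vote_acc = accuracy(vote_preds, labels)
```

Reporting the full pipeline's accuracy next to `vote_acc` on the same POPE subset would separate what the loop adds from what the ensemble already provides.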

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We agree that additional ablations and implementation details will strengthen the paper's claims and reproducibility. We address each major comment below.

Point-by-point responses
  1. Referee: [Evaluation on POPE] POPE evaluation results: The central claim attributes +3.64% to +40.67% gains (and +20.11% adversarial) to the Observe-Reason-Critique-Act loop with cross-model inconsistency validation. However, the manuscript provides no ablations comparing the full ORCA procedure against direct aggregation or majority voting over the same small vision models. Since POPE directly tests object presence detection, which the auxiliary models perform natively, the reported improvements may not demonstrate the contribution of the agentic reasoning components.

    Authors: We agree that direct comparisons to simpler baselines are needed to isolate the contribution of the agentic loop. In the revised manuscript we will add ablations on POPE that compare the full Observe-Reason-Critique-Act procedure against (i) direct aggregation of the small vision model outputs and (ii) majority voting over the same models. These results will quantify the incremental benefit of inconsistency detection and iterative refinement beyond basic ensembling. revision: yes

  2. Referee: [Methods / Experimental Setup] Methods and experimental setup: The abstract and results report concrete accuracy numbers but omit implementation details such as exact inconsistency thresholds, evidential question templates, number of iterations, and error bars or statistical significance for the gains. These omissions make it difficult to verify reproducibility or isolate the framework's effect.

    Authors: We acknowledge the need for these details. The revised manuscript will add a dedicated reproducibility subsection (and appendix) specifying the inconsistency threshold, the full evidential question templates, and the iteration limit. It will also report error bars and statistical significance tests for all accuracy gains. revision: yes
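
For the error bars promised in the second response, one common choice is a paired bootstrap over evaluation items. The sketch below is an assumption about how that could be done, not the authors' procedure; `paired_bootstrap_gain`, the resample count, and the toy correctness vectors are hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_gain(base_correct: Sequence[bool],
                          new_correct: Sequence[bool],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed gain, CI low, CI high) for acc(new) - acc(base).

    Both inputs mark, per evaluation item, whether each system was correct.
    Items are resampled jointly, so the pairing between systems is kept.
    """
    n = len(base_correct)
    if n == 0 or n != len(new_correct):
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)
    # Per-item difference: +1 fixed, -1 broken, 0 unchanged.
    diffs = [int(nw) - int(b) for b, nw in zip(base_correct, new_correct)]
    observed = sum(diffs) / n

    gains = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample items with replacement
        gains.append(sum(sample) / n)
    gains.sort()

    # Percentile interval over the resampled gains.
    low = gains[int((alpha / 2) * n_resamples)]
    high = gains[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high


# Made-up toy data: 100 items, baseline 50% correct, augmented 70% correct.
base = [True] * 50 + [False] * 50
new = [True] * 70 + [False] * 30
gain, ci_low, ci_high = paired_bootstrap_gain(base, new)
```

An interval that excludes zero suggests the gain is unlikely to come from item sampling alone. It does not address variance from random seeds or from the choice of attack.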

Circularity Check

0 steps flagged

No circularity: empirical agentic framework with independent evaluation

full rationale

The paper describes ORCA as an external inference-time procedure (Observe-Reason-Critique-Act loop) that queries auxiliary small vision models and validates inconsistencies to improve LVLM outputs on POPE and adversarial settings. No equations, fitted parameters, or derivations are presented that reduce the claimed accuracy gains to the inputs by construction. The method is self-contained against external benchmarks, with reported improvements treated as empirical outcomes rather than self-referential predictions. No self-citation chains or ansatzes are invoked to justify core claims in the abstract or described framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The framework introduces a new reasoning procedure but rests on standard assumptions about small vision models being able to answer targeted visual questions reliably; no explicit free parameters or invented physical entities are stated in the abstract.

invented entities (1)
  • ORCA Observe-Reason-Critique-Act loop (no independent evidence)
    purpose: structured inference reasoning to mitigate hallucinations
    Newly proposed agentic procedure in this work

