R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
Pith reviewed 2026-05-10 01:24 UTC · model grok-4.3
The pith
Vision-language models can reduce their own object hallucinations by verifying descriptions of specific image regions they generate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By breaking the verification task into six sequential steps that begin with an initial response and end with a revised response, the method elicits coordinate generation and region descriptions from the same model and uses those descriptions to detect and remove claims of nonexistent objects.
What carries the argument
The six-step Region-aware Chain-of-Verification process that chains initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation, using the model's self-produced regions as the verification signal.
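The six steps above can be sketched as a single loop over the model's own outputs. This is a hypothetical reconstruction, not the paper's implementation: the `model.generate(image, prompt) -> str` interface and the prompt wordings are assumptions standing in for the templates given in the paper's appendix.

```python
def r_cov(model, image, query):
    # Step 1: initial response generation
    initial = model.generate(image, query)

    # Step 2: entity extraction -- list the objects the response claims
    entities = model.generate(
        image, f"List the objects mentioned in this passage: {initial}"
    ).split(", ")

    verified = {}
    for obj in entities:
        # Step 3: coordinate generation -- ask the same model to localize the object
        box = model.generate(image, f"Give the bounding box of the {obj}.")

        # Step 4: region description -- describe only that region
        region_desc = model.generate(
            image, f"Describe the region {box} of the image."
        )

        # Step 5: verification execution -- yes/no check against the description
        answer = model.generate(
            image,
            f"[Statement] {region_desc}\n"
            f"Is there a {obj} in the statement? Answer Yes or No.",
        )
        verified[obj] = answer.strip().lower().startswith("yes")

    # Step 6: final response generation -- rewrite, dropping refuted objects
    refuted = [o for o, ok in verified.items() if not ok]
    return model.generate(
        image,
        f"Revise this passage, removing these nonexistent objects "
        f"{refuted}: {initial}",
    )
```

Note that every call goes to the same LVLM: the region descriptions generated in step 4 are the only verification signal, which is exactly the load-bearing premise discussed below.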
If this is right
- The method applies to many existing vision-language models without any weight updates.
- It requires no separate object-detection model or other external components.
- Performance gains appear on multiple public benchmarks that measure object hallucinations.
- The same prompting structure can be reused across different model sizes and architectures.
Where Pith is reading between the lines
- The results imply that usable location information is not absent from current models but already encoded, and that targeted prompts can surface it.
- Similar self-chaining steps could be tested on other hallucination types such as incorrect attributes or relations.
- The approach might scale to video or multi-image inputs if the coordinate and region steps are adapted accordingly.
Load-bearing premise
The model must produce accurate coordinates and region descriptions in the intermediate steps without creating new errors that invalidate the later verification.
What would settle it
Apply the full chain to a model on a standard hallucination benchmark and measure whether the rate of invented objects in the final answer decreases relative to the initial answer, or instead stays the same or increases.
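On benchmarks that annotate which objects are actually present (as in COCO-derived metrics such as CHAIR), that comparison reduces to an invented-object rate computed before and after the chain. A minimal sketch, assuming sets of mentioned objects have already been extracted per image:

```python
def hallucination_rate(responses, ground_truth):
    """Fraction of mentioned objects that are absent from the image.

    responses:    {image_id: set of objects the model mentioned}
    ground_truth: {image_id: set of objects actually present}
    """
    mentioned = hallucinated = 0
    for img, objs in responses.items():
        mentioned += len(objs)
        # Objects mentioned but not present are counted as hallucinated
        hallucinated += len(objs - ground_truth[img])
    return hallucinated / mentioned if mentioned else 0.0
```

Running this on the initial and final responses of the same images settles the question either way: the claim holds only if the final rate is lower.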
Original abstract
Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Region-aware Chain-of-Verification (R-CoV), a training-free post-hoc method for alleviating object hallucinations in LVLMs. It consists of a six-step pipeline (initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation) that elicits region-level processing from the LVLM itself to serve as a chaining cue for self-verification, without requiring external detectors or fine-tuning. The central claim is that this approach can be seamlessly integrated into various LVLMs and yields significant reductions in object hallucinations, as demonstrated through experiments on multiple standard benchmarks across several models.
Significance. If the results hold after addressing the robustness concerns, R-CoV would provide a practical, lightweight way to improve LVLM reliability by leveraging the model's internal region cues rather than external tools or retraining. This aligns with the trend toward inference-time interventions and could see adoption in deployed multimodal systems. The training-free nature and lack of reliance on detection models are clear strengths, as is the evaluation across multiple LVLMs and benchmarks.
major comments (2)
- [Method section (six-step pipeline, particularly coordinate generation and region description)] The verification step operates on regions defined by coordinates and descriptions generated by the same LVLM. LVLMs have well-documented weaknesses in precise spatial localization and grounding. If the generated coordinates have low overlap with actual objects, verification occurs on irrelevant patches, which directly risks failing to correct (or even exacerbating) hallucinations. This assumption is load-bearing for the claim of effective alleviation in a training-free manner without external detectors; it requires explicit validation such as reporting localization metrics (e.g., IoU with ground-truth boxes) or ablation studies isolating the impact of coordinate accuracy.
- [Experiments] The abstract asserts that R-CoV 'significantly alleviate[s] object hallucinations,' yet the magnitude, statistical significance, and controls (e.g., for prompt sensitivity across the multi-step pipeline) are not detailed here. Without per-benchmark tables showing effect sizes, variance, and comparisons to strong baselines that also use multi-prompt strategies, it is difficult to assess whether improvements exceed what could be achieved by simpler prompting variants.
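The localization check the first comment asks for reduces to standard intersection-over-union between each generated box and its ground-truth annotation. A minimal sketch, with boxes as `(x1, y1, x2, y2)` corner tuples:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Averaging this over all generated boxes on a benchmark with box annotations would directly quantify how often verification is looking at the right patch.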
minor comments (2)
- [Abstract / Method] A pipeline diagram would greatly improve clarity for the six-step process described in the abstract and method.
- [Project page / Experiments] Ensure the GitHub repository includes all prompts used in each step and exact evaluation scripts for full reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Where the concerns are valid, we outline specific revisions to strengthen the paper while preserving its core contributions.
Point-by-point responses
-
Referee: [Method section (six-step pipeline, particularly coordinate generation and region description)] The verification step operates on regions defined by coordinates and descriptions generated by the same LVLM. LVLMs have well-documented weaknesses in precise spatial localization and grounding. If the generated coordinates have low overlap with actual objects, verification occurs on irrelevant patches, which directly risks failing to correct (or even exacerbating) hallucinations. This assumption is load-bearing for the claim of effective alleviation in a training-free manner without external detectors; it requires explicit validation such as reporting localization metrics (e.g., IoU with ground-truth boxes) or ablation studies isolating the impact of coordinate accuracy.
Authors: We acknowledge the referee's valid point that coordinate accuracy is a critical assumption. LVLMs do have known limitations in precise grounding, and inaccurate coordinates could in principle direct verification to irrelevant regions. However, our design intentionally uses the LVLM's own outputs for consistency: the same model generates both the initial response and the region cues, allowing it to verify against its internal representation rather than external ground truth. This self-referential chaining is the core of the training-free approach. To directly address the concern, we will add (1) an ablation study that isolates the coordinate generation step (comparing full R-CoV against a variant using only entity extraction and description without coordinates) and (2) localization quality metrics (average IoU with ground-truth boxes) on any benchmarks that provide bounding-box annotations. These additions will quantify the robustness of the assumption. revision: partial
-
Referee: Experimental results section: The abstract asserts that R-CoV 'significantly alleviate[s] object hallucinations,' yet the magnitude, statistical significance, and controls (e.g., for prompt sensitivity across the multi-step pipeline) are not detailed here. Without per-benchmark tables showing effect sizes, variance, and comparisons to strong baselines that also use multi-prompt strategies, it is difficult to assess whether improvements exceed what could be achieved by simpler prompting variants.
Authors: We agree that the current experimental presentation would benefit from greater granularity to support the 'significant alleviation' claim. In the revised manuscript we will expand the results section with: per-benchmark tables reporting absolute and relative reductions in hallucination rates, standard deviations across multiple independent runs (to quantify variance and prompt sensitivity), and paired statistical significance tests. We will also add direct comparisons to strong multi-prompt baselines (e.g., self-consistency, iterative Chain-of-Thought, and self-verification variants) that use comparable numbers of forward passes, thereby demonstrating that gains are attributable to the region-aware chaining rather than prompt length alone. revision: yes
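Since each benchmark item yields a paired binary outcome (hallucinated before vs. after the chain), one natural choice for the promised paired significance test is an exact McNemar test; whether the authors would pick this particular test is an assumption, but it is the standard tool for this data shape:

```python
from math import comb

def mcnemar_exact(before, after):
    """Exact McNemar test on paired binary outcomes (1 = hallucinated).

    Tests whether the hallucination rate differs between the initial
    and revised responses on the same set of items.
    """
    # Discordant pairs: fixed (1 -> 0) vs. introduced (0 -> 1) hallucinations
    b = sum(1 for x, y in zip(before, after) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(before, after) if x == 0 and y == 1)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any change
    # Two-sided exact binomial p-value under the null p = 0.5
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)
```

Only discordant pairs carry information here, which also makes the test a direct probe of the referee's concern: `c`, the count of hallucinations *introduced* by the chain, is reported explicitly rather than hidden in an aggregate rate.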
Circularity Check
No circularity in the derivation chain
full rationale
The paper presents a purely procedural, training-free method consisting of six explicit steps (initial response, entity extraction, coordinate generation, region description, verification, final response) that elicits region-level processing from the LVLM to verify its own outputs. No mathematical equations, fitted parameters, or self-citations are used to derive the central claim; effectiveness is shown via empirical evaluation on external hallucination benchmarks across multiple LVLMs. The result does not reduce to its inputs by construction, and the method remains self-contained against external benchmarks without any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LVLMs can produce usable bounding-box coordinates and region descriptions when explicitly prompted in the intermediate steps.
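This axiom is directly testable: prompt the model for a box and see whether anything coordinate-like comes back. A minimal parser for free-form output is sketched below; the bracketed `[x1, y1, x2, y2]` format is an assumption, since real LVLMs emit coordinates in varying formats and scales.

```python
import re

def parse_box(text, width, height):
    """Extract the first (x1, y1, x2, y2) box from free-form model output,
    clipped to the image bounds; returns None if no box is found."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    if len(nums) < 4:
        return None
    x1, y1, x2, y2 = (float(n) for n in nums[:4])
    # Clip to the image and reorder so x1 <= x2 and y1 <= y2
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return (x1, y1, x2, y2)
```

A parse failure or a degenerate (zero-area) box is itself a signal that the axiom does not hold for that model-prompt pair, which is exactly the failure mode flagged in the referee report.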