
arXiv: 2506.21546 · v4 · submitted 2025-06-26 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Pith reviewed 2026-05-19 07:32 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords: segmentation hallucinations · vision-language models · counterfactual reasoning · referring expression segmentation · abstention · benchmark · fine-tuning

The pith

Counterfactual fine-tuning trains segmentation models to abstain from masking absent objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Segmentation vision-language models frequently produce masks for objects that do not exist in the image. Current evaluations change only text or labels and therefore miss the spatial and visual causes of these errors. The paper formalizes Counterfactual Segmentation Reasoning so that a model must segment the correct object in a factual image yet abstain entirely in its visually altered counterpart. It supplies HalluSegBench, a large benchmark built from controlled visual edits, together with metrics that separate vision-driven from language-driven hallucinations. Training a model called RobustSeg with counterfactual fine-tuning on these pairs reduces hallucinations by 30 percent and raises accuracy on standard referring segmentation tests.

Core claim

By pairing each factual image with a controlled visual counterfactual in which the referenced object is removed or altered, a segmentation VLM can be trained to output a mask only when the object is present and to abstain otherwise, thereby cutting pixel-grounding hallucinations while preserving or improving segmentation quality.

What carries the argument

Counterfactual fine-tuning (CFT), which exposes the model to matched factual-counterfactual image pairs so it learns the visual conditions under which segmentation is appropriate.
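
One way to picture training on matched pairs is a loss with two terms: an ordinary segmentation term on the factual image and an empty-mask term on the counterfactual. The sketch below is an assumption made for illustration, not the paper's stated objective; the names `cft_pair_loss` and `abstain_weight` are hypothetical, and all values are made up.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over flat lists of logits and 0/1 targets."""
    # Numerically stable per-pixel form: max(x, 0) - x * t + log(1 + exp(-|x|)).
    total = sum(max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
                for x, t in zip(logits, targets))
    return total / len(logits)


def cft_pair_loss(factual_logits, counterfactual_logits, factual_mask, abstain_weight=1.0):
    """Hypothetical pair loss: segment on the factual image, predict nothing on the edit."""
    # Factual image: match the ground-truth mask of the referenced object.
    segment_term = bce_with_logits(factual_logits, factual_mask)
    # Counterfactual image: the object is absent, so the target is an empty mask.
    empty_target = [0.0] * len(counterfactual_logits)
    abstain_term = bce_with_logits(counterfactual_logits, empty_target)
    return segment_term + abstain_weight * abstain_term


# Toy flattened 2x3 image with made-up values.
mask = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
confident_logits = [6.0 if m == 1.0 else -6.0 for m in mask]  # masks the object region
silent_logits = [-6.0] * len(mask)                            # predicts no object

loss_when_abstaining = cft_pair_loss(confident_logits, silent_logits, mask)
loss_when_hallucinating = cft_pair_loss(confident_logits, confident_logits, mask)
```

Under this toy objective, a model that repeats the factual mask on the edited image is penalized more heavily than one that stays silent, which is the behavior the pairing is meant to encourage.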

If this is right

  • Models learn an explicit abstention signal tied directly to the presence or absence of the queried object in the visual input.
  • Vision-driven and language-driven hallucinations can be measured and reported separately using the new severity and disentanglement metrics.
  • Segmentation accuracy rises on FP-RefCOCO(+/g) while hallucination rates fall.
  • The same training recipe can be applied to any segmentation VLM that accepts image-text pairs.
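
For scoring purposes, an abstention has to be read off the model's output somehow. A minimal sketch, assuming the model emits per-pixel probabilities and that a near-empty mask counts as abstaining; the function name and both thresholds are hypothetical and are not taken from the paper.

```python
def segment_or_abstain(probabilities, pixel_threshold=0.5, min_area_fraction=0.01):
    """Return a 0/1 mask over a flat list of pixel probabilities, or None to abstain."""
    mask = [1 if p >= pixel_threshold else 0 for p in probabilities]
    # Treat a mask covering less than min_area_fraction of the image as "no object".
    if sum(mask) / len(mask) < min_area_fraction:
        return None
    return mask


present = segment_or_abstain([0.1, 0.9, 0.8, 0.2])   # two confident pixels: a mask
absent = segment_or_abstain([0.1, 0.2, 0.05, 0.3])   # nothing above threshold: abstain
```

A model that signals abstention in text rather than through an empty mask would need a different check; the point here is only that the abstention decision must be made explicit before hallucination rates can be counted.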

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factual-counterfactual pairing could be adapted to reduce grounding errors in object detection or visual question answering.
  • Automatically generated counterfactuals, rather than hand-crafted ones, would allow the method to scale to larger unlabeled datasets.
  • Combining the visual abstention signal with existing language-only hallucination detectors could address mixed failure modes more completely.

Load-bearing premise

The controlled visual changes used to build the counterfactual images isolate vision-driven hallucinations without adding artifacts or biases that would change how the model behaves or how the metrics are scored.

What would settle it

Run RobustSeg on a fresh collection of natural images that lack matched counterfactual versions and measure whether the reported 30 percent hallucination reduction disappears or whether accuracy on the original benchmarks falls.
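
Such a check needs a hallucination rate and a way to compare two rates. The sketch below assumes the rate is the share of referent-absent queries that still yield a mask, and shows how a relative reduction differs from an absolute one. This page does not say which of the two the reported 30 percent refers to, and every number in the example is made up.

```python
def hallucination_rate(produced_mask):
    """Share of referent-absent queries for which the model still produced a mask."""
    return sum(1 for flag in produced_mask if flag) / len(produced_mask)


def relative_reduction(baseline_rate, new_rate):
    """Drop in rate as a fraction of the baseline rate."""
    if baseline_rate == 0:
        raise ValueError("baseline rate must be non-zero")
    return (baseline_rate - new_rate) / baseline_rate


# Illustrative flags only: True means a mask was produced for an absent object.
baseline_flags = [True] * 10 + [False] * 10    # 10 of 20 queries hallucinated
finetuned_flags = [True] * 7 + [False] * 13    # 7 of 20 queries hallucinated

baseline_rate = hallucination_rate(baseline_flags)
finetuned_rate = hallucination_rate(finetuned_flags)
absolute_drop = baseline_rate - finetuned_rate
relative_drop = relative_reduction(baseline_rate, finetuned_rate)
```

In this toy case the absolute drop is 15 percentage points while the relative drop is about 30 percent, so a replication should state which convention it uses.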

Figures

Figures reproduced from arXiv: 2506.21546 by Adheesh Juvekar, Ismini Lourentzou, Jiaxun Zhang, Kiet A. Nguyen, Muntasir Wahed, Tianjiao Yu, Xingyou Liu, Xinzhuo Li, Yifan Shen.

Figure 1. Illustration of Segmentation Behavior from LISA […
Figure 2. Overview of HalluSegBench Dataset Characteristics. (a) Distribution of mask sizes as a percentage of the total image area. (b) Top-20 most frequent factual-counterfactual object replacement pairs, illustrating common substitution patterns in the dataset. Dataset Statistics. HalluSegBench comprises 1,340 mask pairs across 281 unique object classes totaling 2,680 segmentation masks and 2,342 images. Figur…
Figure 3. Distribution of Object Categories. … the entire dataset, represented as a percentage of the total image area and segmented by mask type: All (overall dataset), Factual, and Counterfactual instances. The majority of masks occupy a small fraction of the image, predominantly in the 5–10% range, mirroring typical real-world scenes where objects are part of larger visual contexts. Both factual and counterfactua…
Figure 4. mIoU Comparison of Reasoning Segmentation Models. Higher mIoU indicates better segmentation performance. Baselines. We evaluate a range of pixel-grounding VLMs, including models explicitly designed to mitigate grounded hallucination. The reasoning-based models include LISA [14], GLaMM [29], and PixelLM [31], which leverage large language models for reasoning, and SAM [13] or other Transformer-based archi…
Figure 5. Qualitative Comparison of Reasoning Segmentation Model Predictions across Factual and …
Figure 6. Overview of the Data Generation Pipeline.
Figure 7. Distribution of ∆IoU Across All Samples. Most ∆IoU values lie near zero, especially under visual edits, indicating persistent hallucinations. Metric Distributions and Summary Statistics. Figure 7 illustrates the empirical distribution of our ∆IoU across all examples in HalluSegBench and all baselines. The distribution of ∆IoU_textual and ∆IoU_visual reveals a bimodal pattern: one peak near 1.0 corresponding…
Figure 8. Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfactual …
Figure 9. Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfactual …
Figure 10. Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfac…
Figure 11. Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfac…
Figure 12. Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfac…
Figure 13. Prompt for Generating Object Replacement Instructions.
Figure 14. Prompt for Constrained Image Editing. This prompt instructs a generative model to edit only unmasked regions while preserving scene structure and realism. Here, {item['instruction']} denotes the extracted instruction using prompt shown in …
read the original abstract

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
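
The task statement in the abstract can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the paper: $f$ is the segmentation VLM, $I$ the factual image, $I'$ its counterfactual edit in which the referent is removed or replaced, $q$ the referring expression, and $M$ the ground-truth mask of the referent in $I$.

```latex
\[
  f(I, q) \approx M
  \qquad\text{and}\qquad
  f(I', q) = \emptyset ,
\]
```

Here $\approx$ is read as high overlap (for example, high IoU) with $M$, and $\emptyset$ denotes abstention, that is, an empty mask. A hallucination in this reading is any non-empty $f(I', q)$.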

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes Counterfactual Segmentation Reasoning (CSR) for segmentation VLMs to diagnose pixel-grounding hallucinations. It introduces HalluSegBench, a benchmark built on controlled visual counterfactuals for referring and reasoning expression segmentation, along with new metrics that quantify hallucination severity and attempt to disentangle vision-driven versus language-driven failures. It further proposes RobustSeg, a model trained via counterfactual fine-tuning (CFT), and reports that this approach reduces hallucinations by 30% while improving performance on FP-RefCOCO(+/g).

Significance. If the central empirical claims hold after proper validation, the work would meaningfully advance evaluation of grounded VLMs by shifting emphasis from text/label perturbations to vision-focused counterfactuals. The benchmark construction, severity metrics, and CFT training procedure represent practical contributions that could help the community better isolate and mitigate vision-driven hallucinations. The attempt to disentangle failure modes is a notable strength if the metrics prove robust.

major comments (3)
  1. [Abstract] Abstract: The claim that RobustSeg 'reduces hallucinations by 30%' and improves segmentation on FP-RefCOCO(+/g) is presented without any information on dataset size, counterfactual generation procedure, baseline models, statistical significance testing, or precise definitions of the severity metrics. This absence prevents assessment of whether the reported gains are load-bearing or reproducible.
  2. [HalluSegBench construction] HalluSegBench construction (methods section): The central assumption that controlled visual counterfactuals (object removal or scene editing) isolate vision-driven hallucinations without introducing new artifacts is not validated. Unnatural textures, lighting inconsistencies, or correlated semantic shifts could cause models to abstain for low-level visual reasons rather than grounding failures, confounding both the severity metrics and the claimed CFT benefit.
  3. [Evaluation metrics] Evaluation metrics section: The new metrics intended to measure hallucination severity and disentangle vision- versus language-driven modes lack explicit formulations, ablation studies, or controls for potential biases introduced by the counterfactual generation process. Without these, it is unclear whether the metrics achieve the claimed disentanglement or simply reflect artifacts in the benchmark.
minor comments (2)
  1. [Introduction] The acronym CSR is defined in the abstract but could be restated at first use in the introduction for improved readability.
  2. [Evaluation metrics] Notation for the new severity and disentanglement scores should be accompanied by explicit equations or pseudocode to support reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work formalizing Counterfactual Segmentation Reasoning and introducing HalluSegBench along with RobustSeg. The comments highlight important areas for improving clarity and rigor. We address each major comment point-by-point below, providing explanations grounded in the manuscript and indicating revisions where they will strengthen the presentation without altering core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that RobustSeg 'reduces hallucinations by 30%' and improves segmentation on FP-RefCOCO(+/g) is presented without any information on dataset size, counterfactual generation procedure, baseline models, statistical significance testing, or precise definitions of the severity metrics. This absence prevents assessment of whether the reported gains are load-bearing or reproducible.

    Authors: We acknowledge that the abstract is concise by design and omits granular details to fit length constraints. The full manuscript specifies HalluSegBench scale (thousands of counterfactual pairs derived from RefCOCO and similar sources), the counterfactual generation via object removal and scene editing in Section 3, baselines including standard segmentation VLMs, severity metrics defined in Section 3.2 as normalized hallucinated pixel ratios with vision/language disentanglement, and statistical significance via paired t-tests with p-values reported in Section 4. To improve immediate assessability, we will revise the abstract to include brief mentions of benchmark size, the 30% reduction context, and reference to significance testing. revision: yes

  2. Referee: [HalluSegBench construction] HalluSegBench construction (methods section): The central assumption that controlled visual counterfactuals (object removal or scene editing) isolate vision-driven hallucinations without introducing new artifacts is not validated. Unnatural textures, lighting inconsistencies, or correlated semantic shifts could cause models to abstain for low-level visual reasons rather than grounding failures, confounding both the severity metrics and the claimed CFT benefit.

    Authors: This concern about potential visual artifacts is well-taken and directly relevant to the validity of CSR. The manuscript describes use of controlled editing pipelines with post-generation filtering for visual coherence and includes qualitative examples plus controls testing abstention on referent-absent counterfactuals. However, we agree explicit validation against low-level confounds would strengthen the work. We will add a dedicated subsection with quantitative checks (e.g., human ratings of naturalness and model performance on artifact-controlled subsets) and an ablation isolating editing effects. revision: yes

  3. Referee: [Evaluation metrics] Evaluation metrics section: The new metrics intended to measure hallucination severity and disentangle vision- versus language-driven modes lack explicit formulations, ablation studies, or controls for potential biases introduced by the counterfactual generation process. Without these, it is unclear whether the metrics achieve the claimed disentanglement or simply reflect artifacts in the benchmark.

    Authors: We agree that explicit formulations and controls are essential for the metrics' credibility. Section 3.3 provides the severity metric as the fraction of hallucinated area in counterfactuals where abstention fails, with vision-driven failures isolated by holding language fixed and varying visuals, and language-driven by the converse; initial ablations correlate with human annotations. To address the referee's point fully, we will expand this section with complete mathematical definitions, additional ablation tables controlling for generation biases (e.g., texture/lighting variants), and bias analysis results in the revised manuscript. revision: yes
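
The quantities discussed in this exchange can be sketched in a few lines. This is one plausible reading, not the paper's definition: severity is taken as the share of image pixels masked when the referent is absent, and ∆IoU as the drop in IoU against the factual mask after a perturbation, which matches the Figure 7 caption's remark that values near zero indicate persistent hallucination. All function names and toy masks are hypothetical.

```python
def iou(pred, target):
    """IoU of two flat 0/1 masks; two empty masks count as perfect agreement."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return 1.0 if union == 0 else intersection / union


def hallucinated_area_fraction(counterfactual_pred):
    """Share of image pixels masked although the referent is absent."""
    return sum(1 for p in counterfactual_pred if p) / len(counterfactual_pred)


def delta_iou(factual_pred, perturbed_pred, factual_mask):
    """IoU drop against the factual mask; near 0 means the old mask persists."""
    return iou(factual_pred, factual_mask) - iou(perturbed_pred, factual_mask)


# Toy flattened masks (made up). The object occupies pixels 1 and 2.
factual_mask = [0, 1, 1, 0, 0, 0, 0, 0]
factual_pred = [0, 1, 1, 0, 0, 0, 0, 0]
visual_edit_pred = [0, 1, 1, 0, 0, 0, 0, 0]  # image edited, language fixed: mask unchanged
text_swap_pred = [0] * 8                     # query swapped, image fixed: model abstains

delta_visual = delta_iou(factual_pred, visual_edit_pred, factual_mask)
delta_textual = delta_iou(factual_pred, text_swap_pred, factual_mask)
severity = hallucinated_area_fraction(visual_edit_pred)
```

In this toy case the model abstains when the query changes but not when the image changes, so the failure shows up only in the visual quantity. That asymmetry is the kind of disentanglement the referee asks the authors to define explicitly.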

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and fine-tuning

full rationale

The paper introduces the CSR task, curates HalluSegBench via controlled visual counterfactuals, defines new severity metrics, and trains RobustSeg with counterfactual fine-tuning (CFT). It contains no equations, derivations, parameter fits, or self-citation chains that would reduce any claim to its inputs by construction. The central results (a 30% hallucination reduction and gains on FP-RefCOCO) come from direct experiments on the newly constructed benchmark and on external datasets, so no claim rests on previously fitted quantities or on author-defined uniqueness.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach depends on the ability to generate counterfactual images that cleanly remove or alter the target object while preserving other scene elements.

axioms (1)
  • domain assumption: Visual counterfactuals can be constructed to isolate vision-driven hallucinations without confounding artifacts
    Invoked when defining the CSR task and HalluSegBench to separate vision- from language-driven failures.
invented entities (1)
  • RobustSeg (no independent evidence)
    purpose: Segmentation VLM trained with counterfactual fine-tuning to learn abstention
    New model introduced to demonstrate mitigation; no independent evidence provided beyond the reported 30% reduction.

pith-pipeline@v0.9.0 · 5783 in / 1250 out tokens · 40362 ms · 2026-05-19T07:32:04.785228+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

    cs.CV · 2026-04 · unverdicted · novelty 8.0

    3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.

  2. VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.

  3. From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 3 Pith papers · 3 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  2. [2]

    Blended Latent Diffusion

    Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion. ACM Transactions on Graphics (TOG), 2023

  3. [3]

    Mitigating Open-Vocabulary Caption Hallucinations

    Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, and Hadar Averbuch-Elor. Mitigating Open-Vocabulary Caption Hallucinations. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  4. [4]

    Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

    Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting. arXiv preprint arXiv:2503.21770, 2025

  5. [5]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  6. [6]

    UNITER: UNiversal Image-TExt Representation Learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision (ECCV), 2020

  7. [7]

    SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  8. [8]

    Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

    Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023

  9. [9]

    Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?

    Gregor Geigle, Radu Timofte, and Goran Glavaš. Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  10. [10]

    Visual Hallucinations of Multi-modal Large Language Models

    Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual Hallucinations of Multi-modal Large Language Models. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024

  11. [11]

    SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  12. [12]

    A Style-Based Generator Architecture for Generative Adversarial Networks

    Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  13. [13]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. In International Conference on Computer Vision (ICCV), 2023

  14. [14]

    LISA: Reasoning Segmentation via Large Language Model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning Segmentation via Large Language Model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  15. [15]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning (ICML), 2023

  16. [16]

    ZONE: Zero-Shot Instruction-Guided Local Editing

    Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xiuhui Liu, Jiaming Liu, Lin Li, Xu Tang, Yao Hu, Jianzhuang Liu, et al. ZONE: Zero-Shot Instruction-Guided Local Editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  17. [17]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  18. [18]

    Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering

    Zujie Liang, Weitao Jiang, Haifeng Hu, and Jiaying Zhu. Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

  19. [19]

    GRES: Generalized Referring Expression Segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized Referring Expression Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  20. [20]

    PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset

    Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  21. [21]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement. arXiv preprint arXiv:2503.06520, 2025

  22. [22]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  23. [23]

    Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

    Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), 2024

  24. [24]

    Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models

    Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models. In European Conference on Computer Vision (ECCV), 2024

  25. [25]

    CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

    Kiet A Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, and Ismini Lourentzou. CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning (ICML), 2022

  27. [27]

    Counterfactual VQA: A Cause-Effect Look at Language Bias

    Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A Cause-Effect Look at Language Bias. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  28. [28]

    Counterfactual Vision-and-Language Navigation: Unravelling the Unseen

    Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, and Anton Van den Hengel. Counterfactual Vision-and-Language Navigation: Unravelling the Unseen. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  29. [29]

    GLaMM: Pixel Grounding Large Multimodal Model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. GLaMM: Pixel Grounding Large Multimodal Model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  30. [30]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv preprint arXiv:2401.14159, 2024

  31. [31]

    PixelLM: Pixel Reasoning with Large Multimodal Model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel Reasoning with Large Multimodal Model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  32. [32]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucination in Image Captioning. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

  33. [33]

    Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

    Neelabh Sinha, Vinija Jain, and Aman Chadha. Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types. In Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, 2025

  34. [34]

    Rethinking Visual Counterfactual Explanations Through Region Constraint

    Bartlomiej Sobieski, Jakub Grzywaczewski, Bartłomiej Sadlej, Matthew Tivnan, and Przemyslaw Biecek. Rethinking Visual Counterfactual Explanations Through Region Constraint. In International Conference on Learning Representations (ICLR), 2024

  35. [35]

    Doubly Abductive Counterfactual Inference for Text-based Image Editing

    Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, and Yu-Gang Jiang. Doubly Abductive Counterfactual Inference for Text-based Image Editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  36. [36]

    Features of Similarity

    Amos Tversky. Features of Similarity. Psychological Review, 1977

  37. [37]

    PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

    Muntasir Wahed, Kiet A Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou. PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation. arXiv preprint arXiv:2412.15209, 2024

  38. [38]

    Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation

    Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  39. [39]

    OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research (PMLR), 2022

  40. [40]

    InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

    Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions. arXiv preprint arXiv:2305.18047, 2023

  41. [41]

    HyperSeg: Towards Universal Visual Segmentation with Large Language Model

    Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. HyperSeg: Towards Universal Visual Segmentation with Large Language Model. arXiv preprint arXiv:2411.17606, 2024

  42. [42]

    InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models. arXiv preprint arXiv:2412.14006, 2024

    Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng Zhao, and Yujiu Yang. InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models. arXiv preprint arXiv:2412.14006, 2024

  43. [43]

    Towards Robust Referring Image Segmentation. IEEE Transactions on Image Processing (TIP), 2024

    Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, and Dacheng Tao. Towards Robust Referring Image Segmentation. IEEE Transactions on Image Processing (TIP), 2024

  44. [44]

    See, Say, and Segment: Teaching LMMs to Overcome False Premises

    Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E Gonzalez, and Trevor Darrell. See, Say, and Segment: Teaching LMMs to Overcome False Premises. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  45. [45]

    GSVA: Generalized Segmentation via Multimodal Large Language Models

    Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized Segmentation via Multimodal Large Language Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  46. [46]

    Benchmarking Segmentation Models with Mask-Preserved Attribute Editing

    Zijin Yin, Kongming Liang, Bing Li, Zhanyu Ma, and Jun Guo. Benchmarking Segmentation Models with Mask-Preserved Attribute Editing. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  47. [47]

    Modeling Context in Referring Expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling Context in Referring Expressions. In European Conference on Computer Vision (ECCV), 2016

  48. [48]

    Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779, 2023

    Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, and Manling Li. HallE-Control: Controlling Object Hallucination in Large Multimodal Models. arXiv preprint arXiv:2310.01779, 2023

  49. [49]

    MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  50. [50]

    OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

    Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  51. [51]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2309.01219, 2023

  52. [52]

    Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In International Conference on Computer Vision (ICCV), 2017

    A. HalluSegBench Details

    Motivation. HalluSegBench introduces a counterfactual visual reasoning framework to evaluate segmentation models under contro...

  53. [53]


    A binary mask marking an object labeled "{label}", described as "{description}". In case of vague or wrong descriptions, follow the image and mask.

    Task:
    - Locate the masked object precisely.
    - Create a replacement instruction that:
      • Uniquely identifies the object (position, color, size, etc.)
      • Swaps it for a new object that is not already present.
      • Ne...