Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Adheesh Juvekar; Ismini Lourentzou; Jiaxun Zhang; Kiet A. Nguyen; Muntasir Wahed; Tianjiao Yu; Xingyou Liu; Xinzhuo Li; Yifan Shen

arxiv: 2506.21546 · v4 · submitted 2025-06-26 · 💻 cs.CV · cs.AI· cs.CL· cs.LG

Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

Xinzhuo Li , Adheesh Juvekar , Jiaxun Zhang , Xingyou Liu , Muntasir Wahed , Kiet A. Nguyen , Yifan Shen , Tianjiao Yu

show 1 more author

Ismini Lourentzou

This is my paper

Pith reviewed 2026-05-19 07:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.LG

keywords segmentation hallucinationsvision-language modelscounterfactual reasoningreferring expression segmentationabstentionbenchmarkfine-tuning

0 comments

The pith

Counterfactual fine-tuning trains segmentation models to abstain from masking absent objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Segmentation vision-language models frequently produce masks for objects that do not exist in the image. Current evaluations change only text or labels and therefore miss the spatial and visual causes of these errors. The paper formalizes Counterfactual Segmentation Reasoning so that a model must segment the correct object in a factual image yet abstain entirely in its visually altered counterpart. It supplies HalluSegBench, a large benchmark built from controlled visual edits, together with metrics that separate vision-driven from language-driven hallucinations. Training a model called RobustSeg with counterfactual fine-tuning on these pairs reduces hallucinations by 30 percent and raises accuracy on standard referring segmentation tests.

Core claim

By pairing each factual image with a controlled visual counterfactual in which the referenced object is removed or altered, a segmentation VLM can be trained to output a mask only when the object is present and to abstain otherwise, thereby cutting pixel-grounding hallucinations while preserving or improving segmentation quality.

What carries the argument

Counterfactual fine-tuning (CFT), which exposes the model to matched factual-counterfactual image pairs so it learns the visual conditions under which segmentation is appropriate.

If this is right

Models learn an explicit abstention signal tied directly to the presence or absence of the queried object in the visual input.
Vision-driven and language-driven hallucinations can be measured and reported separately using the new severity and disentanglement metrics.
Segmentation accuracy rises on FP-RefCOCO(+/g) while hallucination rates fall.
The same training recipe can be applied to any segmentation VLM that accepts image-text pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factual-counterfactual pairing could be adapted to reduce grounding errors in object detection or visual question answering.
Automatically generated counterfactuals, rather than hand-crafted ones, would allow the method to scale to larger unlabeled datasets.
Combining the visual abstention signal with existing language-only hallucination detectors could address mixed failure modes more completely.

Load-bearing premise

The controlled visual changes used to build the counterfactual images isolate vision-driven hallucinations without adding artifacts or biases that would change how the model behaves or how the metrics are scored.

What would settle it

Run RobustSeg on a fresh collection of natural images that lack matched counterfactual versions and measure whether the reported 30 percent hallucination reduction disappears or whether accuracy on the original benchmarks falls.

Figures

Figures reproduced from arXiv: 2506.21546 by Adheesh Juvekar, Ismini Lourentzou, Jiaxun Zhang, Kiet A. Nguyen, Muntasir Wahed, Tianjiao Yu, Xingyou Liu, Xinzhuo Li, Yifan Shen.

**Figure 2.** Figure 2: Overview of HalluSegBench Dataset Characteristics. (a) Distribution of mask sizes as a percentage of the total image area. (b) Top-20 most frequent factual-counterfactual object replacement pairs, illustrating common substitution patterns in the dataset. Dataset Statistics. HalluSegBench comprises 1, 340 mask pairs across 281 unique object classes totaling 2, 680 segmentation masks and 2, 342 images. Figur… view at source ↗

**Figure 3.** Figure 3: Distribution of Object Categories. the entire dataset, represented as a percentage of the total image area and segmented by mask type: All (overall dataset), Factual, and Counterfactual instances. The majority of masks occupy a small fraction of the image, predominantly in the 5–10% range, mirroring typical real-world scenes where objects are part of larger visual contexts. Both factual and counterfactua… view at source ↗

**Figure 4.** Figure 4: mIoU Comparison of Reasoning Segmentation Models. Higher mIoU indicates better segmentation performance. Baselines. We evaluate a range of pixel-grounding VLMs, including models explicitly designed to mitigate grounded hallucination. The reasoning-based models include LISA [14], GLaMM [29], and PixelLM [31], which leverage large language models for reasoning, and SAM [13] or other Transformer-based archi… view at source ↗

**Figure 5.** Figure 5: Qualitative Comparison of Reasoning Segmentation Model Predictions across Factual and [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of the Data Generation Pipeline. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Distribution of ∆IoU Across All Samples. Most ∆IoU values lie near zero, especially under visual edits, indicating persistent hallucinations. Metric Distributions and Summary Statistics. Figure 7 illustrates the empirical distribution of our ∆IoU across all examples in HalluSegBench and all baselines. The distribution of ∆IoUtextual and ∆IoUvisual reveals a bimodal pattern: one peak near 1.0 corresponding… view at source ↗

**Figure 8.** Figure 8: Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfactual [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfactual [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfac [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfac [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Qualitative Comparison of Reasoning Segmentation Models across Factual and Counterfac [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt for Generating Object Replacement Instructions. [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt for Constrained Image Editing. This prompt instructs a generative model to edit only unmasked regions while preserving scene structure and realism. Here, {item[’instruction’]} denotes the extracted instruction using prompt shown in [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗

read the original abstract

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces CSR and HalluSegBench to test vision-driven hallucinations via visual counterfactuals, with a fine-tuning method claiming 30% reduction, but the results depend on whether those edits stay clean.

read the letter

The main takeaway is that this paper defines a new task called Counterfactual Segmentation Reasoning and releases HalluSegBench, a benchmark that uses visual counterfactuals instead of text changes to test for pixel-grounding hallucinations in VLMs. What the paper does well is point out that current evaluations miss vision-driven failures because they only check label matches. By requiring the model to segment in the original image but abstain in the edited version, they try to measure when the model hallucinates based on visual input. The RobustSeg model with counterfactual fine-tuning is presented as a way to teach abstention, and they report a 30% drop in hallucinations along with gains on FP-RefCOCO. The benchmark and metrics for severity and disentangling vision versus language modes could be useful additions if the setup is sound. The soft spot is the quality of those visual counterfactuals. The stress test raises a fair point: if editing the images to create counterfactuals adds unnatural elements, the model might learn to detect those instead of truly reasoning about object presence. That would make the hallucination reduction look better than it is. The paper needs to detail the generation method clearly and show that it doesn't introduce biases or artifacts that affect the results. From what is described, the empirical claims depend on this isolation working as intended. If the full paper has solid controls for that, it strengthens the case. This kind of work is for people focused on making VLMs more reliable in real applications where wrong segmentations can cause problems. Readers who care about evaluation benchmarks for hallucinations will find the task definition and the attempt to separate failure modes worth their time. It should go to peer review because the core idea addresses a genuine limitation in current methods, even though the details on benchmark construction will need close examination.

Referee Report

3 major / 2 minor

Summary. The paper formalizes Counterfactual Segmentation Reasoning (CSR) for segmentation VLMs to diagnose pixel-grounding hallucinations. It introduces HalluSegBench, a benchmark built on controlled visual counterfactuals for referring and reasoning expression segmentation, along with new metrics that quantify hallucination severity and attempt to disentangle vision-driven versus language-driven failures. It further proposes RobustSeg, a model trained via counterfactual fine-tuning (CFT), and reports that this approach reduces hallucinations by 30% while improving performance on FP-RefCOCO(+/g).

Significance. If the central empirical claims hold after proper validation, the work would meaningfully advance evaluation of grounded VLMs by shifting emphasis from text/label perturbations to vision-focused counterfactuals. The benchmark construction, severity metrics, and CFT training procedure represent practical contributions that could help the community better isolate and mitigate vision-driven hallucinations. The attempt to disentangle failure modes is a notable strength if the metrics prove robust.

major comments (3)

[Abstract] Abstract: The claim that RobustSeg 'reduces hallucinations by 30%' and improves segmentation on FP-RefCOCO(+/g) is presented without any information on dataset size, counterfactual generation procedure, baseline models, statistical significance testing, or precise definitions of the severity metrics. This absence prevents assessment of whether the reported gains are load-bearing or reproducible.
[HalluSegBench construction] HalluSegBench construction (methods section): The central assumption that controlled visual counterfactuals (object removal or scene editing) isolate vision-driven hallucinations without introducing new artifacts is not validated. Unnatural textures, lighting inconsistencies, or correlated semantic shifts could cause models to abstain for low-level visual reasons rather than grounding failures, confounding both the severity metrics and the claimed CFT benefit.
[Evaluation metrics] Evaluation metrics section: The new metrics intended to measure hallucination severity and disentangle vision- versus language-driven modes lack explicit formulations, ablation studies, or controls for potential biases introduced by the counterfactual generation process. Without these, it is unclear whether the metrics achieve the claimed disentanglement or simply reflect artifacts in the benchmark.

minor comments (2)

[Introduction] The acronym CSR is defined in the abstract but could be restated at first use in the introduction for improved readability.
[Evaluation metrics] Notation for the new severity and disentanglement scores should be accompanied by explicit equations or pseudocode to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work formalizing Counterfactual Segmentation Reasoning and introducing HalluSegBench along with RobustSeg. The comments highlight important areas for improving clarity and rigor. We address each major comment point-by-point below, providing explanations grounded in the manuscript and indicating revisions where they will strengthen the presentation without altering core claims.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that RobustSeg 'reduces hallucinations by 30%' and improves segmentation on FP-RefCOCO(+/g) is presented without any information on dataset size, counterfactual generation procedure, baseline models, statistical significance testing, or precise definitions of the severity metrics. This absence prevents assessment of whether the reported gains are load-bearing or reproducible.

Authors: We acknowledge that the abstract is concise by design and omits granular details to fit length constraints. The full manuscript specifies HalluSegBench scale (thousands of counterfactual pairs derived from RefCOCO and similar sources), the counterfactual generation via object removal and scene editing in Section 3, baselines including standard segmentation VLMs, severity metrics defined in Section 3.2 as normalized hallucinated pixel ratios with vision/language disentanglement, and statistical significance via paired t-tests with p-values reported in Section 4. To improve immediate assessability, we will revise the abstract to include brief mentions of benchmark size, the 30% reduction context, and reference to significance testing. revision: yes
Referee: [HalluSegBench construction] HalluSegBench construction (methods section): The central assumption that controlled visual counterfactuals (object removal or scene editing) isolate vision-driven hallucinations without introducing new artifacts is not validated. Unnatural textures, lighting inconsistencies, or correlated semantic shifts could cause models to abstain for low-level visual reasons rather than grounding failures, confounding both the severity metrics and the claimed CFT benefit.

Authors: This concern about potential visual artifacts is well-taken and directly relevant to the validity of CSR. The manuscript describes use of controlled editing pipelines with post-generation filtering for visual coherence and includes qualitative examples plus controls testing abstention on referent-absent counterfactuals. However, we agree explicit validation against low-level confounds would strengthen the work. We will add a dedicated subsection with quantitative checks (e.g., human ratings of naturalness and model performance on artifact-controlled subsets) and an ablation isolating editing effects. revision: yes
Referee: [Evaluation metrics] Evaluation metrics section: The new metrics intended to measure hallucination severity and disentangle vision- versus language-driven modes lack explicit formulations, ablation studies, or controls for potential biases introduced by the counterfactual generation process. Without these, it is unclear whether the metrics achieve the claimed disentanglement or simply reflect artifacts in the benchmark.

Authors: We agree that explicit formulations and controls are essential for the metrics' credibility. Section 3.3 provides the severity metric as the fraction of hallucinated area in counterfactuals where abstention fails, with vision-driven failures isolated by holding language fixed and varying visuals, and language-driven by the converse; initial ablations correlate with human annotations. To address the referee's point fully, we will expand this section with complete mathematical definitions, additional ablation tables controlling for generation biases (e.g., texture/lighting variants), and bias analysis results in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and fine-tuning

full rationale

The paper introduces the CSR task, curates HalluSegBench via controlled visual counterfactuals, defines new severity metrics, and trains RobustSeg with counterfactual fine-tuning (CFT). No equations, derivations, parameter fittings, or self-citation chains are present that would reduce any claim to its inputs by construction. Central results (30% hallucination reduction, FP-RefCOCO gains) are reported from direct experiments on the newly constructed benchmark and external datasets, rendering the work self-contained against external benchmarks with no load-bearing reductions to prior fitted quantities or author-defined uniqueness.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on the ability to generate counterfactual images that cleanly remove or alter the target object while preserving other scene elements.

axioms (1)

domain assumption Visual counterfactuals can be constructed to isolate vision-driven hallucinations without confounding artifacts
Invoked when defining the CSR task and HalluSegBench to separate vision- from language-driven failures.

invented entities (1)

RobustSeg no independent evidence
purpose: Segmentation VLM trained with counterfactual fine-tuning to learn abstention
New model introduced to demonstrate mitigation; no independent evidence provided beyond the reported 30% reduction.

pith-pipeline@v0.9.0 · 5783 in / 1250 out tokens · 40362 ms · 2026-05-19T07:32:04.785228+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
cs.CV 2026-04 unverdicted novelty 8.0

3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.
VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
cs.CV 2026-05 unverdicted novelty 7.0

VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.
From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
cs.CV 2026-05 unverdicted novelty 7.0

CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 3 Pith papers · 3 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, Jean-Baptiste and Donahue, Jeff and Luc, Pauline and Miech, Antoine and Barr, Iain and Hasson, Yana and Lenc, Karel and Mensch, Arthur and Millican, Katherine and Reynolds, Malcolm and others. Flamingo: a Visual Language Model for Few-Shot Learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[2]

Blended Latent Diffusion.ACM Transactions on Graphics (TOG), 2023

Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion.ACM Transactions on Graphics (TOG), 2023

work page 2023
[3]

Mitigating Open- Vocabulary Caption Hallucinations

Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, and Hadar Averbuch-Elor. Mitigating Open- Vocabulary Caption Hallucinations. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024
[4]

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting.arXiv preprint arXiv:2503.21770, 2025

Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting.arXiv preprint arXiv:2503.21770, 2025

work page arXiv 2025
[5]

InstructPix2Pix: Learning to Follow Image Editing Instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[6]

UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. InEuropean Conference on Computer Vision (ECCV), 2020. 11

work page 2020
[7]

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabil- ities

Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brain and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabil- ities. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[8]

Plausible May Not Be Faithful: Prob- ing Object Hallucination in Vision-Language Pre-training

Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. Plausible May Not Be Faithful: Prob- ing Object Hallucination in Vision-Language Pre-training. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023

work page 2023
[9]

Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Gregor Geigle, Radu Timofte, and Goran Glavaš. Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024
[10]

Visual Hallucinations of Multi-modal Large Language Models

Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual Hallucinations of Multi-modal Large Language Models. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2024

work page 2024
[11]

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[12]

A Style-Based Generator Architecture for Generative Adversarial Networks

Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019
[13]

SegmentAnything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, SpencerWhitehead, AlexanderCBerg, Wan-YenLo, etal. SegmentAnything. InInternationalConference on Computer Vision (ICCV), 2023

work page 2023
[14]

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning Segmentation via Large Language Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[15]

BLIP-2: Bootstrapping Language-Image Pre- training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre- training with Frozen Image Encoders and Large Language Models. InInternational Conference on Machine Learning (ICML), 2023

work page 2023
[16]

ZONE: Zero-Shot Instruction-Guided Local Editing

Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xiuhui Liu, Jiaming Liu, Lin Li, Xu Tang, Yao Hu, Jianzhuang Liu, et al. ZONE: Zero-Shot Instruction-Guided Local Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[17]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2023

work page 2023
[18]

Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering

Zujie Liang, Weitao Jiang, Haifeng Hu, and Jiaying Zhu. Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020
[19]

GRES: Generalized Referring Expression Segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized Referring Expression Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 12

work page 2023
[20]

PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset

Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[21]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement.arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Visual Instruction Tuning

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae. Visual Instruction Tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[23]

Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models. InProceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), 2024

work page 2024
[24]

Groma: Localized Visual Tokeniza- tion for Grounding Multimodal Large Language Models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized Visual Tokeniza- tion for Grounding Multimodal Large Language Models. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[25]

CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

Kiet A Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, and Ismini Lourentzou. CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[26]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. InInternational Conference on Machine Learning (ICML), 2022

work page 2022
[27]

Counterfactual VQA: A Cause-Effect Look at Language Bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A Cause-Effect Look at Language Bias. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021
[28]

Counterfactual Vision-and-Language Navigation: Unravelling the Unseen

Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, and Anton Van den Hengel. Counterfactual Vision-and-Language Navigation: Unravelling the Unseen. In Advances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[29]

GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. GLaMM: Pixel Grounding Large Multimodal Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[30]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks.arXiv preprint arXiv:2401.14159, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

PixelLM: Pixel Reasoning with Large Multimodal Model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel Reasoning with Large Multimodal Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[32]

Object Hallucina- tion in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucina- tion in Image Captioning. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2018. 13

work page 2018
[33]

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Neelabh Sinha, Vinija Jain, and Aman Chadha. Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types. InProceedings of the First Workshop of Evaluation of Multi-Modal Generation, 2025

work page 2025
[34]

Rethinking Visual Counterfactual Explanations Through Region Constraint

Bartlomiej Sobieski, Jakub Grzywaczewski, Bartłomiej Sadlej, Matthew Tivnan, and Przemyslaw Biecek. Rethinking Visual Counterfactual Explanations Through Region Constraint. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[35]

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, and Yu-Gang Jiang. Doubly Abductive Counterfactual Inference for Text-based Image Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[36]

Features of Similarity.Psychological review, 1977

Amos Tversky. Features of Similarity.Psychological review, 1977

work page 1977
[37]

PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation.arXiv preprint arXiv:2412.15209, 2024

Muntasir Wahed, Kiet A Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou. PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation.arXiv preprint arXiv:2412.15209, 2024

work page arXiv 2024
[38]

Counterfactual Cycle- Consistent Learning for Instruction Following and Generation in Vision-Language Navigation

Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. Counterfactual Cycle- Consistent Learning for Instruction Following and Generation in Vision-Language Navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[39]

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. InInternational Conference on Machine Learning (ICML). Proceedings of Machine Learning Research (PMLR), 2022

work page 2022
[40]

Instructedit: Improving automatic masks for diffusion-based image editing with user instructions.ArXiv, abs/2305.18047, 2023

Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions.arXiv preprint arXiv:2305.18047, 2023

work page arXiv 2023
[41]

Hyperseg: Towards universal visual segmentation with large language model, 2024

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. HyperSeg: Towards Universal Visual Segmentation with Large Language Model.arXiv preprint arXiv:2411.17606, 2024

work page arXiv 2024
[42]

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models.arXiv preprint arXiv:2412.14006, 2024

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng Zhao, and Yujiu Yang. InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models.arXiv preprint arXiv:2412.14006, 2024

work page arXiv 2024
[43]

Towards Robust Referring Image Segmentation.IEEE Transactions on Image Processing (TIP), 2024

Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, and Dacheng Tao. Towards Robust Referring Image Segmentation.IEEE Transactions on Image Processing (TIP), 2024

work page 2024
[44]

See, Say, and Segment: Teaching LMMs to Overcome False Premises

Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E Gonzalez, and Trevor Darrell. See, Say, and Segment: Teaching LMMs to Overcome False Premises. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[45]

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized Segmentation via Multimodal Large Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[46]

Benchmarking Segmentation Models with Mask-Preserved Attribute Editing

Zijin Yin, Kongming Liang, Bing Li, Zhanyu Ma, and Jun Guo. Benchmarking Segmentation Models with Mask-Preserved Attribute Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 14

work page 2024
[47]

Modeling Context in Referring Expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling Context in Referring Expressions. InEuropean Conference on Computer Vision (ECCV), 2016

work page 2016
[48]

Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, and Manling Li. HallE- Control: ControllingObjectHallucinationinLargeMultimodalModels. arXivpreprintarXiv:2310.01779 , 2023

work page arXiv 2023
[49]

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[50]

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understand- ing

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understand- ing. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[51]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.arXiv preprint arXiv:2309.01219, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

elephant

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. InInternational Conference on Computer Vision (ICCV), 2017. 15 A. HalluSegBench Details Motivation. HalluSegBenchintroduces a counterfactual visual reasoning framework to evaluate segmenta- tion models under contro...

work page arXiv 2017
[53]

{label}", described as

A binary mask marking an object labeled "{label}", described as "{description}". In case of vague or wrong descriptions, follow the image and mask. Task: - Locate the masked object precisely. - Create a replacement instruction that: • Uniquely identifies the object (position, color, size, etc.) • Swaps it for a new object that is not already present. • Ne...

work page

[1] [1]

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, Jean-Baptiste and Donahue, Jeff and Luc, Pauline and Miech, Antoine and Barr, Iain and Hasson, Yana and Lenc, Karel and Mensch, Arthur and Millican, Katherine and Reynolds, Malcolm and others. Flamingo: a Visual Language Model for Few-Shot Learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[2] [2]

Blended Latent Diffusion.ACM Transactions on Graphics (TOG), 2023

Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion.ACM Transactions on Graphics (TOG), 2023

work page 2023

[3] [3]

Mitigating Open- Vocabulary Caption Hallucinations

Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, and Hadar Averbuch-Elor. Mitigating Open- Vocabulary Caption Hallucinations. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024

[4] [4]

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting.arXiv preprint arXiv:2503.21770, 2025

Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting.arXiv preprint arXiv:2503.21770, 2025

work page arXiv 2025

[5] [5]

InstructPix2Pix: Learning to Follow Image Editing Instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[6] [6]

UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. InEuropean Conference on Computer Vision (ECCV), 2020. 11

work page 2020

[7] [7]

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabil- ities

Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brain and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabil- ities. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[8] [8]

Plausible May Not Be Faithful: Prob- ing Object Hallucination in Vision-Language Pre-training

Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. Plausible May Not Be Faithful: Prob- ing Object Hallucination in Vision-Language Pre-training. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023

work page 2023

[9] [9]

Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Gregor Geigle, Radu Timofte, and Goran Glavaš. Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

work page 2024

[10] [10]

Visual Hallucinations of Multi-modal Large Language Models

Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual Hallucinations of Multi-modal Large Language Models. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2024

work page 2024

[11] [11]

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[12] [12]

A Style-Based Generator Architecture for Generative Adversarial Networks

Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019

[13] [13]

SegmentAnything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, SpencerWhitehead, AlexanderCBerg, Wan-YenLo, etal. SegmentAnything. InInternationalConference on Computer Vision (ICCV), 2023

work page 2023

[14] [14]

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning Segmentation via Large Language Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[15] [15]

BLIP-2: Bootstrapping Language-Image Pre- training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre- training with Frozen Image Encoders and Large Language Models. InInternational Conference on Machine Learning (ICML), 2023

work page 2023

[16] [16]

ZONE: Zero-Shot Instruction-Guided Local Editing

Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xiuhui Liu, Jiaming Liu, Lin Li, Xu Tang, Yao Hu, Jianzhuang Liu, et al. ZONE: Zero-Shot Instruction-Guided Local Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[17] [17]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2023

work page 2023

[18] [18]

Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering

Zujie Liang, Weitao Jiang, Haifeng Hu, and Jiaying Zhu. Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020

[19] [19]

GRES: Generalized Referring Expression Segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized Referring Expression Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 12

work page 2023

[20] [20]

PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset

Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[21] [21]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement.arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Visual Instruction Tuning

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae. Visual Instruction Tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[23] [23]

Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models. InProceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), 2024

work page 2024

[24] [24]

Groma: Localized Visual Tokeniza- tion for Grounding Multimodal Large Language Models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized Visual Tokeniza- tion for Grounding Multimodal Large Language Models. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024

[25] [25]

CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

Kiet A Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, and Ismini Lourentzou. CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[26] [26]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. InInternational Conference on Machine Learning (ICML), 2022

work page 2022

[27] [27]

Counterfactual VQA: A Cause-Effect Look at Language Bias

Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A Cause-Effect Look at Language Bias. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021

[28] [28]

Counterfactual Vision-and-Language Navigation: Unravelling the Unseen

Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, and Anton Van den Hengel. Counterfactual Vision-and-Language Navigation: Unravelling the Unseen. In Advances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020

[29] [29]

GLaMM: Pixel Grounding Large Multimodal Model

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. GLaMM: Pixel Grounding Large Multimodal Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[30] [30]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks.arXiv preprint arXiv:2401.14159, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

PixelLM: Pixel Reasoning with Large Multimodal Model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel Reasoning with Large Multimodal Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[32] [32]

Object Hallucina- tion in Image Captioning

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucina- tion in Image Captioning. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2018. 13

work page 2018

[33] [33]

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Neelabh Sinha, Vinija Jain, and Aman Chadha. Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types. InProceedings of the First Workshop of Evaluation of Multi-Modal Generation, 2025

work page 2025

[34] [34]

Rethinking Visual Counterfactual Explanations Through Region Constraint

Bartlomiej Sobieski, Jakub Grzywaczewski, Bartłomiej Sadlej, Matthew Tivnan, and Przemyslaw Biecek. Rethinking Visual Counterfactual Explanations Through Region Constraint. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024

[35] [35]

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, and Yu-Gang Jiang. Doubly Abductive Counterfactual Inference for Text-based Image Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[36] [36]

Features of Similarity.Psychological review, 1977

Amos Tversky. Features of Similarity.Psychological review, 1977

work page 1977

[37] [37]

PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation.arXiv preprint arXiv:2412.15209, 2024

Muntasir Wahed, Kiet A Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou. PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation.arXiv preprint arXiv:2412.15209, 2024

work page arXiv 2024

[38] [38]

Counterfactual Cycle- Consistent Learning for Instruction Following and Generation in Vision-Language Navigation

Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. Counterfactual Cycle- Consistent Learning for Instruction Following and Generation in Vision-Language Navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[39] [39]

OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. InInternational Conference on Machine Learning (ICML). Proceedings of Machine Learning Research (PMLR), 2022

work page 2022

[40] [40]

Instructedit: Improving automatic masks for diffusion-based image editing with user instructions.ArXiv, abs/2305.18047, 2023

Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions.arXiv preprint arXiv:2305.18047, 2023

work page arXiv 2023

[41] [41]

Hyperseg: Towards universal visual segmentation with large language model, 2024

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. HyperSeg: Towards Universal Visual Segmentation with Large Language Model.arXiv preprint arXiv:2411.17606, 2024

work page arXiv 2024

[42] [42]

InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models.arXiv preprint arXiv:2412.14006, 2024

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng Zhao, and Yujiu Yang. InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models.arXiv preprint arXiv:2412.14006, 2024

work page arXiv 2024

[43] [43]

Towards Robust Referring Image Segmentation.IEEE Transactions on Image Processing (TIP), 2024

Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, and Dacheng Tao. Towards Robust Referring Image Segmentation.IEEE Transactions on Image Processing (TIP), 2024

work page 2024

[44] [44]

See, Say, and Segment: Teaching LMMs to Overcome False Premises

Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E Gonzalez, and Trevor Darrell. See, Say, and Segment: Teaching LMMs to Overcome False Premises. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[45] [45]

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized Segmentation via Multimodal Large Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[46] [46]

Benchmarking Segmentation Models with Mask-Preserved Attribute Editing

Zijin Yin, Kongming Liang, Bing Li, Zhanyu Ma, and Jun Guo. Benchmarking Segmentation Models with Mask-Preserved Attribute Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 14

work page 2024

[47] [47]

Modeling Context in Referring Expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling Context in Referring Expressions. InEuropean Conference on Computer Vision (ECCV), 2016

work page 2016

[48] [48]

Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption.arXiv preprint arXiv:2310.01779,

Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, and Manling Li. HallE- Control: ControllingObjectHallucinationinLargeMultimodalModels. arXivpreprintarXiv:2310.01779 , 2023

work page arXiv 2023

[49] [49]

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[50] [50]

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understand- ing

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understand- ing. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[51] [51]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.arXiv preprint arXiv:2309.01219, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[52] [52]

elephant

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. InInternational Conference on Computer Vision (ICCV), 2017. 15 A. HalluSegBench Details Motivation. HalluSegBenchintroduces a counterfactual visual reasoning framework to evaluate segmenta- tion models under contro...

work page arXiv 2017

[53] [53]

{label}", described as

A binary mask marking an object labeled "{label}", described as "{description}". In case of vague or wrong descriptions, follow the image and mask. Task: - Locate the masked object precisely. - Create a replacement instruction that: • Uniquely identifies the object (position, color, size, etc.) • Swaps it for a new object that is not already present. • Ne...

work page