Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Pith reviewed 2026-05-19 07:32 UTC · model grok-4.3
The pith
Counterfactual fine-tuning trains segmentation models to abstain from masking absent objects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By pairing each factual image with a controlled visual counterfactual in which the referenced object is removed or altered, a segmentation VLM can be trained to output a mask only when the object is present and to abstain otherwise, thereby cutting pixel-grounding hallucinations while preserving or improving segmentation quality.
What carries the argument
Counterfactual fine-tuning (CFT), which exposes the model to matched factual-counterfactual image pairs so it learns the visual conditions under which segmentation is appropriate.
If this is right
- Models learn an explicit abstention signal tied directly to the presence or absence of the queried object in the visual input.
- Vision-driven and language-driven hallucinations can be measured and reported separately using the new severity and disentanglement metrics.
- Segmentation accuracy rises on FP-RefCOCO(+/g) while hallucination rates fall.
- The same training recipe can be applied to any segmentation VLM that accepts image-text pairs.
Where Pith is reading between the lines
- The same factual-counterfactual pairing could be adapted to reduce grounding errors in object detection or visual question answering.
- Automatically generated counterfactuals, rather than hand-crafted ones, would allow the method to scale to larger unlabeled datasets.
- Combining the visual abstention signal with existing language-only hallucination detectors could address mixed failure modes more completely.
Load-bearing premise
The controlled visual changes used to build the counterfactual images isolate vision-driven hallucinations without adding artifacts or biases that would change how the model behaves or how the metrics are scored.
What would settle it
Run RobustSeg on a fresh collection of natural images that lack matched counterfactual versions and measure whether the reported 30 percent hallucination reduction disappears or whether accuracy on the original benchmarks falls.
Figures
read the original abstract
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes Counterfactual Segmentation Reasoning (CSR) for segmentation VLMs to diagnose pixel-grounding hallucinations. It introduces HalluSegBench, a benchmark built on controlled visual counterfactuals for referring and reasoning expression segmentation, along with new metrics that quantify hallucination severity and attempt to disentangle vision-driven versus language-driven failures. It further proposes RobustSeg, a model trained via counterfactual fine-tuning (CFT), and reports that this approach reduces hallucinations by 30% while improving performance on FP-RefCOCO(+/g).
Significance. If the central empirical claims hold after proper validation, the work would meaningfully advance evaluation of grounded VLMs by shifting emphasis from text/label perturbations to vision-focused counterfactuals. The benchmark construction, severity metrics, and CFT training procedure represent practical contributions that could help the community better isolate and mitigate vision-driven hallucinations. The attempt to disentangle failure modes is a notable strength if the metrics prove robust.
major comments (3)
- [Abstract] Abstract: The claim that RobustSeg 'reduces hallucinations by 30%' and improves segmentation on FP-RefCOCO(+/g) is presented without any information on dataset size, counterfactual generation procedure, baseline models, statistical significance testing, or precise definitions of the severity metrics. This absence prevents assessment of whether the reported gains are load-bearing or reproducible.
- [HalluSegBench construction] HalluSegBench construction (methods section): The central assumption that controlled visual counterfactuals (object removal or scene editing) isolate vision-driven hallucinations without introducing new artifacts is not validated. Unnatural textures, lighting inconsistencies, or correlated semantic shifts could cause models to abstain for low-level visual reasons rather than grounding failures, confounding both the severity metrics and the claimed CFT benefit.
- [Evaluation metrics] Evaluation metrics section: The new metrics intended to measure hallucination severity and disentangle vision- versus language-driven modes lack explicit formulations, ablation studies, or controls for potential biases introduced by the counterfactual generation process. Without these, it is unclear whether the metrics achieve the claimed disentanglement or simply reflect artifacts in the benchmark.
minor comments (2)
- [Introduction] The acronym CSR is defined in the abstract but could be restated at first use in the introduction for improved readability.
- [Evaluation metrics] Notation for the new severity and disentanglement scores should be accompanied by explicit equations or pseudocode to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work formalizing Counterfactual Segmentation Reasoning and introducing HalluSegBench along with RobustSeg. The comments highlight important areas for improving clarity and rigor. We address each major comment point-by-point below, providing explanations grounded in the manuscript and indicating revisions where they will strengthen the presentation without altering core claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that RobustSeg 'reduces hallucinations by 30%' and improves segmentation on FP-RefCOCO(+/g) is presented without any information on dataset size, counterfactual generation procedure, baseline models, statistical significance testing, or precise definitions of the severity metrics. This absence prevents assessment of whether the reported gains are load-bearing or reproducible.
Authors: We acknowledge that the abstract is concise by design and omits granular details to fit length constraints. The full manuscript specifies HalluSegBench scale (thousands of counterfactual pairs derived from RefCOCO and similar sources), the counterfactual generation via object removal and scene editing in Section 3, baselines including standard segmentation VLMs, severity metrics defined in Section 3.2 as normalized hallucinated pixel ratios with vision/language disentanglement, and statistical significance via paired t-tests with p-values reported in Section 4. To improve immediate assessability, we will revise the abstract to include brief mentions of benchmark size, the 30% reduction context, and reference to significance testing. revision: yes
-
Referee: [HalluSegBench construction] HalluSegBench construction (methods section): The central assumption that controlled visual counterfactuals (object removal or scene editing) isolate vision-driven hallucinations without introducing new artifacts is not validated. Unnatural textures, lighting inconsistencies, or correlated semantic shifts could cause models to abstain for low-level visual reasons rather than grounding failures, confounding both the severity metrics and the claimed CFT benefit.
Authors: This concern about potential visual artifacts is well-taken and directly relevant to the validity of CSR. The manuscript describes use of controlled editing pipelines with post-generation filtering for visual coherence and includes qualitative examples plus controls testing abstention on referent-absent counterfactuals. However, we agree explicit validation against low-level confounds would strengthen the work. We will add a dedicated subsection with quantitative checks (e.g., human ratings of naturalness and model performance on artifact-controlled subsets) and an ablation isolating editing effects. revision: yes
-
Referee: [Evaluation metrics] Evaluation metrics section: The new metrics intended to measure hallucination severity and disentangle vision- versus language-driven modes lack explicit formulations, ablation studies, or controls for potential biases introduced by the counterfactual generation process. Without these, it is unclear whether the metrics achieve the claimed disentanglement or simply reflect artifacts in the benchmark.
Authors: We agree that explicit formulations and controls are essential for the metrics' credibility. Section 3.3 provides the severity metric as the fraction of hallucinated area in counterfactuals where abstention fails, with vision-driven failures isolated by holding language fixed and varying visuals, and language-driven by the converse; initial ablations correlate with human annotations. To address the referee's point fully, we will expand this section with complete mathematical definitions, additional ablation tables controlling for generation biases (e.g., texture/lighting variants), and bias analysis results in the revised manuscript. revision: yes
Circularity Check
No circularity: empirical benchmark construction and fine-tuning
full rationale
The paper introduces the CSR task, curates HalluSegBench via controlled visual counterfactuals, defines new severity metrics, and trains RobustSeg with counterfactual fine-tuning (CFT). No equations, derivations, parameter fittings, or self-citation chains are present that would reduce any claim to its inputs by construction. Central results (30% hallucination reduction, FP-RefCOCO gains) are reported from direct experiments on the newly constructed benchmark and external datasets, rendering the work self-contained against external benchmarks with no load-bearing reductions to prior fitted quantities or author-defined uniqueness.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Visual counterfactuals can be constructed to isolate vision-driven hallucinations without confounding artifacts
invented entities (1)
-
RobustSeg
no independent evidence
Forward citations
Cited by 3 Pith papers
-
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
3D-VCD reduces hallucinations in 3D-LLM embodied agents by contrasting predictions from original and distorted 3D scene representations at inference time.
-
VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
VISTAQA is a new benchmark for joint visual question answering correctness and pixel-level grounding, evaluated with the GROVE metric that uses per-sample geometric mean to require both dimensions to succeed.
-
From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac, Jean-Baptiste and Donahue, Jeff and Luc, Pauline and Miech, Antoine and Barr, Iain and Hasson, Yana and Lenc, Karel and Mensch, Arthur and Millican, Katherine and Reynolds, Malcolm and others. Flamingo: a Visual Language Model for Few-Shot Learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[2]
Blended Latent Diffusion.ACM Transactions on Graphics (TOG), 2023
Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion.ACM Transactions on Graphics (TOG), 2023
work page 2023
-
[3]
Mitigating Open- Vocabulary Caption Hallucinations
Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, and Hadar Averbuch-Elor. Mitigating Open- Vocabulary Caption Hallucinations. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
work page 2024
-
[4]
Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting.arXiv preprint arXiv:2503.21770, 2025
-
[5]
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to Follow Image Editing Instructions. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[6]
UNITER: UNiversal Image-TExt Representation Learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. InEuropean Conference on Computer Vision (ECCV), 2020. 11
work page 2020
-
[7]
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabil- ities
Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brain and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabil- ities. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[8]
Plausible May Not Be Faithful: Prob- ing Object Hallucination in Vision-Language Pre-training
Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. Plausible May Not Be Faithful: Prob- ing Object Hallucination in Vision-Language Pre-training. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023
work page 2023
-
[9]
Gregor Geigle, Radu Timofte, and Goran Glavaš. Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? InConference on Empirical Methods in Natural Language Processing (EMNLP), 2024
work page 2024
-
[10]
Visual Hallucinations of Multi-modal Large Language Models
Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual Hallucinations of Multi-modal Large Language Models. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
work page 2024
-
[11]
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[12]
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
work page 2019
-
[13]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, SpencerWhitehead, AlexanderCBerg, Wan-YenLo, etal. SegmentAnything. InInternationalConference on Computer Vision (ICCV), 2023
work page 2023
-
[14]
LISA: Reasoning Segmentation via Large Language Model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning Segmentation via Large Language Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[15]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre- training with Frozen Image Encoders and Large Language Models. InInternational Conference on Machine Learning (ICML), 2023
work page 2023
-
[16]
ZONE: Zero-Shot Instruction-Guided Local Editing
Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xiuhui Liu, Jiaming Liu, Lin Li, Xu Tang, Yao Hu, Jianzhuang Liu, et al. ZONE: Zero-Shot Instruction-Guided Local Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[17]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating Object Hallucination in Large Vision-Language Models. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
work page 2023
-
[18]
Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering
Zujie Liang, Weitao Jiang, Haifeng Hu, and Jiaying Zhu. Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2020
work page 2020
-
[19]
GRES: Generalized Referring Expression Segmentation
Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized Referring Expression Segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 12
work page 2023
-
[20]
PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset
Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[21]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement.arXiv preprint arXiv:2503.06520, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae. Visual Instruction Tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[23]
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models. InProceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), 2024
work page 2024
-
[24]
Groma: Localized Visual Tokeniza- tion for Grounding Multimodal Large Language Models
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized Visual Tokeniza- tion for Grounding Multimodal Large Language Models. InEuropean Conference on Computer Vision (ECCV), 2024
work page 2024
-
[25]
CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models
Kiet A Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, and Ismini Lourentzou. CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[26]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. InInternational Conference on Machine Learning (ICML), 2022
work page 2022
-
[27]
Counterfactual VQA: A Cause-Effect Look at Language Bias
Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A Cause-Effect Look at Language Bias. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
work page 2021
-
[28]
Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, and Anton Van den Hengel. Counterfactual Vision-and-Language Navigation: Unravelling the Unseen. In Advances in Neural Information Processing Systems (NeurIPS), 2020
work page 2020
-
[29]
GLaMM: Pixel Grounding Large Multimodal Model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. GLaMM: Pixel Grounding Large Multimodal Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[30]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks.arXiv preprint arXiv:2401.14159, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel Reasoning with Large Multimodal Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[32]
Object Hallucina- tion in Image Captioning
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucina- tion in Image Captioning. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2018. 13
work page 2018
-
[33]
Neelabh Sinha, Vinija Jain, and Aman Chadha. Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types. InProceedings of the First Workshop of Evaluation of Multi-Modal Generation, 2025
work page 2025
-
[34]
Rethinking Visual Counterfactual Explanations Through Region Constraint
Bartlomiej Sobieski, Jakub Grzywaczewski, Bartłomiej Sadlej, Matthew Tivnan, and Przemyslaw Biecek. Rethinking Visual Counterfactual Explanations Through Region Constraint. InInternational Conference on Learning Representations (ICLR), 2024
work page 2024
-
[35]
Doubly Abductive Counterfactual Inference for Text-based Image Editing
Xue Song, Jiequan Cui, Hanwang Zhang, Jingjing Chen, Richang Hong, and Yu-Gang Jiang. Doubly Abductive Counterfactual Inference for Text-based Image Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[36]
Features of Similarity.Psychological review, 1977
Amos Tversky. Features of Similarity.Psychological review, 1977
work page 1977
-
[37]
Muntasir Wahed, Kiet A Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, and Ismini Lourentzou. PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation.arXiv preprint arXiv:2412.15209, 2024
-
[38]
Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. Counterfactual Cycle- Consistent Learning for Instruction Following and Generation in Vision-Language Navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
work page 2022
-
[39]
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. InInternational Conference on Machine Learning (ICML). Proceedings of Machine Learning Research (PMLR), 2022
work page 2022
-
[40]
Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions.arXiv preprint arXiv:2305.18047, 2023
-
[41]
Hyperseg: Towards universal visual segmentation with large language model, 2024
Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, and Yujiu Yang. HyperSeg: Towards Universal Visual Segmentation with Large Language Model.arXiv preprint arXiv:2411.17606, 2024
-
[42]
Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Zheng Zhao, and Yujiu Yang. InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models.arXiv preprint arXiv:2412.14006, 2024
-
[43]
Towards Robust Referring Image Segmentation.IEEE Transactions on Image Processing (TIP), 2024
Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, and Dacheng Tao. Towards Robust Referring Image Segmentation.IEEE Transactions on Image Processing (TIP), 2024
work page 2024
-
[44]
See, Say, and Segment: Teaching LMMs to Overcome False Premises
Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E Gonzalez, and Trevor Darrell. See, Say, and Segment: Teaching LMMs to Overcome False Premises. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[45]
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized Segmentation via Multimodal Large Language Models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[46]
Benchmarking Segmentation Models with Mask-Preserved Attribute Editing
Zijin Yin, Kongming Liang, Bing Li, Zhanyu Ma, and Jun Guo. Benchmarking Segmentation Models with Mask-Preserved Attribute Editing. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 14
work page 2024
-
[47]
Modeling Context in Referring Expressions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling Context in Referring Expressions. InEuropean Conference on Computer Vision (ECCV), 2016
work page 2016
-
[48]
Bohan Zhai, Shijia Yang, Chenfeng Xu, Sheng Shen, Kurt Keutzer, Chunyuan Li, and Manling Li. HallE- Control: ControllingObjectHallucinationinLargeMultimodalModels. arXivpreprintarXiv:2310.01779 , 2023
-
[49]
MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing. InAdvances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[50]
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understand- ing
Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understand- ing. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[51]
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.arXiv preprint arXiv:2309.01219, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. InInternational Conference on Computer Vision (ICCV), 2017. 15 A. HalluSegBench Details Motivation. HalluSegBenchintroduces a counterfactual visual reasoning framework to evaluate segmenta- tion models under contro...
-
[53]
A binary mask marking an object labeled "{label}", described as "{description}". In case of vague or wrong descriptions, follow the image and mask. Task: - Locate the masked object precisely. - Create a replacement instruction that: • Uniquely identifies the object (position, color, size, etc.) • Swaps it for a new object that is not already present. • Ne...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.