MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models
Pith reviewed 2026-05-19 13:02 UTC · model grok-4.3
The pith
Vision-language models display weak mutual exclusivity bias but can use spatial context to handle ambiguous novel objects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MEBench is a benchmark that evaluates mutual exclusivity bias in VLMs by incorporating spatial reasoning into tasks with novel objects. Using a flexible data generation pipeline to construct diverse annotated scenes, the authors test models and find they exhibit weak ME bias while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings.
What carries the argument
MEBench benchmark together with its data generation pipeline, which builds annotated scenes containing novel objects to probe ME-based reasoning under varying spatial conditions.
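To make the idea of an annotated scene concrete, here is a minimal Python sketch of what one trial record could look like. Everything in it is an assumption made for illustration: the `SceneObject` and `Trial` names, their fields, the pseudo-word label, and the 2D positions are hypothetical and are not taken from the MEBench pipeline.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class SceneObject:
    name: str          # label used to refer to the object
    is_novel: bool     # True if the object is assumed unfamiliar to the model
    position: tuple    # hypothetical (x, y) location in the scene


@dataclass(frozen=True)
class Trial:
    objects: tuple     # all objects visible in the scene
    query_label: str   # novel word the model is asked to ground
    target_index: int  # index of the object the label should map to


def make_trial(familiar, novel, rng):
    """Build one trial from familiar and novel names; the target is a novel object."""
    names = familiar + novel
    objs = tuple(
        SceneObject(name=n, is_novel=(n in novel),
                    position=(rng.random(), rng.random()))
        for n in names
    )
    novel_indices = [i for i, o in enumerate(objs) if o.is_novel]
    target = rng.choice(novel_indices)
    return Trial(objects=objs, query_label=objs[target].name, target_index=target)


# Hypothetical example: two familiar objects and one novel object.
rng = random.Random(0)
trial = make_trial(["cup", "ball"], ["dax"], rng)
```

With a single novel object the target is unambiguous by exclusion. Passing several novel names gives the multiple-novel-object setting, where exclusion alone no longer identifies the target and an extra cue such as position would be needed.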
If this is right
- VLMs may require additional mechanisms to match the efficiency of child word learning that relies on mutual exclusivity.
- Spatial context supplies a usable cue for resolving label ambiguity when several novel objects appear together.
- The scalable scene-generation pipeline supports repeatable experiments on how visual structure affects reasoning biases.
Where Pith is reading between the lines
- Training regimes that reward explicit avoidance of familiar labels for new items could strengthen ME-like behavior in future models.
- The same controlled scene pipeline could be reused to measure other developmental biases such as shape bias or basic-level naming preferences.
Load-bearing premise
The generated scenes and novel-object labels in the data pipeline faithfully reproduce the cognitive conditions under which mutual exclusivity bias is observed in children without introducing artifacts.
What would settle it
Finding that VLMs strongly apply mutual exclusivity even without spatial context, or that spatial context produces no measurable improvement in disambiguating multiple novel objects, would falsify the reported pattern of weak bias with partial context use.
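The falsification condition above amounts to comparing two rates against a baseline. The sketch below shows one way such a check might be set up. It assumes a hypothetical per-trial record of whether the model picked a novel object; `novel_choice_rate`, `chance_rate`, and the made-up outcome lists are illustrative only and are not the paper's ME-bias or spatial-context metrics, which this page does not define.

```python
def novel_choice_rate(picked_novel):
    """Fraction of trials in which the model selected a novel object."""
    if not picked_novel:
        raise ValueError("need at least one trial")
    return sum(picked_novel) / len(picked_novel)


def chance_rate(num_novel, num_objects):
    """Rate expected from a model that picks uniformly at random."""
    return num_novel / num_objects


# Made-up outcomes for illustration only (1 = novel object chosen).
without_context = [1, 0, 1, 0, 0, 1, 0, 1]
with_context = [1, 1, 1, 0, 1, 1, 0, 1]

baseline = chance_rate(num_novel=1, num_objects=2)
gap_without = novel_choice_rate(without_context) - baseline
gap_with = novel_choice_rate(with_context) - baseline
```

Under these assumptions, a large `gap_without` would point to strong bias without spatial context, and a `gap_with` no larger than `gap_without` would point to no benefit from context. Either outcome would run against the reported pattern.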
Original abstract
This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes. We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. We find that these VLMs exhibit weak ME bias, while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings. Project page: http://mebench.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MEBench, a benchmark for evaluating mutual exclusivity (ME) bias in vision-language models that extends traditional tasks by incorporating spatial reasoning. It describes a flexible, scalable data-generation pipeline for constructing diverse annotated scenes with novel objects and reports results from evaluating several VLMs using custom metrics. The central empirical claim is that the tested VLMs exhibit weak ME bias while showing some ability to leverage extra spatial context to resolve ambiguity in multiple-novel-object settings.
Significance. If the benchmark scenes and metrics faithfully reproduce the referential ambiguity and exclusion logic of developmental ME studies, the work would offer a useful controlled testbed for probing reasoning limitations in current VLMs. The scalable pipeline itself is a constructive contribution that could support future ablation studies. The reported finding of weak bias plus partial spatial-context use would then be a concrete, falsifiable observation about VLM capabilities.
major comments (2)
- Data Generation Pipeline: The central claim that VLMs exhibit weak ME bias (and partial spatial-context leverage) rests on the unvalidated assumption that the synthetic scenes and novel-object labels reproduce the precise referential ambiguity and exclusion conditions used in child word-learning experiments. No human-child replication baseline or ablation isolating training-data leakage, unnatural spatial statistics, or label-distribution artifacts is described, leaving open the possibility that the measured 'weak bias' reflects pipeline-specific heuristics rather than ME reasoning per se.
- Evaluation Metrics: The abstract states that novel metrics are used to capture key aspects of ME-based reasoning, yet supplies no definitions, formulas, or controls for label leakage and statistical significance. Without these details the reported performance differences cannot be verified or compared to prior ME literature, undermining the empirical conclusions.
minor comments (2)
- Add explicit section headings and a table summarizing the exact VLMs, scene counts, and metric formulas so readers can trace the pipeline from generation to scoring.
- Clarify in the methods whether the spatial arrangements are generated from fixed templates or learned distributions, and state any constraints on object co-occurrence frequencies.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below, providing the strongest honest defense of our work while acknowledging areas where revisions are warranted. We have updated the manuscript to incorporate additional details and ablations where feasible.
Point-by-point responses
- Referee: Data Generation Pipeline: The central claim that VLMs exhibit weak ME bias (and partial spatial-context leverage) rests on the unvalidated assumption that the synthetic scenes and novel-object labels reproduce the precise referential ambiguity and exclusion conditions used in child word-learning experiments. No human-child replication baseline or ablation isolating training-data leakage, unnatural spatial statistics, or label-distribution artifacts is described, leaving open the possibility that the measured 'weak bias' reflects pipeline-specific heuristics rather than ME reasoning per se.
Authors: We appreciate the referee's emphasis on validation. The data-generation pipeline was constructed to directly mirror the referential ambiguity and exclusion logic from classic developmental ME studies (e.g., one familiar object paired with one or more novel objects under controlled spatial layouts). Novel object labels were generated to be out-of-distribution relative to common VLM training corpora. While a full human-child replication baseline is beyond the scope of the current computational study, we have added a new subsection with targeted ablations that isolate the effects of spatial statistics and label distributions. These ablations show that the weak ME bias persists across varied configurations, supporting that the result is not an artifact of the pipeline alone. We have also expanded the discussion of design choices grounded in the ME literature. revision: partial
- Referee: Evaluation Metrics: The abstract states that novel metrics are used to capture key aspects of ME-based reasoning, yet supplies no definitions, formulas, or controls for label leakage and statistical significance. Without these details the reported performance differences cannot be verified or compared to prior ME literature, undermining the empirical conclusions.
Authors: We agree that the abstract was insufficiently explicit. The novel metrics (ME-bias score and spatial-context utilization score) are formally defined with equations in Section 4 of the main text, including controls that ensure novel labels do not appear in the evaluated models' training data. We have now moved concise definitions and formulas into the abstract itself. Statistical significance is assessed via bootstrap resampling and reported with p-values in the results tables. These additions allow direct verification and comparison to prior ME work. revision: yes
- A direct human-child replication experiment to empirically validate that the synthetic scenes produce equivalent referential ambiguity levels to classic developmental studies.
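The authors' response mentions bootstrap resampling, but this page gives no procedure. The sketch below shows one common variant, a paired bootstrap over per-trial correctness in two conditions. The function name, the number of resamples, the one-sided tail statistic, and the data are all assumptions for illustration and should not be read as the paper's actual analysis.

```python
import random


def paired_bootstrap(a, b, n_resamples=2000, seed=0):
    """Return (observed mean of b - a, share of resamples whose mean difference is <= 0)."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    observed = (sum(b) - sum(a)) / n
    non_positive = 0
    for _ in range(n_resamples):
        # Resample trial indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(b[i] - a[i] for i in idx) / n
        if diff <= 0:
            non_positive += 1
    return observed, non_positive / n_resamples


# Made-up per-trial correctness (1 = correct) for the same trials in two conditions.
no_context = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
with_context = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

observed_diff, tail_share = paired_bootstrap(no_context, with_context)
```

Pairing the resamples by trial keeps the comparison within the same scenes, so scene difficulty does not inflate the variance of the difference. A small `tail_share` would suggest the improvement is unlikely to be a resampling accident, though with ten made-up trials the number carries no real evidential weight.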
Circularity Check
Empirical benchmark with no derivation chain or self-referential results
The paper introduces MEBench as an empirical evaluation benchmark consisting of a synthetic scene generation pipeline and novel metrics for measuring VLM performance on mutual exclusivity tasks. No equations, first-principles derivations, fitted parameters, or predictions are present that could reduce to the paper's own inputs by construction. Results consist of direct measurements of model outputs against the generated test cases, which are independent of any internal fitting or self-citation that would create circularity. The work is therefore self-contained against external model evaluations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The spatial relations and novel-object placements in the generated scenes match the conditions under which children exhibit mutual exclusivity bias.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem.
"We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning."
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem.
"MEBench goes beyond prior work ... by introducing realistic, cluttered scenes and additional reasoning requirements ... spatial reasoning"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.