arxiv: 2505.20122 · v2 · submitted 2025-05-26 · 💻 cs.CV

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models

Pith reviewed 2026-05-19 13:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords: mutual exclusivity bias, vision-language models, benchmark, spatial reasoning, word learning, novel objects

The pith

Vision-language models display weak mutual exclusivity bias but can use spatial context to handle ambiguous novel objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces MEBench to test mutual exclusivity bias in vision-language models, the tendency during word learning to assume a new label applies to an unfamiliar object rather than a known one. The benchmark adds spatial reasoning to standard tasks, creating harder and more realistic settings with multiple novel objects. A flexible data generation pipeline produces diverse annotated scenes for controlled tests. New metrics evaluate key aspects of ME reasoning, and assessments of several VLMs show only weak bias overall alongside limited success at using added spatial details to resolve ambiguity.

Core claim

MEBench is a benchmark that evaluates mutual exclusivity bias in VLMs by incorporating spatial reasoning into tasks with novel objects. Using a flexible data generation pipeline to construct diverse annotated scenes, the authors test models and find they exhibit weak ME bias while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings.

What carries the argument

MEBench benchmark together with its data generation pipeline, which builds annotated scenes containing novel objects to probe ME-based reasoning under varying spatial conditions.

If this is right

  • VLMs may require additional mechanisms to match the efficiency of child word learning that relies on mutual exclusivity.
  • Spatial context supplies a usable cue for resolving label ambiguity when several novel objects appear together.
  • The scalable scene-generation pipeline supports repeatable experiments on how visual structure affects reasoning biases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that reward explicit avoidance of familiar labels for new items could strengthen ME-like behavior in future models.
  • The same controlled scene pipeline could be reused to measure other developmental biases such as shape bias or basic-level naming preferences.

Load-bearing premise

The generated scenes and novel-object labels in the data pipeline faithfully reproduce the cognitive conditions under which mutual exclusivity bias is observed in children without introducing artifacts.

What would settle it

Finding that VLMs strongly apply mutual exclusivity even without spatial context, or that spatial context produces no measurable improvement in disambiguating multiple novel objects, would falsify the reported pattern of weak bias with partial context use.

Figures

Figures reproduced from arXiv: 2505.20122 by Anh Thai, Bikram Boote, James M. Rehg, Stefan Stojanov, Zixuan Huang.

Figure 1
Figure 1: Mutual Exclusivity Bias Evaluation Settings. (a) Traditional ME bias evaluation in developmental psychology and early computational studies [6], (b) MEBench setup for classic ME bias testing, and (c) MEBench setup for evaluating ME bias in conjunction with spatial reasoning.
Figure 2
Figure 2: Example of Rendered Data for the MEBench Benchmark. We systematically generate diverse object configurations within varied room backgrounds, ensuring photorealistic renderings that capture realistic spatial arrangements and lighting conditions.
Figure 3
Figure 3: Novel Objects in MEBench. To minimize lexical leakage during evaluation, we constructed a database of novel objects using procedural generation in Blender [1] with geometry nodes from GeoShapeV2 [2] and Thingi10K [41], paired with pseudo-words as labels. … “Dog”:{“to left of”:[“dax”, “pig”]} we generate the natural language description: “The dog is to the left of the dax and the pig.” Notably, this scene des…
Figure 4
Figure 4: Object Detection Performance of VLMs on Known Objects in the (a) 1K-0U (1 known and 0 unknown objects), (b) 1K-1U (1 known and 1 unknown object), and (c) 2K-1U (2 known and 1 unknown object) settings. Each baseline is run three times, and the standard deviation of performance is shown as a vertical bar at each data point. Spatial Reasoning. To evaluate the model’s spatial reasoning ability, we analyze the improvement in t…
Figure 5
Figure 5: Mutual Exclusivity (ME) Analysis in the (Left) 1K-1U and (Middle) 2K-1U settings. These settings contain one novel object in the scene. The response types are categorized as follows: N → N denotes correctly assigning the novel label to the novel object, N → K represents misassigning the novel label to a known object, and N → Bg indicates misassigning the novel label to a background distractor or failing to…
Figure 6
Figure 6: Impact of Having Object Names and Scene Descriptions on ME Score, comparing three language prompts: Question-Only (e.g., “Where is the dax?”); Minimal Scene Context (e.g., “There are three objects in the scene: dog, a cat, and a dax. Where is the dax?”); and Full Scene Description, in which the models receive a detailed scene description. By comparing performance between (1) Question-Only and (2) Minimal Scene Context, we …
Figure 7
Figure 7: Illustrative Example of Expected Inputs and Outputs in MEBench. For each subtask, we present the expected visual input, text input, and the corresponding model output, demonstrating the structured evaluation process.
Figure 8
Figure 8: Data Generation Pipeline. We begin with 3D databases containing known objects, novel objects, and background room assets, from which we select and compose components into a 3D scene. The scene is then rendered from multiple camera viewpoints. During inference, we select a viewpoint where all objects are visible. Based on this selected view, we generate a spatial scene description and assign a novel label t…
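The Figure 3 and Figure 8 captions mention turning a structured spatial annotation into a natural language scene description, with one quoted example (“Dog”:{“to left of”:[“dax”, “pig”]} becoming “The dog is to the left of the dax and the pig.”). The following is a minimal sketch of how such a conversion could look, not the authors' implementation. The function names, the `RELATION_PHRASES` table, and every relation key other than "to left of" are hypothetical, and the dictionary layout is inferred from that single example.

```python
from typing import Dict, List

# Assumed layout, inferred from the one example in the captions:
# {subject: {relation: [target, ...]}}
Relations = Dict[str, Dict[str, List[str]]]

# Hypothetical mapping from annotation keys to fluent phrases.
# Only "to left of" appears in the source; the rest are illustrative.
RELATION_PHRASES = {
    "to left of": "to the left of",
    "to right of": "to the right of",
    "in front of": "in front of",
    "behind": "behind",
}


def join_names(names: List[str]) -> str:
    """Render targets as 'the a', 'the a and the b', or 'the a, the b, and the c'."""
    phrases = [f"the {name}" for name in names]
    if len(phrases) == 1:
        return phrases[0]
    if len(phrases) == 2:
        return f"{phrases[0]} and {phrases[1]}"
    return ", ".join(phrases[:-1]) + f", and {phrases[-1]}"


def describe_scene(relations: Relations) -> str:
    """Emit one sentence per (subject, relation) pair, joined by spaces."""
    sentences = []
    for subject, by_relation in relations.items():
        for relation, targets in by_relation.items():
            # Fall back to the raw key when no fluent phrase is registered.
            phrase = RELATION_PHRASES.get(relation, relation)
            sentences.append(
                f"The {subject.lower()} is {phrase} {join_names(targets)}."
            )
    return " ".join(sentences)


if __name__ == "__main__":
    print(describe_scene({"Dog": {"to left of": ["dax", "pig"]}}))
```

Keeping the phrase table separate from the sentence template means new relations can be added without touching the generation logic, which is one plausible way to keep such a pipeline "flexible"; whether MEBench is organized this way is not stated here.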
Original abstract

This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes. We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. We find that these VLMs exhibit weak ME bias, while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings. Project page: http://mebench.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MEBench, a benchmark for evaluating mutual exclusivity (ME) bias in vision-language models that extends traditional tasks by incorporating spatial reasoning. It describes a flexible, scalable data-generation pipeline for constructing diverse annotated scenes with novel objects and reports results from evaluating several VLMs using custom metrics. The central empirical claim is that the tested VLMs exhibit weak ME bias while showing some ability to leverage extra spatial context to resolve ambiguity in multiple-novel-object settings.

Significance. If the benchmark scenes and metrics faithfully reproduce the referential ambiguity and exclusion logic of developmental ME studies, the work would offer a useful controlled testbed for probing reasoning limitations in current VLMs. The scalable pipeline itself is a constructive contribution that could support future ablation studies. The reported finding of weak bias plus partial spatial-context use would then be a concrete, falsifiable observation about VLM capabilities.

major comments (2)
  1. Data Generation Pipeline: The central claim that VLMs exhibit weak ME bias (and partial spatial-context leverage) rests on the unvalidated assumption that the synthetic scenes and novel-object labels reproduce the precise referential ambiguity and exclusion conditions used in child word-learning experiments. No human-child replication baseline or ablation isolating training-data leakage, unnatural spatial statistics, or label-distribution artifacts is described, leaving open the possibility that the measured 'weak bias' reflects pipeline-specific heuristics rather than ME reasoning per se.
  2. Evaluation Metrics: The abstract states that novel metrics are used to capture key aspects of ME-based reasoning, yet supplies no definitions, formulas, or controls for label leakage and statistical significance. Without these details the reported performance differences cannot be verified or compared to prior ME literature, undermining the empirical conclusions.
minor comments (2)
  1. Add explicit section headings and a table summarizing the exact VLMs, scene counts, and metric formulas so readers can trace the pipeline from generation to scoring.
  2. Clarify in the methods whether the spatial arrangements are generated from fixed templates or learned distributions, and state any constraints on object co-occurrence frequencies.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below, providing the strongest honest defense of our work while acknowledging areas where revisions are warranted. We have updated the manuscript to incorporate additional details and ablations where feasible.

Point-by-point responses
  1. Referee: Data Generation Pipeline: The central claim that VLMs exhibit weak ME bias (and partial spatial-context leverage) rests on the unvalidated assumption that the synthetic scenes and novel-object labels reproduce the precise referential ambiguity and exclusion conditions used in child word-learning experiments. No human-child replication baseline or ablation isolating training-data leakage, unnatural spatial statistics, or label-distribution artifacts is described, leaving open the possibility that the measured 'weak bias' reflects pipeline-specific heuristics rather than ME reasoning per se.

    Authors: We appreciate the referee's emphasis on validation. The data-generation pipeline was constructed to directly mirror the referential ambiguity and exclusion logic from classic developmental ME studies (e.g., one familiar object paired with one or more novel objects under controlled spatial layouts). Novel object labels were generated to be out-of-distribution relative to common VLM training corpora. While a full human-child replication baseline is beyond the scope of the current computational study, we have added a new subsection with targeted ablations that isolate the effects of spatial statistics and label distributions. These ablations show that the weak ME bias persists across varied configurations, supporting that the result is not an artifact of the pipeline alone. We have also expanded the discussion of design choices grounded in the ME literature. revision: partial

  2. Referee: Evaluation Metrics: The abstract states that novel metrics are used to capture key aspects of ME-based reasoning, yet supplies no definitions, formulas, or controls for label leakage and statistical significance. Without these details the reported performance differences cannot be verified or compared to prior ME literature, undermining the empirical conclusions.

    Authors: We agree that the abstract was insufficiently explicit. The novel metrics (ME-bias score and spatial-context utilization score) are formally defined with equations in Section 4 of the main text, including controls that ensure novel labels do not appear in the evaluated models' training data. We have now moved concise definitions and formulas into the abstract itself. Statistical significance is assessed via bootstrap resampling and reported with p-values in the results tables. These additions allow direct verification and comparison to prior ME work. revision: yes
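The simulated response above refers to an ME-bias score and to bootstrap resampling, but this page reproduces neither the paper's formulas nor its data. As a rough illustration only, the sketch below tallies the response types named in the Figure 5 caption (N → N, N → K, N → Bg) and puts a percentile bootstrap confidence interval around the fraction of N → N responses. Treating that fraction as the score is an assumption made here, and the function names, the ASCII labels such as `"N->N"`, and the example counts are all hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

# Response labels follow the Figure 5 caption, written in ASCII.
NOVEL, KNOWN, BACKGROUND = "N->N", "N->K", "N->Bg"


def novel_hit_rate(responses: List[str]) -> float:
    """Fraction of trials where the novel label landed on the novel object.

    Assumption: used here as a stand-in score; the paper's actual
    ME-bias score is not defined on this page.
    """
    return Counter(responses)[NOVEL] / len(responses)


def bootstrap_ci(
    responses: List[str],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for novel_hit_rate."""
    rng = random.Random(seed)
    n = len(responses)
    stats = sorted(
        # Resample trials with replacement and recompute the statistic.
        novel_hit_rate([rng.choice(responses) for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up tallies for illustration, not results from the paper.
    trials = [NOVEL] * 6 + [KNOWN] * 3 + [BACKGROUND] * 1
    print(novel_hit_rate(trials), bootstrap_ci(trials))
```

Resampling at the trial level assumes trials are independent; if several trials share a scene or a novel object, resampling whole scenes would be the more cautious choice.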

standing simulated objections not resolved
  • A direct human-child replication experiment to empirically validate that the synthetic scenes produce equivalent referential ambiguity levels to classic developmental studies.

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential results

full rationale

The paper introduces MEBench as an empirical evaluation benchmark consisting of a synthetic scene generation pipeline and novel metrics for measuring VLM performance on mutual exclusivity tasks. No equations, first-principles derivations, fitted parameters, or predictions are present that could reduce to the paper's own inputs by construction. Results consist of direct measurements of model outputs against the generated test cases, which are independent of any internal fitting or self-citation that would create circularity. The work is therefore self-contained against external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the synthetic scenes isolate mutual-exclusivity bias without confounding factors from the generation process itself. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption The spatial relations and novel-object placements in the generated scenes match the conditions under which children exhibit mutual exclusivity bias.
    Invoked when the benchmark is presented as a more realistic extension of traditional ME tasks.

pith-pipeline@v0.9.0 · 5663 in / 1242 out tokens · 41262 ms · 2026-05-19T13:02:46.665873+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

  1. [1]

    Blender, https://blender.org/.

  2. [2]

    Geoshapes add-on, https://blendermarket.com/products/geoshapes-parametric-geometry-node-objects.

  3. [3]

    Harsh Agrawal, Eli A. Meirom, Yuval Atzmon, Shie Mannor, and Gal Chechik. Known unknowns: Learning novel concepts using reasoning-by-elimination. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 504–514. PMLR, 2021.

  4. [4]

    Anthropic. Claude 3. https://www.anthropic.com/index/claude, 2024. Accessed: 2024-02-28.

  5. [5]

    Chuang Gan, Siyuan Zhou, Jeremy Schwartz, Seth Alter, Abhishek Bhandwaldar, Dan Gutfreund, Daniel LK Yamins, James J DiCarlo, Josh McDermott, Antonio Torralba, et al. The threedworld transport challenge: A visually guided task-and-motion planning benchmark for physically realistic embodied ai. arXiv preprint arXiv:2103.14025, 2021.

  6. [6]

    Kanishk Gandhi and Brenden M Lake. Mutual exclusivity as a challenge for deep neural networks. Advances in Neural Information Processing Systems, 33:14182–14192, 2020.

  7. [7]

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022.

  8. [8]

    Felix Hill, Olivier Tieleman, Tamara Von Glehn, Nathaniel Wong, Hamza Merzic, and Stephen Clark. Grounded language learning fast and slow. arXiv preprint arXiv:2009.01719, 2020.

  9. [9]

    Jessica S Horst and Michael C Hout. The novel object and unusual name (noun) database: A collection of novel images for use in experimental research. Behavior research methods, 48(4):1393–1409, 2016.

  10. [10]

    Guangyuan Jiang, Manjie Xu, Shiji Xin, Wei Liang, Yujia Peng, Chi Zhang, and Yixin Zhu. Mewl: Few-shot multimodal word learning with referential uncertainty. arXiv preprint arXiv:2306.00503, 2023.

  11. [11]

    Huaizu Jiang, Xiaojian Ma, Weili Nie, Zhiding Yu, Yuke Zhu, and Anima Anandkumar. Bongard-hoi: Benchmarking few-shot visual reasoning for human-object interactions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19056–19065, 2022.

  12. [12]

    KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5830–5840, 2021.

  13. [13]

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.

  14. [14]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  15. [15]

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.

  16. [16]

    Ellen M Markman and Gwyn F Wachtel. Children’s use of mutual exclusivity to constrain the meanings of words. Cognitive psychology, 20(2):121–157, 1988.

  17. [17]

    Ellen M Markman, Judith L Wasow, and Mikkel B Hansen. Use of the mutual exclusivity assumption by young word learners. Cognitive psychology, 47(3):241–275, 2003.

  18. [18]

    Bob McMurray. Defusing the childhood vocabulary explosion. Science, 317(5838):631–631, 2007.

  19. [19]

    OpenAI. Gpt-4 with vision (gpt-4v). https://openai.com/research/gpt-4, 2023. Accessed: 2024-02-28.

  20. [20]

    Deepan Chakravarthi Padmanabhan, Shruthi Gowda, Elahe Arani, and Bahram Zonooz. Lsfsl: Leveraging shape information in few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4970–4979, 2023.

  21. [21]

    Xinyu Pi, Mingyuan Wu, Jize Jiang, Haozhen Zheng, Beitong Tian, Chengxiang Zhai, Klara Nahrstedt, and Zhiting Hu. Uouo: Uncontextualized uncommon objects for measuring knowledge horizons of vision language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6432–6441, 2024.

  22. [22]

    Shitala Prasad, Yiqun Li, Dongyun Lin, and Aiyuan Guo. Implicit shape biased few-shot learning for 3d object generalization. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3436–3440. IEEE, 2022.

  23. [23]

    Alexander Raistrick, Lingjie Mei, Karhan Kayan, David Yan, Yiming Zuo, Beining Han, Hongyu Wen, Meenal Parakh, Stamatis Alexandropoulos, Lahav Lipson, Zeyu Ma, and Jia Deng. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21783–...

  24. [24]

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024.

  25. [25]

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.

  26. [26]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  27. [27]

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.

  28. [28]

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019.

  29. [29]

    Linda Smith and Chen Yu. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106(3):1558–1568, 2008.

  30. [30]

    Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021.

  31. [31]

    Stefan Stojanov, Anh Thai, Zixuan Huang, and James M. Rehg. Learning dense object descriptors from multiple views for low-shot category generalization. In Advances in Neural Information Processing Systems, pages 12566–12580, 2022.

  32. [32]

    Gemini Team. Gemini: A family of highly capable multimodal models, 2024.

  33. [33]

    Anh Thai, Ahmad Humayun, Stefan Stojanov, Zixuan Huang, Bikram Boote, and James M Rehg. Low-shot object learning with mutual exclusivity bias. Advances in Neural Information Processing Systems, 36:70208–70228, 2023.

  34. [34]

    Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, et al. Yolov10: Real-time end-to-end object detection. Advances in Neural Information Processing Systems, 37:107984–108011, 2025.

  35. [35]

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. Cogvlm: Visual expert for large language models. 2023.

  36. [36]

    Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav ARORA, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jerome Revaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion. In Advances in Neural Information Processing Systems.

  37. [37]

    Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, and Chen Change Loy. F-lmm: Grounding frozen large multimodal models. arXiv preprint arXiv:2406.05821,

  38. [38]

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803, 2023.

  39. [39]

    Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025.

  40. [40]

    Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in Neural Information Processing Systems, 37:71737–71767, 2025.

  41. [41]

    Qingnan Zhou and Alec Jacobson. Thingi10k: A dataset of 10,000 3d-printing models. arXiv preprint arXiv:1605.04797,