Recognition: unknown
BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
Pith reviewed 2026-05-10 16:15 UTC · model grok-4.3
The pith
Current vision-language models lack genuine geometric comprehension and instead rely on texture and contextual shortcuts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By depriving VLMs of RGB information and presenting only boundary contours from a noise-free geometric taxonomy, the benchmark reveals a universal 'Texture Bias Cliff': even the most advanced models, including GPT-4.1 and Claude Sonnet 4.5, fail to identify shapes from geometry alone.
What carries the argument
The BareBones benchmark and its flagship WTP-Bench collection, which curate pixel-level silhouettes from segmentation datasets into fine-grained geometric puzzles that force reliance on shape alone.
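To make the curation step concrete, here is a minimal sketch of how a pixel-level segmentation mask could be reduced to a texture-free boundary contour. It is a sketch under assumptions: the function name, file paths, and contour thickness are illustrative, not the paper's actual pipeline.

```python
import numpy as np
from PIL import Image

def mask_to_contour(mask: np.ndarray, thickness: int = 2) -> np.ndarray:
    """Return a white-on-black boundary contour extracted from a binary mask."""
    mask = mask.astype(bool)
    # Erode the mask with a (2*thickness+1)^2 square window; a foreground pixel
    # survives only if its whole neighbourhood is foreground.
    padded = np.pad(mask, thickness, mode="edge")
    eroded = np.ones_like(mask)
    h, w = mask.shape
    for dy in range(2 * thickness + 1):
        for dx in range(2 * thickness + 1):
            eroded &= padded[dy:dy + h, dx:dx + w]
    # Boundary pixels are foreground pixels that did not survive the erosion.
    boundary = mask & ~eroded
    return (boundary * 255).astype(np.uint8)

if __name__ == "__main__":
    # "mask.png" stands in for any pixel-level annotation (e.g. a PASCAL VOC mask).
    mask = np.array(Image.open("mask.png").convert("L")) > 127
    Image.fromarray(mask_to_contour(mask)).save("contour.png")
```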
If this is right
- Advancements in VLMs will require methods that explicitly model geometric structure rather than texture patterns.
- Tasks involving precise spatial or shape-based reasoning may remain unreliable for current models.
- New training approaches could use silhouette data to reduce texture dependence.
- Evaluation protocols for multimodal models should include RGB-deprived tests to assess genuine comprehension.
Where Pith is reading between the lines
- Similar texture biases might affect performance in other zero-shot tasks like object detection or scene understanding.
- Developing models that can handle abstract shape recognition could improve generalization to novel environments.
- Researchers could test if fine-tuning on these benchmarks improves overall robustness.
Load-bearing premise
That the pixel-level silhouettes and the taxonomy in WTP-Bench provide no unintended semantic or contextual information that models could exploit beyond pure geometry.
What would settle it
Demonstrating that any current or future VLM achieves accuracy on the silhouette benchmark close to its performance on the original RGB images would challenge the existence of a universal texture bias.
Original abstract
While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce $\textbf{BareBones}$, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (eg. GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the $\textit{Texture Bias Cliff}$. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding. Project Page: https://eternal-f1ame.github.io/WTP-Bench/
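The evaluation the abstract describes amounts to issuing the same zero-shot classification prompt under two conditions, RGB and silhouette, and comparing accuracy. The sketch below shows one way that loop could be organized; `query_vlm`, the item schema, and the prompt wording are hypothetical stand-ins, since the paper's exact prompts are not reproduced here.

```python
from typing import Callable, Mapping, Sequence

PROMPT = ("You are shown a single image. Answer with exactly one label from this list: "
          "{labels}. Which object does the image depict?")

def accuracy(items: Sequence[Mapping[str, str]],
             labels: Sequence[str],
             query_vlm: Callable[[str, str], str],
             image_key: str) -> float:
    """Zero-shot accuracy over items, each assumed to carry image paths and a 'label' field."""
    prompt = PROMPT.format(labels=", ".join(labels))
    correct = 0
    for item in items:
        prediction = query_vlm(item[image_key], prompt)  # hypothetical model call
        correct += prediction.strip().lower() == item["label"].lower()
    return correct / len(items)

# The Texture Bias Cliff would show up as a large gap between these two numbers:
#   acc_rgb = accuracy(items, labels, query_vlm, image_key="rgb_path")
#   acc_sil = accuracy(items, labels, query_vlm, image_key="silhouette_path")
```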
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BareBones, a zero-shot benchmark using pixel-level silhouettes curated from ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200 and a new flagship WTP-Bench collection. It evaluates 26 proprietary and open-weight VLMs and reports a consistent, severe performance collapse when RGB texture is removed, which the authors term the Texture Bias Cliff, to argue that current models lack genuine geometric comprehension.
Significance. If the silhouettes and WTP-Bench taxonomy truly isolate geometric structure without semantic or class-prior leakage, the documented universal structural blindspots would provide a valuable, falsifiable yardstick for measuring progress toward geometric grounding in VLMs.
major comments (3)
- [Abstract and WTP-Bench construction] The central claim of a 'noise-free geometric taxonomy' and 'Texture Bias Cliff' rests on the assumption that boundary contours from the six datasets contain no exploitable class-specific shape statistics or annotation cues. No controls (novel synthetic shapes, prompt ablations, or human contour-only baselines) are described to rule out pre-training leakage via contour statistics when the zero-shot prompt supplies the class vocabulary.
- [Evaluation methodology] Exact prompts, the silhouette generation pipeline, and the annotation verification procedure are not reported in sufficient detail to allow independent reproduction or verification that semantic/contextual cues have been eliminated. This directly affects the soundness of the performance-collapse measurements across the 26 models.
- [Results and analysis] No error bars, confidence intervals, or statistical significance tests are provided for the reported performance differences between RGB and silhouette conditions, weakening the assertion of a 'consistent, severe' collapse.
minor comments (1)
- [Abstract] The abbreviation 'eg.' in the abstract should be written as 'e.g.' for standard academic formatting.
Simulated Author's Rebuttal
Thank you for your constructive review and for acknowledging the potential value of BareBones as a benchmark for geometric comprehension in VLMs. We agree that the points raised regarding controls for leakage, reproducibility details, and statistical rigor are important and will improve the manuscript. We will prepare a revised version that incorporates additional analyses and clarifications while maintaining the core findings. Our point-by-point responses to the major comments are below.
Point-by-point responses
-
Referee: [Abstract and WTP-Bench construction] The central claim of a 'noise-free geometric taxonomy' and 'Texture Bias Cliff' rests on the assumption that boundary contours from the six datasets contain no exploitable class-specific shape statistics or annotation cues. No controls (novel synthetic shapes, prompt ablations, or human contour-only baselines) are described to rule out pre-training leakage via contour statistics when the zero-shot prompt supplies the class vocabulary.
Authors: We appreciate this observation on the central assumption. The current version does not include novel synthetic shapes or human contour-only baselines. In revision we will add prompt ablations that vary the class vocabulary and template phrasing to quantify any leakage from contour statistics. We will also revise the abstract and introduction to moderate the 'noise-free' phrasing and add a dedicated limitations paragraph discussing potential pre-training exposure to shape priors. We maintain that the consistent collapse across 26 models and six heterogeneous datasets provides supporting evidence for the Texture Bias Cliff, but we accept that the suggested controls would further strengthen the claims. This constitutes a partial revision, as conducting new human baseline studies falls outside the scope of the current work. revision: partial
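One way the promised prompt ablations could be organized is sketched below: the same silhouette is scored under several prompt templates and label-vocabulary conditions to measure how much accuracy depends on the wording that supplies class names. The templates, condition names, and helper functions are illustrative assumptions, not the authors' protocol.

```python
import random

TEMPLATES = [
    "Which of the following does this outline depict: {labels}?",
    "Name the object whose silhouette is shown. Options: {labels}.",
    "This image contains only a boundary contour. Pick exactly one label: {labels}.",
]

def vocabulary_conditions(true_labels, all_labels, seed=0):
    """Yield (condition name, label list) pairs used to vary the class vocabulary."""
    rng = random.Random(seed)
    yield "full", list(all_labels)
    shuffled = list(all_labels)
    rng.shuffle(shuffled)
    yield "shuffled", shuffled
    # Diluted: the true classes plus an equal number of unrelated distractor names.
    distractors = [lab for lab in all_labels if lab not in true_labels]
    k = min(len(true_labels), len(distractors))
    yield "diluted", list(true_labels) + rng.sample(distractors, k)

def ablation_prompts(true_labels, all_labels):
    """Enumerate every (template, vocabulary condition) cell of the ablation grid."""
    for t_idx, template in enumerate(TEMPLATES):
        for cond, labels in vocabulary_conditions(true_labels, all_labels):
            yield (t_idx, cond), template.format(labels=", ".join(labels))
```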
-
Referee: [Evaluation methodology] Exact prompts, the silhouette generation pipeline, and the annotation verification procedure are not reported in sufficient detail to allow independent reproduction or verification that semantic/contextual cues have been eliminated. This directly affects the soundness of the performance-collapse measurements across the 26 models.
Authors: We agree that insufficient methodological detail hinders reproducibility. In the revised manuscript we will provide the exact zero-shot prompts used for every model family, a step-by-step description of the silhouette extraction and post-processing pipeline (including source code references), and the full annotation verification protocol employed to confirm removal of semantic cues. We will also release the complete prompt templates and generation scripts as supplementary material. These additions directly address the concern and will be marked as a full revision. revision: yes
-
Referee: [Results and analysis] No error bars, confidence intervals, or statistical significance tests are provided for the reported performance differences between RGB and silhouette conditions, weakening the assertion of a 'consistent, severe' collapse.
Authors: We acknowledge the absence of statistical support in the submitted version. Because the evaluations are zero-shot, we will recompute results for open-weight models across multiple prompt seeds where stochasticity exists, add error bars (standard deviation or bootstrap confidence intervals) to all bar plots, and include paired statistical tests (e.g., Wilcoxon signed-rank or McNemar tests) comparing RGB versus silhouette accuracy per model. These analyses and updated figures will appear in the revised results section. revision: yes
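For reference, the paired analyses mentioned here could look like the following sketch: an exact McNemar test on per-image correctness under the RGB and silhouette conditions, plus a percentile-bootstrap confidence interval for the accuracy drop. The inputs and helper names are illustrative, not the authors' analysis code.

```python
import math
import numpy as np

def mcnemar_exact(correct_rgb, correct_sil) -> float:
    """Two-sided exact McNemar p-value from paired 0/1 correctness vectors."""
    rgb = np.asarray(correct_rgb, dtype=bool)
    sil = np.asarray(correct_sil, dtype=bool)
    b = int(np.sum(rgb & ~sil))   # right on RGB, wrong on silhouette
    c = int(np.sum(~rgb & sil))   # wrong on RGB, right on silhouette
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    # Under H0 the discordant pairs split 50/50; double the smaller binomial tail.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def bootstrap_drop_ci(correct_rgb, correct_sil, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the drop in accuracy from RGB to silhouette inputs."""
    rng = np.random.default_rng(seed)
    rgb = np.asarray(correct_rgb, dtype=float)
    sil = np.asarray(correct_sil, dtype=float)
    idx = rng.integers(0, len(rgb), size=(n_boot, len(rgb)))
    drops = rgb[idx].mean(axis=1) - sil[idx].mean(axis=1)
    return np.quantile(drops, [alpha / 2, 1 - alpha / 2])
```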
Circularity Check
No circularity: direct empirical measurements on held-out silhouettes
full rationale
The paper introduces BareBones as an empirical benchmark consisting of curated pixel-level silhouettes from existing segmentation datasets plus a new WTP-Bench collection. It evaluates 26 VLMs in zero-shot settings and reports observed performance drops under RGB deprivation. No mathematical derivations, fitted parameters, self-referential equations, or load-bearing self-citations are present. The Texture Bias Cliff is defined as the measured collapse itself, not constructed from any internal definition or prior result by the same authors. All reported numbers are direct observations on the benchmark inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pixel-level silhouettes from the listed sources provide a noise-free geometric taxonomy without semantic leakage.
Reference graph
Works this paper leans on
- [1] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Mar...
- [2] AIDC-AI. Ovis2.5: Structural embedding alignment for multimodal large language model, 2025.
- [3] Anthropic. Claude 4.5 model card, 2025.
- [4] Jinze Bai et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [5] Aaditya Baranwal, Madhav Kataria, Naitik Agrawal, Yogesh S Rawat, and Shruti Vyas. Re:Verse - can your VLM read a manga? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3761–3771, 2025.
- [6] Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, and Bin Xiao. Florence-VL: Enhancing vision-language models with generative vision encoder and depth-breadth fusion, 2024.
- [7] Mark Everingham et al. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 2010.
- [8] Shanghua Gao et al. Large-scale unsupervised semantic segmentation. IEEE TPAMI, 2022.
- [9] Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Muhammad Jehanzeb Mirza, Margret Keuper, and Janis Keuper. Are vision language models texture or shape biased and can we steer them? In MMFM Workshop @ CVPR, 2024.
- [10] Robert Geirhos et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.
- [11] Gemini Team, Google. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [12] Google. PaliGemma: A versatile 3B VLM for transfer, 2024.
- [13] Katherine Hermann et al. The origins and prevalence of texture bias in convolutional neural networks. In NeurIPS, 2020.
- [14] Chao Jia et al. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.
- [15] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [16] Bohao Li et al. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. In CVPR, 2024.
- [17] Jun Hao Liew et al. Deep learning for thin object segmentation. In CVPR, 2021.
- [18] Haotian Liu et al. Visual instruction tuning. In NeurIPS.
- [19] Pan Lu et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024.
- [20] Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299..., 2025.
- [21] Microsoft. Phi-3 vision 128k instruct, 2024.
- [22] OpenAI. GPT-4.1 technical report. arXiv preprint arXiv:2503.12917, 2025.
- [23] OpenGVLab. InternVL2.5 pretrained models, 2024.
- [24] Priyank Pathak, Mukilan Karuppasamy, Aaditya Baranwal, Shyam Marjit, Shruti Vyas, and Yogesh S Rawat. Robust onion: Peeling open vocab object detectors under noise.
- [25] Xuebin Qin et al. Highly accurate dichotomous image segmentation. In ECCV, 2022.
- [26] Qwen Team. Qwen2.5-VL: Enhancing vision-language model's perception of the world at any resolution, 2024.
- [27] Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
- [28] Shengbang Tong et al. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In CVPR, 2024.
- [29] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
- [30] xAI. Grok 4 model card, 2025.