pith. machine review for the scientific record.

arxiv: 2604.10528 · v3 · submitted 2026-04-12 · 💻 cs.CV

Recognition: unknown

BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · zero-shot evaluation · geometric understanding · texture bias · silhouette benchmark · multimodal AI · VLM robustness

The pith

Current vision-language models lack genuine geometric comprehension and instead rely on texture and contextual shortcuts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BareBones, a new benchmark that tests VLMs using only pixel-level silhouettes to isolate pure geometric understanding from texture or context. Evaluations across 26 models show a sharp and consistent drop in performance when color and texture are removed. This indicates that apparent zero-shot capabilities in these models are largely due to statistical patterns in RGB data rather than structural insight. The work provides a new standard for measuring true geometric grounding in multimodal AI.

Core claim

By depriving VLMs of RGB information and presenting only boundary contours from a noise-free geometric taxonomy, the benchmark reveals a universal 'Texture Bias Cliff' in which even the most advanced models, such as GPT-4.1 and Claude Sonnet 4.5, fail to identify shapes from geometry alone.

What carries the argument

The WTP-Bench collection and overall BareBones benchmark, which curates pixel-level silhouettes from segmentation datasets to create fine-grained geometric puzzles that force reliance on shape alone.
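
For concreteness, the data-preparation step this machinery relies on (see Figure 2) amounts to collapsing a segmentation mask into a pure black-and-white silhouette. Below is a minimal sketch of that step, assuming masks stored as grayscale image files; the function, file names, and threshold are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of the mask-to-silhouette binarization step (illustrative;
# not the authors' released pipeline). File names and threshold are assumptions.
import numpy as np
from PIL import Image

def binarize_mask(mask_path: str, threshold: int = 127) -> Image.Image:
    """Collapse a segmentation mask into a pure black/white silhouette.

    All color, texture, and background cues are discarded: foreground pixels
    become white (255), everything else black (0).
    """
    mask = np.array(Image.open(mask_path).convert("L"))  # grayscale mask
    silhouette = np.where(mask > threshold, 255, 0).astype(np.uint8)
    return Image.fromarray(silhouette)

if __name__ == "__main__":
    # Hypothetical file names; BareBones curates masks from sources such as
    # ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200, and WTP-Bench.
    binarize_mask("example_mask.png").save("example_silhouette.png")
```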

If this is right

  • Advancements in VLMs will require methods that explicitly model geometric structure rather than texture patterns.
  • Tasks involving precise spatial or shape-based reasoning may remain unreliable for current models.
  • New training approaches could use silhouette data to reduce texture dependence.
  • Evaluation protocols for multimodal models should include RGB-deprived tests to assess genuine comprehension.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar texture biases might affect performance in other zero-shot tasks like object detection or scene understanding.
  • Developing models that can handle abstract shape recognition could improve generalization to novel environments.
  • Researchers could test if fine-tuning on these benchmarks improves overall robustness.

Load-bearing premise

That the pixel-level silhouettes and the taxonomy in WTP-Bench provide no unintended semantic or contextual information that models could exploit beyond pure geometry.
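
One way to stress this premise would be a contour-statistics control: if a shallow classifier on hand-crafted shape features can recover class labels from the silhouettes, the contours themselves carry exploitable statistics that a VLM could match against pre-training priors without any deeper geometric comprehension. The sketch below uses Hu moments and a linear probe; it is an illustrative control, not an experiment reported in the paper.

```python
# Hedged sketch of a contour-statistics control (not an experiment in the
# paper): can shallow shape descriptors alone predict the class label?
# High linear-probe accuracy would mean the silhouettes carry exploitable
# statistics beyond what "pure geometric comprehension" requires.
import cv2
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def hu_features(silhouette: np.ndarray) -> np.ndarray:
    """Log-scaled Hu moments of a binary silhouette (classic shape descriptors)."""
    moments = cv2.moments(silhouette, binaryImage=True)
    hu = cv2.HuMoments(moments).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)

def leakage_score(silhouettes: list, labels: list) -> float:
    """Cross-validated accuracy of a linear probe on shape features only."""
    features = np.stack([hu_features(s) for s in silhouettes])
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, features, np.array(labels), cv=5).mean()
```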

What would settle it

Demonstrating that any current or future VLM achieves accuracy on the silhouette benchmark close to its performance on the original RGB images would challenge the existence of a universal texture bias.
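
In practice that test is a paired, per-model comparison of exact-match accuracy in the two input modes. A minimal sketch follows, assuming a caller-supplied `query` function that sends an image and a prompt to the model under test; the prompt wording and answer normalization are assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the RGB-vs-silhouette comparison described above. `query`
# is any caller-supplied function that sends an image path and a prompt to a
# VLM and returns its text answer; prompt wording and normalization are
# assumptions, not the paper's exact protocol.
from typing import Callable, Dict, List

PROMPT = "Name the single object shown. Answer with the name only."

def exact_match(prediction: str, label: str) -> bool:
    return prediction.strip().lower() == label.strip().lower()

def accuracy(query: Callable[[str, str], str],
             items: List[Dict[str, str]],
             mode: str) -> float:
    """Zero-shot top-1 exact-match accuracy for one input mode.

    Each item is expected to hold image paths under the keys 'rgb' and
    'silhouette', plus a gold 'label'.
    """
    hits = sum(exact_match(query(item[mode], PROMPT), item["label"]) for item in items)
    return hits / len(items)

def texture_bias_gap(query, items) -> float:
    """Accuracy lost when texture is removed; a large positive gap is the cliff."""
    return accuracy(query, items, "rgb") - accuracy(query, items, "silhouette")
```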

Figures

Figures reproduced from arXiv: 2604.10528 by Aaditya Baranwal, Abhishek Rajora, Vishal Yadav.

Figure 1
Figure 1. RGB vs. Silhouette-only Recognition Across Benchmark Subsets. Representative image–silhouette pairs from four datasets evaluated against four VLMs under two conditions: standard RGB input and silhouette-only inputs. Check marks (✓) and crosses (×) denote correct and incorrect top-1 predictions, respectively. Aggregate scores (rightmost columns) reveal a consistent and severe drop in identification accuracy… view at source ↗
Figure 2
Figure 2. Evaluation pipeline: HQ images and their segmentation masks are binarized to pure structural silhouettes, eliminating all color, texture, and background cues. Models are queried in two modes (HQ and Silhouette), and exact-match accuracy is recorded. view at source ↗
Figure 3
Figure 3. Semantic Collapse: Performance Plummets Without Texture. 16 open-weight models across all 5 datasets. Left: high-quality (texture) performance. Right: silhouette (shape-only) performance. Sweep of red from left to right maps the Texture Bias Cliff. (Plot axes: Model Parameter Count (Billions) vs. Exact Match Silhouette Accuracy (%).) view at source ↗
Figure 5
Figure 5. Widespread Statistical Hallucinations. When stripped of RGB patterns, models disregard vision encoder inputs and default to pretraining priors (e.g. afghan hound, american crow) or outright refuse the prompt tens of thousands of times. view at source ↗
Figure 6
Figure 6. view at source ↗
Figure 7
Figure 7. Dataset Mosaic. Qualitative zero-shot evaluation samples from the five repurposed segmentation datasets. Each column shows the HQ image (top) and its binarized silhouette (bottom), illustrating the diversity in boundary complexity from fine-grained biological forms (CUB-200) to rigid man-made structures (DIS5K, ThinObject5K). view at source ↗
Figure 8
Figure 8. WTP-Bench Qualitative Pairs. High-quality artwork (left) vs. pure structural silhouette (right) for a representative selection of targets spanning all generational tiers, highlighting the full spectrum of geometric difficulty: simple quadrupedal outlines, intricate multi-appendage designs, and amorphous Gigantamax forms. (Panel categories: Animals/Nature, Vehicles/Transport, Indoor Objects, Birds (Ornithology), Thin/Wire Structu…) view at source ↗
Figure 9
Figure 9. Macro-Category Vulnerabilities. Average performance across primary visual domains on WTP-Bench. Fine-grained biological targets (Birds, Animals) collapse to near-zero on silhouettes, while coarse object categories (Vehicles, Indoor) degrade more gracefully. view at source ↗
Figure 10
Figure 10. Typological and Morphological Breakdown. Left: accuracy stratified by Pokémon elemental type. Ghost-, Bug-, and Dragon-type silhouettes (amorphous geometry) are consistently hardest. Right: accuracy by morphological form. Alternate forms (Mega, Gigantamax) incur the steepest silhouette drops as added geometric complexity compounds the baseline difficulty. view at source ↗
Figure 11
Figure 11. Pre-training Distribution Bias. Both proprietary and open-weight models achieve their highest exact matches on Generation 1 targets, dropping sharply on modern targets with identical structural complexity. The effect mirrors the human familiarity gradient: dedicated fans achieve ∼70% on Gen 1 silhouettes but near single digits on later generations, driven by differential exposure rather than sha… view at source ↗
Figure 12
Figure 12. Generation 1 Over-representation in Failures. When models fail on a silhouette, they hallucinate a Generation 1 target at a rate far exceeding the 19.6% baseline prevalence. Open-weight silhouette errors by generation (wrong guesses · % of errors · dataset %): Gen 1: 13,959 · 79.2% · 19.6%; Gen 2: 1,156 · 6.6% · 9.6%; Gen 3: 613 · 3.5% · 14.2%; Gen 4: 868 · 4.9% · 11.2%; Gen 5: 371 · 2.1% · 15.9%; Gen 6: 372 · 2.1% · 9.4%; Gen 7: 174 · 1.0% · 9.0%; Gen 8: 108 · 0.6% · 11.1%. view at source ↗
read the original abstract

While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce $\textbf{BareBones}$, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (eg. GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the $\textit{Texture Bias Cliff}$. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding. Project Page: https://eternal-f1ame.github.io/WTP-Bench/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces BareBones, a zero-shot benchmark using pixel-level silhouettes curated from ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200 and a new flagship WTP-Bench collection. It evaluates 26 proprietary and open-weight VLMs and reports a consistent, severe performance collapse when RGB texture is removed, which the authors term the Texture Bias Cliff, to argue that current models lack genuine geometric comprehension.

Significance. If the silhouettes and WTP-Bench taxonomy truly isolate geometric structure without semantic or class-prior leakage, the documented universal structural blindspots would provide a valuable, falsifiable yardstick for measuring progress toward geometric grounding in VLMs.

major comments (3)
  1. [Abstract and WTP-Bench construction] The central claim of a 'noise-free geometric taxonomy' and 'Texture Bias Cliff' rests on the assumption that boundary contours from the six datasets contain no exploitable class-specific shape statistics or annotation cues. No controls (novel synthetic shapes, prompt ablations, or human contour-only baselines) are described to rule out pre-training leakage via contour statistics when the zero-shot prompt supplies the class vocabulary.
  2. [Evaluation methodology] Exact prompts, the silhouette generation pipeline, and the annotation verification procedure are not reported in sufficient detail to allow independent reproduction or verification that semantic/contextual cues have been eliminated. This directly affects the soundness of the performance-collapse measurements across the 26 models.
  3. [Results and analysis] No error bars, confidence intervals, or statistical significance tests are provided for the reported performance differences between RGB and silhouette conditions, weakening the assertion of a 'consistent, severe' collapse.
minor comments (1)
  1. [Abstract] The abbreviation 'eg.' in the abstract should be written as 'e.g.' for standard academic formatting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your constructive review and for acknowledging the potential value of BareBones as a benchmark for geometric comprehension in VLMs. We agree that the points raised regarding controls for leakage, reproducibility details, and statistical rigor are important and will improve the manuscript. We will prepare a revised version that incorporates additional analyses and clarifications while maintaining the core findings. Our point-by-point responses to the major comments are below.

read point-by-point responses
  1. Referee: [Abstract and WTP-Bench construction] The central claim of a 'noise-free geometric taxonomy' and 'Texture Bias Cliff' rests on the assumption that boundary contours from the six datasets contain no exploitable class-specific shape statistics or annotation cues. No controls (novel synthetic shapes, prompt ablations, or human contour-only baselines) are described to rule out pre-training leakage via contour statistics when the zero-shot prompt supplies the class vocabulary.

    Authors: We appreciate this observation on the central assumption. The current version does not include novel synthetic shapes or human contour-only baselines. In revision we will add prompt ablations that vary the class vocabulary and template phrasing to quantify any leakage from contour statistics. We will also revise the abstract and introduction to moderate the 'noise-free' phrasing and add a dedicated limitations paragraph discussing potential pre-training exposure to shape priors. We maintain that the consistent collapse across 26 models and six heterogeneous datasets provides supporting evidence for the Texture Bias Cliff, but we accept that the suggested controls would further strengthen the claims. This constitutes a partial revision, as conducting new human baseline studies falls outside the scope of the current work. revision: partial

  2. Referee: [Evaluation methodology] Exact prompts, the silhouette generation pipeline, and the annotation verification procedure are not reported in sufficient detail to allow independent reproduction or verification that semantic/contextual cues have been eliminated. This directly affects the soundness of the performance-collapse measurements across the 26 models.

    Authors: We agree that insufficient methodological detail hinders reproducibility. In the revised manuscript we will provide the exact zero-shot prompts used for every model family, a step-by-step description of the silhouette extraction and post-processing pipeline (including source code references), and the full annotation verification protocol employed to confirm removal of semantic cues. We will also release the complete prompt templates and generation scripts as supplementary material. These additions directly address the concern and will be marked as a full revision. revision: yes

  3. Referee: [Results and analysis] No error bars, confidence intervals, or statistical significance tests are provided for the reported performance differences between RGB and silhouette conditions, weakening the assertion of a 'consistent, severe' collapse.

    Authors: We acknowledge the absence of statistical support in the submitted version. Because the evaluations are zero-shot, we will recompute results for open-weight models across multiple prompt seeds where stochasticity exists, add error bars (standard deviation or bootstrap confidence intervals) to all bar plots, and include paired statistical tests (e.g., Wilcoxon signed-rank or McNemar tests) comparing RGB versus silhouette accuracy per model. These analyses and updated figures will appear in the revised results section. revision: yes
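
For concreteness, a paired per-item test of the kind proposed in this response can be run as an exact McNemar test on the discordant predictions; the sketch below is generic and is not code from the paper.

```python
# Hedged sketch of an exact McNemar test on paired per-item correctness
# (RGB vs. silhouette) for one model. Generic; not code from the paper.
from scipy.stats import binomtest

def mcnemar_exact(rgb_correct: list, sil_correct: list) -> float:
    """Two-sided p-value for a paired accuracy difference on the same items.

    Only discordant items matter: those answered correctly in exactly one
    condition. Under the null of no condition effect, each discordant item
    is equally likely to favor either condition.
    """
    b = sum(r and not s for r, s in zip(rgb_correct, sil_correct))  # RGB-only correct
    c = sum(s and not r for r, s in zip(rgb_correct, sil_correct))  # silhouette-only correct
    discordant = b + c
    if discordant == 0:
        return 1.0  # no discordant items, no evidence of a difference
    return binomtest(b, discordant, 0.5).pvalue
```

Because the same items are scored under both conditions, only the items whose outcome flips between conditions carry information; bootstrap resampling over items would supply the per-model confidence intervals mentioned in the same response.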

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on held-out silhouettes

full rationale

The paper introduces BareBones as an empirical benchmark consisting of curated pixel-level silhouettes from existing segmentation datasets plus a new WTP-Bench collection. It evaluates 26 VLMs in zero-shot settings and reports observed performance drops under RGB deprivation. No mathematical derivations, fitted parameters, self-referential equations, or load-bearing self-citations are present. The Texture Bias Cliff is defined as the measured collapse itself, not constructed from any internal definition or prior result by the same authors. All reported numbers are direct observations on the benchmark inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that boundary contours alone constitute a sufficient and uncontaminated signal for geometric comprehension; no free parameters are fitted, no new physical entities are postulated, and no mathematical axioms beyond standard image-processing definitions are invoked.

axioms (1)
  • Domain assumption: Pixel-level silhouettes from the listed sources provide a noise-free geometric taxonomy without semantic leakage.
    Invoked in the abstract when describing curation of 'noise-free geometric taxonomy' and 'extreme, fine-grained visual puzzle'.

pith-pipeline@v0.9.0 · 5559 in / 1337 out tokens · 30349 ms · 2026-05-10T16:15:29.386337+00:00 · methodology

discussion (0)

