pith. sign in

arxiv: 2605.20549 · v1 · pith:MRCMDD2Nnew · submitted 2026-05-19 · 💻 cs.CV

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

Pith reviewed 2026-05-21 06:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic datasetvision model probing3D scene parameterssensitivity analysiscamera viewpointmodel robustnessCNN transformer comparisoncontrolled rendering
0
0 comments X

The pith

Camera distance and elevation dominate recognition failure across vision models in controlled 3D scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAPS, a collection of 2,618 validated photorealistic 3D meshes together with a Blender rendering pipeline that generates images by varying nine scene factors independently. Researchers apply regression-based sensitivity analysis to twenty convolutional and transformer models and establish that camera distance and elevation account for the largest share of recognition errors irrespective of each model's ImageNet accuracy. The full pattern of sensitivities across all factors groups modern CNNs and transformers together while separating them from older architectures. This indicates that fine-grained design choices shape how models respond to 3D scene variation more than the coarse CNN-versus-transformer distinction.

Core claim

Using the MAPS rendering pipeline to produce images under continuous, independent control of nine scene factors and then fitting regressions from those factors to model prediction errors, the work shows a near-universal failure axis in which camera distance and elevation dominate regardless of ImageNet accuracy, while the overall sensitivity structure places modern CNNs and transformers in one cluster distinct from older models.

What carries the argument

MAPS dataset of 2,618 curated 3D meshes and its Blender-based rendering pipeline that enables independent continuous variation of nine scene factors for regression-based sensitivity analysis of model outputs.

If this is right

  • Camera distance and elevation explain most recognition failures independent of a model's standard benchmark accuracy.
  • Sensitivity profiles are more similar between recent CNNs and transformers than between modern and older architectures.
  • Fine-grained architectural choices are stronger determinants of sensitivity to 3D scene parameters than the broad CNN-transformer category.
  • The MAPS pipeline permits precise attribution of model behavior to individual scene factors rather than entangled real-world variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data or optimization procedures may be converging toward similar handling of viewpoint changes in recent models.
  • Explicit augmentation with wide ranges of camera distance and elevation during training could reduce the observed failures.
  • Extending the same controlled rendering to additional factors such as object pose or material properties would likely expose further systematic sensitivities.
  • Direct comparison of the same models on matched real-world photographs would test whether the synthetic dominance of camera parameters generalizes.

Load-bearing premise

The curated 3D meshes are recognizable to humans across the target classes and the rendering pipeline produces image variations that are free of unintended artifacts that would systematically affect model predictions.

What would settle it

Repeating the regression analysis on images generated with the same pipeline but a different set of models and finding that camera distance and elevation no longer show the highest coefficients for prediction error.

Figures

Figures reproduced from arXiv: 2605.20549 by Gemma Roig, Maren Wehrheim, Martina G. Vilas, Matthias Kaschube, Pamela Osuna-Vargas, Santiago Galella.

Figure 1
Figure 1. Figure 1: MAPS dataset. (a) We collected 2,618 3D meshes from Sketchfab, encompassing 560 classes from ImageNet-1k. (b) Scene parameters. We additionally provide a flexible on-demand rendering pipeline implemented in Blender, to allow the rendering of 2D images with different scene parameters. (c) Application. MAPS can be used to study representations in pretrained vision models. We find that model confidence is hig… view at source ↗
Figure 2
Figure 2. Figure 2: Estimating the semantic diversity of MAPS. (a) Semantic structure. Hierarchical clustering of the 560 MAPS classes based on pairwise Wu–Palmer similarity between their WordNet synsets, using Ward’s linkage. Cutting the dendrogram at 20 clusters yields semantically coherent groups, labeled by the lowest common WordNet hypernym of their cluster members. (b) Repre￾sentative cluster exemplars. The medoid of ea… view at source ↗
Figure 3
Figure 3. Figure 3: Scene-factor sensitivity analysis. (a) Methodology pipeline. For each mesh, 5,000 scene configurations are drawn via Latin hypercube sampling over the nine-dimensional scene-parameter space (left). Each configuration is rendered through the MAPS API (center). For each model-mesh pair, we fit a regression predicting the decision’s margin from the scene parameters (right). (b) Average top-1 accuracy heatmap … view at source ↗
Figure 4
Figure 4. Figure 4: Scene-factor sensitivity for linear and polynomial regression. For each model-mesh pair, we fit two regression models predicting the decision margin from the nine scene parameters: (a) a linear model, and (b) a polynomial model with linear, quadratic (self), and pairwise interaction terms. Coefficients are averaged across all the evaluated meshes, obtaining one coefficient profile per model. (a, left) Mean… view at source ↗
read the original abstract

Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MAPS, a synthetic dataset of 2,618 curated photorealistic 3D meshes spanning 560 ImageNet classes, paired with a Blender rendering pipeline that enables controlled, on-demand image generation by independently varying nine scene factors (background, camera, and lighting). The authors apply regression-based sensitivity analysis to 20 convolutional and transformer vision models and report a near-universal failure mode in which camera distance and elevation dominate recognition errors irrespective of ImageNet accuracy; they further observe that modern CNNs and transformers form a distinct sensitivity cluster separate from older architectures.

Significance. If the mesh validation and rendering controls are shown to be robust, MAPS would constitute a useful instrument for attributing model failures to specific 3D scene parameters in a scalable and extensible manner, addressing limitations of existing 2D or entangled robustness benchmarks. The empirical observation of a shared camera-parameter sensitivity axis across architectures and the reported clustering by fine-grained design choices could inform targeted robustness improvements, provided the sensitivity scores are free of rendering confounds.

major comments (3)
  1. [Abstract / Dataset Construction] Abstract and Dataset section: the claim that the 2,618 meshes are 'validated for recognizability across 560 ImageNet classes' is load-bearing for the central attribution result, yet no procedure, human-study protocol, accuracy threshold, or verification that validation occurred at default camera settings is supplied. Without this, regression coefficients on camera distance and elevation risk being inflated by low base-image quality rather than the manipulated factors.
  2. [Sensitivity Analysis] Sensitivity Analysis section: the regression-based sensitivity analysis is presented without details on the regression model, error handling, multicollinearity diagnostics, or explicit verification that the nine factors vary independently. These omissions directly affect interpretability of the 'near-universal failure axis' and the architectural clustering claims.
  3. [Results] Results section: the reported clustering of modern CNNs/transformers versus older architectures is described qualitatively; quantitative measures of cluster separation or statistical tests confirming that the distinction is not driven by rendering artifacts correlated with distance/elevation are absent, weakening the claim that fine-grained design choices are the stronger determinant.
minor comments (2)
  1. [Rendering Pipeline] The abstract states that the pipeline is 'extensible to other factors' but provides no concrete example or interface description; a short code snippet or API outline would improve usability.
  2. [Figures] Figure captions and axis labels in the sensitivity plots should explicitly state the regression target (e.g., accuracy drop or logit change) and the exact set of models included in each cluster.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below. Where the comments identify areas requiring greater clarity or additional documentation, we have revised the manuscript accordingly to strengthen the rigor and interpretability of our methods and results.

read point-by-point responses
  1. Referee: [Abstract / Dataset Construction] Abstract and Dataset section: the claim that the 2,618 meshes are 'validated for recognizability across 560 ImageNet classes' is load-bearing for the central attribution result, yet no procedure, human-study protocol, accuracy threshold, or verification that validation occurred at default camera settings is supplied. Without this, regression coefficients on camera distance and elevation risk being inflated by low base-image quality rather than the manipulated factors.

    Authors: We agree that explicit documentation of the mesh validation procedure is essential to support the attribution results. In the revised manuscript, we have added a dedicated subsection under Dataset Construction that details the human validation protocol: meshes were evaluated via a crowdsourced study with five independent annotators per mesh on a platform equivalent to Amazon Mechanical Turk; a mesh was retained only if at least four annotators correctly recognized the object category at the default camera settings (distance 2.5 m, elevation 0°). This threshold and default-setting verification are now stated explicitly, confirming that base-image quality is established independently of the nine manipulated factors and thereby supporting the validity of the subsequent regression coefficients. revision: yes

  2. Referee: [Sensitivity Analysis] Sensitivity Analysis section: the regression-based sensitivity analysis is presented without details on the regression model, error handling, multicollinearity diagnostics, or explicit verification that the nine factors vary independently. These omissions directly affect interpretability of the 'near-universal failure axis' and the architectural clustering claims.

    Authors: We appreciate the referee's emphasis on methodological transparency. The revised Sensitivity Analysis section now specifies that we fit ordinary least-squares linear regressions with standardized coefficients for each model and factor combination. Robust standard errors (HC3) are used to handle potential heteroscedasticity. Multicollinearity diagnostics show variance inflation factors below 2.5 for all nine predictors, indicating negligible collinearity. Independence of the factors is guaranteed by the rendering pipeline design: each parameter is sampled uniformly and independently from its continuous range in Blender, with no engineered correlations. These additions directly bolster the interpretability of the reported failure axis and clustering. revision: yes

  3. Referee: [Results] Results section: the reported clustering of modern CNNs/transformers versus older architectures is described qualitatively; quantitative measures of cluster separation or statistical tests confirming that the distinction is not driven by rendering artifacts correlated with distance/elevation are absent, weakening the claim that fine-grained design choices are the stronger determinant.

    Authors: We acknowledge that the original presentation of the clustering was primarily qualitative. In the revised Results section we now report a hierarchical clustering (Ward linkage, Euclidean distance) of the 20 sensitivity profiles together with a silhouette score of 0.61, indicating reasonable separation. A permutation test (1,000 iterations) comparing the observed cluster separation against randomly reassigned architecture labels yields p < 0.01. To rule out rendering confounds, we additionally fit a partial regression controlling for distance and elevation; the modern-versus-older distinction remains statistically significant after this control. These quantitative and statistical elements strengthen the claim that fine-grained design choices drive the observed sensitivity structure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical dataset construction and sensitivity analysis are self-contained

full rationale

The paper describes dataset curation of 2,618 meshes, a Blender rendering pipeline varying nine scene factors, and regression-based sensitivity analysis on 20 external vision models. No equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on empirical measurements against independent models and benchmarks, with no load-bearing self-citation chains or ansatz smuggling. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and evaluation paper with no mathematical derivations, fitted constants, or postulated entities; the central claims rest on the curation process and rendering pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5784 in / 1137 out tokens · 50613 ms · 2026-05-21T06:27:11.542952+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 6 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644, 2016

  2. [2]

    Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen

    Michael A. Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, and Anh Nguyen. Strike (With) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4840–4849, Long Beach, CA, USA, 2019. IEEE

  3. [3]

    Deep convolu- tional networks do not classify based on global object shape.PLoS computational biology, 14(12):e1006613, 2018

    Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J Kellman. Deep convolu- tional networks do not classify based on global object shape.PLoS computational biology, 14(12):e1006613, 2018

  4. [4]

    Local features and global shape information in object classification by deep convolutional neural networks.Vision research, 172:46–61, 2020

    Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J Kellman. Local features and global shape information in object classification by deep convolutional neural networks.Vision research, 172:46–61, 2020. 10

  5. [5]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.Advances in neural information processing systems, 32, 2019

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.Advances in neural information processing systems, 32, 2019

  6. [6]

    Network dissec- tion: Quantifying interpretability of deep visual representations

    David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissec- tion: Quantifying interpretability of deep visual representations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017

  7. [7]

    Recognition in terra incognita

    Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. InProceedings of the European conference on computer vision (ECCV), pages 456–473, 2018

  8. [8]

    Mrd: Using physically based differentiable rendering to probe vision models for 3d scene understanding.arXiv preprint arXiv:2512.12307, 2025

    Benjamin Beilharz and Thomas SA Wallis. Mrd: Using physically based differentiable rendering to probe vision models for 3d scene understanding.arXiv preprint arXiv:2512.12307, 2025

  9. [9]

    Blender Foundation, 2025

    Blender Online Community.Blender – a 3D modelling and rendering package. Blender Foundation, 2025

  10. [10]

    Pug: Photorealistic and semantically controllable synthetic data for representation learning.Advances in Neural Information Processing Systems, 36:45020–45054, 2023

    Florian Bordes, Shashank Shekhar, Mark Ibrahim, Diane Bouchacourt, Pascal Vincent, and Ari Morcos. Pug: Photorealistic and semantically controllable synthetic data for representation learning.Advances in Neural Information Processing Systems, 36:45020–45054, 2023

  11. [11]

    ilab-20m: A large-scale controlled object dataset to investigate deep learning

    Ali Borji, Saeed Izadi, and Laurent Itti. ilab-20m: A large-scale controlled object dataset to investigate deep learning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2221–2230, 2016

  12. [12]

    3d shapes dataset

    Chris Burgess and Hyunjik Kim. 3d shapes dataset. https://github.com/deepmind/3dshapes- dataset/, 2018

  13. [13]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository.arXiv preprint arXiv:1512.03012, 2015

  14. [14]

    Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36:35799–35813, 2023

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36:35799–35813, 2023

  15. [15]

    Untangling invariant object recognition.Trends in cognitive sciences, 11(8):333–341, 2007

    James J DiCarlo and David D Cox. Untangling invariant object recognition.Trends in cognitive sciences, 11(8):333–341, 2007

  16. [16]

    How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

    James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?Neuron, 73(3):415–434, 2012

  17. [17]

    Viewfool: Evaluating the robustness of visual recognition to adversarial viewpoints.Advances in neural information processing systems, 35:36789–36803, 2022

    Yinpeng Dong, Shouwei Ruan, Hang Su, Caixin Kang, Xingxing Wei, and Jun Zhu. Viewfool: Evaluating the robustness of visual recognition to adversarial viewpoints.Advances in neural information processing systems, 35:36789–36803, 2022

  18. [18]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  19. [19]

    Self-supervised learning of split invariant equivariant representations.arXiv preprint arXiv:2302.10283, 2023

    Quentin Garrido, Laurent Najman, and Yann Lecun. Self-supervised learning of split invariant equivariant representations.arXiv preprint arXiv:2302.10283, 2023

  20. [20]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut Learning in Deep Neural Networks.Nature Machine Intelligence, 2:665–673, 2020

  21. [21]

    Partial success in closing the gap between human and machine vision.Advances in Neural Information Processing Systems, 34:23885– 23899, 2021

    Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Partial success in closing the gap between human and machine vision.Advances in Neural Information Processing Systems, 34:23885– 23899, 2021. 11

  22. [22]

    Wichmann, and Wieland Brendel

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. InInternational Conference on Learning Representations, 2019

  23. [23]

    On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset

    Muhammad Waleed Gondal, Manuel Wuthrich, Djordje Miladinovic, Francesco Locatello, Martin Breidt, Valentin V olchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and Stefan Bauer. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and ...

  24. [24]

    Con- cept attribution: Explaining cnn decisions to physicians.Computers in biology and medicine, 123:103865, 2020

    Mara Graziani, Vincent Andrearczyk, Stéphane Marchand-Maillet, and Henning Müller. Con- cept attribution: Explaining cnn decisions to physicians.Computers in biology and medicine, 123:103865, 2020

  25. [25]

    Regression concept vectors for bidirectional explanations in histopathology

    Mara Graziani, Vincent Andrearczyk, and Henning Müller. Regression concept vectors for bidirectional explanations in histopathology. InInternational Workshop on Machine Learning in Clinical Neuroimaging, pages 124–132. Springer, 2018

  26. [26]

    Completely derandomized self-adaptation in evolu- tion strategies.Evolutionary computation, 9(2):159–195, 2001

    Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolu- tion strategies.Evolutionary computation, 9(2):159–195, 2001

  27. [27]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  28. [28]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021

  29. [29]

    Benchmarking neural network robustness to common corruptions and perturbations.Proceedings of the International Conference on Learning Representations, 2019

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations.Proceedings of the International Conference on Learning Representations, 2019

  30. [30]

    The origins and prevalence of texture bias in convolutional neural networks.Advances in neural information processing systems, 33:19000–19015, 2020

    Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks.Advances in neural information processing systems, 33:19000–19015, 2020

  31. [31]

    Beyond accuracy: What matters in designing well-behaved models?arXiv preprint arXiv:2503.17110, 2025

    Robin Hesse, Do˘gukan Ba˘gcı, Bernt Schiele, Simone Schaub-Meyer, and Stefan Roth. Beyond accuracy: What matters in designing well-behaved models?arXiv preprint arXiv:2503.17110, 2025

  32. [32]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017

  33. [33]

    Mitsuba 3 renderer,

    Wenzel Jakob, Sébastien Speierer, Nicolas Roussel, Merlin Nimier-David, Delio Vicini, Tizian Zeltner, Baptiste Nicolet, Miguel Crespo, Vincent Leroy, and Ziyi Zhang. Mitsuba 3 renderer,

  34. [34]

    https://mitsuba-renderer.org

  35. [35]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  36. [36]

    Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

    Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). InInternational conference on machine learning, pages 2668–2677. PMLR, 2018

  37. [37]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012. 12

  38. [38]

    Learning methods for generic object recognition with invariance to pose and lighting

    Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. InProceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2, pages II–104. IEEE, 2004

  39. [39]

    Imagenet-e: Benchmarking neural network robustness via attribute editing

    Xiaodan Li, Yuefeng Chen, Yao Zhu, Shuhui Wang, Rong Zhang, and Hui Xue. Imagenet-e: Benchmarking neural network robustness via attribute editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20371–20381, 2023

  40. [40]

    The topology and geometry of neural representations

    Baihan Lin and Nikolaus Kriegeskorte. The topology and geometry of neural representations. Proceedings of the National Academy of Sciences, 121(42):e2317881121, 2024

  41. [41]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  42. [42]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

  43. [43]

    Visual object recognition.Annual review of neuroscience, 19:577–621, 1996

    Nikos K Logothetis and David L Sheinberg. Visual object recognition.Annual review of neuroscience, 19:577–621, 1996

  44. [44]

    Imagenet3d: Towards general-purpose object-level 3d understanding.Advances in Neural Information Processing Systems, 37:96127–96149, 2024

    Wufei Ma, Guofeng Zhang, Qihao Liu, Guanning Zeng, Adam Kortylewski, Yaoyao Liu, and Alan Yuille. Imagenet3d: Towards general-purpose object-level 3d understanding.Advances in Neural Information Processing Systems, 37:96127–96149, 2024

  45. [45]

    dsprites: Disentangle- ment testing sprites dataset

    Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentangle- ment testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017

  46. [46]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

  47. [47]

    A comparison of three methods for selecting values of input variables in the analysis of output from a computer code.Technometrics, pages 239–245, 1979

    MD McKay, RJ Beckman, and WJ Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code.Technometrics, pages 239–245, 1979

  48. [48]

    An ecologically motivated image dataset for deep learning yields better models of human vision.Proceedings of the National Academy of Sciences, 118(8):e2011417118, 2021

    Johannes Mehrer, Courtney J Spoerer, Emer C Jones, Nikolaus Kriegeskorte, and Tim C Kietzmann. An ecologically motivated image dataset for deep learning yields better models of human vision.Proceedings of the National Academy of Sciences, 118(8):e2011417118, 2021

  49. [49]

    Exploring corruption robustness: Inductive biases in vision transformers and mlp-mixers.arXiv preprint arXiv:2106.13122, 2021

    Katelyn Morrison, Benjamin Gilby, Colton Lipchak, Adam Mattioli, and Adriana Kovashka. Exploring corruption robustness: Inductive biases in vision transformers and mlp-mixers.arXiv preprint arXiv:2106.13122, 2021

  50. [50]

    Intriguing properties of vision transformers

    Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021

  51. [51]

    Comparing state-of-the-art visual features on invariant object recognition tasks

    Nicolas Pinto, Youssef Barhomi, David D Cox, and James J DiCarlo. Comparing state-of-the-art visual features on invariant object recognition tasks. In2011 IEEE workshop on Applications of computer vision (WACV), pages 463–470. IEEE, 2011

  52. [52]

    why should i trust you?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016

  53. [53]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014

  54. [54]

    Salient imagenet: How to discover spurious features in deep learning?arXiv preprint arXiv:2110.04301, 2021

    Sahil Singla and Soheil Feizi. Salient imagenet: How to discover spurious features in deep learning?arXiv preprint arXiv:2110.04301, 2021

  55. [55]

    The sketchfab 3d creative commons collection (s3d3c).arXiv preprint arXiv:2407.17205, 2024

    Florian Spiess, Raphael Waltenspül, and Heiko Schuldt. The sketchfab 3d creative commons collection (s3d3c).arXiv preprint arXiv:2407.17205, 2024. 13

  56. [56]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

  57. [57]

    Re- thinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Re- thinking the inception architecture for computer vision. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

  58. [58]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. InInternational conference on machine learning, pages 6105–6114. PMLR, 2019

  59. [59]

    Torchvision: Pytorch’s computer vision library

    TorchVision. Torchvision: Pytorch’s computer vision library. https://github.com/ pytorch/vision, 2016

  60. [60]

    Unbiased look at dataset bias

    Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. InCVPR 2011, pages 1521–1528. IEEE, 2011

  61. [61]

    Are convolutional neural networks or transformers more like human vision?

    Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L Griffiths. Are convolutional neural networks or transformers more like human vision?arXiv preprint arXiv:2105.07197, 2021

  62. [62]

    Adversarial robustness in discontinuous spaces via alternating sampling & descent

    Rahul Venkatesh, Eric Wong, and Zico Kolter. Adversarial robustness in discontinuous spaces via alternating sampling & descent. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4662–4671, 2023

  63. [63]

    Hierarchical grouping to optimize an objective function.Journal of the American statistical association, 58(301):236–244, 1963

    Joe H Ward Jr. Hierarchical grouping to optimize an objective function.Journal of the American statistical association, 58(301):236–244, 1963

  64. [64]

    Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation

    Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023

  65. [65]

    Verb semantics and lexical selection

    Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In32nd annual meeting of the association for computational linguistics, pages 133–138, 1994

  66. [66]

    Noise or signal: The role of image backgrounds in object recognition.Proceedings of the International Conference on Learning Representations, 2021

    Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition.Proceedings of the International Conference on Learning Representations, 2021

  67. [67]

    ImageNet-D: Benchmarking neural network robustness on diffusion synthetic object

    Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, and Chengzhi Mao. ImageNet-D: Benchmarking neural network robustness on diffusion synthetic object. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21752–21762, 2024

  68. [68]

    Object Recognition with and without Objects

    Zhuotun Zhu, Lingxi Xie, and Alan L Yuille. Object recognition with and without objects. arXiv preprint arXiv:1611.06596, 2016

  69. [69]

    Contrastive learning inverts the data generating process

    Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. InInternational conference on machine learning, pages 12979–12990. PMLR, 2021. 14 Appendix A1 Mesh curation details The dataset is curated through a fully scripted, modular pipeline that proceeds from a hand-...

  70. [70]

    Provenance and Li- censing Ensure legal compliance and asset au- thenticity Validate source list, harvest metadata, fil- ter for redistributable/human-authored as- sets

  71. [71]

    Acquisition and In- tegrity Secure high-quality raw data Download approved meshes, verify file integrity

  72. [72]

    Mesh-level Curation Curate asset quality and structural com- position Manual inspection, drop/keep filtering, mesh sub-component splitting

  73. [73]

    no-redistribution

    Scene Construction Generate standardized dataset outputs Alignment, normalization, manual correc- tion, asset and render export Table A1: Overview of the four curation phases of the MAPS dataset. A1.1 Provenance and licensing Source list assembly.We started from the complete set of 1000 ImageNet classes and manually searched Sketchfab for each one in turn...

  74. [74]

    Isolate and Inspect: We evaluate each component mesh in isolation to verify its quality and relevance to the target category

  75. [75]

    Categorize Components: We assign each mesh a status of keep, remove, or split. The split operation is used to decouple target objects from environmental geometry or to separate multiple instances of the same category (e.g., several chairs in one scene) into unique dataset entries

  76. [76]

    Identify Failures: We flag assets that fail to meet quality standards for manual replacement or structural correction. The decisions JSON is then consumed by a glTF-transform processor that materializes the choices: it removes meshes marked as background, emits one output per “split” group while preserving any meshes flagged keep across all groups, and co...

  77. [77]

    We sort the world axes by descending AABB extent to assign(long, sides, up)

  78. [78]

    We resolve the up-axis sign by a gravity prior: the third moment of vertex projections along the candidate up-axis must be negative (heavier end down)

  79. [79]

    We enforce a right-handed frame on the sides axis

  80. [80]

    For each category, we set the first processed asset as the class anchor. Subsequent assets within the same category compare a one-dimensional mass-profile histogram along the long axis to the anchor and apply a 180◦ flip when the mirrored profile fits better. This produces consistent front/back orientation across instances of a class. Centering and scale ...