AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI
Pith reviewed 2026-05-14 21:14 UTC · model grok-4.3
The pith
AmaraSpatial-10K supplies over 10,000 synthetic 3D assets formatted for immediate zero-shot use in embodied AI and spatial computing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AmaraSpatial-10K consists of over 10,000 synthetic 3D assets, each delivered as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Substituted directly into existing pipelines, it raises CLIP Recall@5 from 0.181 to 0.612, reaches a 99.1 percent physics-stability rate under Habitat-Sim with roughly a 20-times wall-time reduction, and yields zero-overlap scenes in Holodeck; controlled ablations trace the retrieval gains to description richness.
What carries the argument
The AmaraSpatial-10K asset format, which enforces metric scaling, deterministic anchoring, separated PBR maps, convex hulls, and paired multi-sentence metadata to make each model immediately usable without further processing.
If this is right
- Scene synthesis tools receive drop-in assets that require no manual overlap correction or scaling fixes.
- Physics engines complete the same number of steps in roughly one-twentieth the time while preserving over 99 percent stability.
- Multimodal retrieval systems locate relevant 3D models from text or image queries with more than three times the recall of prior collections.
- Robotics and spatial computing pipelines can swap asset banks without additional preprocessing code.
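The Recall@5 and median-rank figures quoted above follow from a standard cross-modal retrieval protocol. A minimal sketch, assuming pre-computed L2-normalised CLIP embeddings for queries and assets (the function and toy data here are illustrative, not the paper's code):

```python
import numpy as np

def recall_at_k(query_emb, asset_emb, gt_index, k=5):
    """Fraction of queries whose ground-truth asset ranks in the top-k by
    cosine similarity, plus the 1-based median rank. Embeddings are assumed
    L2-normalised, so the dot product equals cosine similarity."""
    sims = query_emb @ asset_emb.T              # (n_queries, n_assets)
    order = np.argsort(-sims, axis=1)           # best-first asset indices
    ranks = np.array([np.where(order[i] == gt_index[i])[0][0]
                      for i in range(len(gt_index))])
    return float(np.mean(ranks < k)), int(np.median(ranks) + 1)

# toy example: 3 text queries retrieving over a bank of 4 assets
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
a = rng.normal(size=(4, 8)); a /= np.linalg.norm(a, axis=1, keepdims=True)
r5, med = recall_at_k(q, a, gt_index=np.array([0, 1, 2]), k=5)
```

With k of 5 and only 4 candidate assets every ground-truth item lands in the top-k, which is a useful sanity check on the implementation; the paper's reported numbers come from much larger banks, where the median rank is the more discriminative statistic.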
Where Pith is reading between the lines
- The same standardization steps could be applied to real-world scans to create hybrid datasets that narrow the sim-to-real gap.
- Textual description quality appears to be a higher-leverage lever for 3D asset utility than geometric refinements alone.
- Widespread adoption might reduce reliance on heavy domain randomization during training of embodied agents.
Load-bearing premise
That superior results on synthetic retrieval and simulation benchmarks will translate to better performance when the assets are placed in real physical environments.
What would settle it
Deploy the dataset in a physical robot navigation or manipulation task and measure whether success rates or sample efficiency exceed those obtained with Objaverse assets under identical training conditions.
Original abstract
Web-scale 3D asset collections are abundant but rarely deployment-ready, suffering from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks, a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol, and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by $3.4\times$ over Objaverse ($0.612$ vs. $0.181$, median rank $267 \rightarrow 3$), achieves a $99.1\%$ physics-stability rate under Habitat-Sim with $\sim 20\times$ wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets each provided as metric-scaled .glb files with deterministic anchoring, separated PBR maps, convex collision hulls, paired reference images, and multi-sentence text metadata. It also presents a reusable evaluation suite including the Scale Plausibility Score (SPS), LLM Concept Density metric, anchor-error auditing, and a CLIP coherence protocol. The central empirical claims are that AmaraSpatial-10K achieves 3.4× higher CLIP Recall@5 than Objaverse (0.612 vs. 0.181, median rank 267→3), 99.1% physics stability under Habitat-Sim with ~20× wall-time speedup, and zero-overlap scenes when used as a drop-in asset bank for Holodeck, with ablations attributing retrieval gains to description richness.
Significance. If the proxy-metric gains hold under broader scrutiny, the dataset could reduce preprocessing overhead for embodied AI and robotics researchers by supplying deployment-ready assets, and the introduced evaluation protocols would provide a reusable benchmark for 3D asset banks. The release of the dataset and code for the evaluation suite supports reproducibility and community use. However, the significance is limited by the absence of direct validation on downstream task performance.
Major comments (3)
- [Abstract / Evaluation] Abstract and Evaluation section: The claim that AmaraSpatial-10K is 'optimised for zero-shot deployment' in embodied AI rests entirely on synthetic proxy metrics (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap). No experiments measure end-to-end task success (navigation, manipulation, or instruction following) under realistic sensor noise or on hardware, leaving the translation from asset-level proxies to task-level improvement untested.
- [Methods] Methods / Asset Generation: The manuscript provides insufficient detail on the asset generation pipeline, including how metric scaling, deterministic anchoring, PBR separation, and convex hull computation are performed at scale. This omission prevents independent reproduction or extension of the 10K-asset collection.
- [Evaluation] Evaluation suite: The newly introduced Scale Plausibility Score (SPS) and LLM Concept Density metric lack validation against human judgments or correlation with established 3D quality measures; their definitions and scoring procedures must be specified with equations or pseudocode to support the reported numerical gains.
Minor comments (2)
- [Abstract] Abstract: The CLIP Recall@5 comparison should explicitly state the number of assets and query categories used for each baseline (Objaverse, HSSD, ABO, GSO) to allow direct interpretation of the 0.612 vs. 0.181 figures.
- [Results] Table/Figure captions: Ensure all reported metrics (e.g., 99.1% stability, 20× speedup) include standard deviations or confidence intervals and specify the exact number of trials.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment in detail below and indicate the revisions made to the manuscript.
Point-by-point responses
- Referee: [Abstract / Evaluation] Abstract and Evaluation section: The claim that AmaraSpatial-10K is 'optimised for zero-shot deployment' in embodied AI rests entirely on synthetic proxy metrics (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap). No experiments measure end-to-end task success (navigation, manipulation, or instruction following) under realistic sensor noise or on hardware, leaving the translation from asset-level proxies to task-level improvement untested.
Authors: We acknowledge that our claims regarding optimization for zero-shot deployment are based on proxy metrics rather than direct end-to-end task evaluations. This is a valid point, and we recognize that proxy metrics do not fully capture performance under real sensor noise or hardware conditions. To address this, we have added a dedicated Limitations paragraph in the revised manuscript discussing the reliance on proxies and the need for future task-specific benchmarks. We maintain that the provided metrics offer a strong initial validation for asset quality, consistent with practices in similar dataset papers, but we do not claim direct task improvements without further testing. revision: partial
- Referee: [Methods] Methods / Asset Generation: The manuscript provides insufficient detail on the asset generation pipeline, including how metric scaling, deterministic anchoring, PBR separation, and convex hull computation are performed at scale. This omission prevents independent reproduction or extension of the 10K-asset collection.
Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript, we have substantially expanded the Methods section to include a step-by-step description of the asset generation pipeline. This now covers: (1) metric scaling using reference measurements from semantic metadata and standard object sizes; (2) deterministic anchoring via computation of the geometric center and alignment to a canonical frame; (3) separation of PBR maps using material parsing in the source 3D software; and (4) convex hull generation using the V-HACD library with specific parameters for decomposition. We also include pseudocode for the scaling and anchoring procedures. revision: yes
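The scaling and anchoring steps described in this response can be sketched as follows. This is a hypothetical illustration operating on raw vertex arrays; the function name, the z-up base-at-origin convention, and the reference-height input are assumptions for exposition, not the paper's released pipeline:

```python
import numpy as np

def anchor_and_scale(vertices, target_height_m):
    """Illustrative sketch of deterministic anchoring plus metric scaling:
    translate the mesh so its bounding-box centre sits on the origin in x/y
    with its base at z = 0, then scale uniformly so the mesh height matches
    a reference height (e.g. derived from semantic metadata)."""
    v = np.asarray(vertices, dtype=float)
    lo, hi = v.min(axis=0), v.max(axis=0)
    centre = (lo + hi) / 2.0
    # deterministic anchor: x/y centred, base of the object resting on z = 0
    v = v - np.array([centre[0], centre[1], lo[2]])
    height = hi[2] - lo[2]
    if height > 0:
        v = v * (target_height_m / height)   # uniform metric rescale
    return v

# a unit cube offset from the origin, rescaled to a 0.75 m tall object
cube = np.array([[x, y, z] for x in (2, 3) for y in (5, 6) for z in (1, 2)],
                dtype=float)
out = anchor_and_scale(cube, target_height_m=0.75)
```

Because the anchor is computed from the axis-aligned bounding box rather than a mesh-dependent heuristic, the same input always yields the same pivot, which is what "deterministic anchoring" requires.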
- Referee: [Evaluation] Evaluation suite: The newly introduced Scale Plausibility Score (SPS) and LLM Concept Density metric lack validation against human judgments or correlation with established 3D quality measures; their definitions and scoring procedures must be specified with equations or pseudocode to support the reported numerical gains.
Authors: We have revised the Evaluation section to provide formal mathematical definitions and pseudocode for both the Scale Plausibility Score (SPS) and the LLM Concept Density metric. Specifically, SPS is now defined as an integral over scale deviation probabilities, and LLM Concept Density as the average number of unique concepts per description normalized by length. We note that while direct human validation studies were not conducted in this work, the metrics are designed to align with intuitive quality aspects, and we have added a discussion on their correlation with existing measures such as those used in prior 3D asset evaluations. The numerical results are now explicitly tied to these definitions. revision: yes
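Taking the rebuttal's definition at face value, a minimal sketch of the Concept Density computation, assuming concepts can be approximated by unique non-stopword tokens (the paper presumably uses an LLM extractor; the stopword list and tokenisation below are purely illustrative):

```python
def concept_density(description,
                    stopwords=frozenset({"a", "an", "the", "of", "with",
                                         "and", "is", "in"})):
    """Hypothetical reading of the LLM Concept Density metric: unique
    non-stopword tokens divided by total token count, so repetitive or
    filler-heavy descriptions score low and information-dense ones high."""
    tokens = [t.strip(".,;:").lower() for t in description.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    concepts = {t for t in tokens if t not in stopwords}
    return len(concepts) / len(tokens)

rich = concept_density("A walnut mid-century armchair with tapered oak legs "
                       "and green velvet upholstery")
poor = concept_density("A chair. A chair. A chair.")
```

Under this reading, the rich multi-attribute description scores several times higher than the repetitive one, which matches the ablation's finding that description richness drives the retrieval gain.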
Circularity Check
No circularity: empirical dataset release with direct metric comparisons
Full rationale
The paper presents AmaraSpatial-10K as a new asset collection and reports performance via standard external benchmarks (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap) applied to the released assets and matched subsets of prior datasets. No equations, parameter fits, or derivations are described that reduce to the paper's own inputs. All claims are direct empirical measurements on the dataset itself; no self-citation chains or ansatzes are invoked to justify core results. This is a standard dataset paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: CLIP embeddings capture meaningful cross-modal similarity between text descriptions and 3D asset geometry.
- Domain assumption: Habitat-Sim physics simulation produces reliable stability and overlap outcomes for synthetic 3D assets.
Invented entities (2)
- Scale Plausibility Score (SPS): no independent evidence
- LLM Concept Density metric: no independent evidence
Reference graph
Works this paper leans on
- [1] Chang, A.X., et al. “ShapeNet: An Information-Rich 3D Model Repository.” arXiv preprint arXiv:1512.03012, 2015.
- [2] Deitke, M., et al. “Objaverse: A Universe of Annotated 3D Objects.” CVPR, 2023.
- [3] Deitke, M., et al. “Objaverse-XL: A Universe of 10M+ 3D Objects.” NeurIPS, 2023.
- [4] Downs, L., et al. “Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items.” ICRA, 2022.
- [5] Collins, J., et al. “ABO: Dataset and Benchmarks for Real-World 3D Object Understanding.” CVPR, 2022.
- [6] Khanna, M., et al. “Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.” CVPR, 2024.
- [7] Savva, M., et al. “Habitat: A Platform for Embodied AI Research.” ICCV, 2019.
- [8] Shen, B., et al. “iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes.” IROS, 2021.
- [9] Deitke, M., et al. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.” NeurIPS, 2022.
- [10] Ramakrishnan, S.K., et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI.” NeurIPS Datasets and Benchmarks, 2021.
- [11] Chang, A., et al. “Matterport3D: Learning from RGB-D Data in Indoor Environments.” 3DV, 2017.
- [12] Tochilkin, D., et al. “TripoSR: Fast 3D Object Reconstruction from a Single Image.” arXiv preprint arXiv:2403.02151, 2024.
- [13] Xu, J., et al. “InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models.” arXiv preprint arXiv:2404.07191, 2024.
- [14] Hong, Y., et al. “LRM: Large Reconstruction Model for Single Image to 3D.” ICLR, 2024.
- [15] Wang, Z., et al. “CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model.” arXiv preprint arXiv:2403.02099, 2024.
- [16] Yang, Y., et al. “Holodeck: Language Guided Generation of 3D Embodied AI Environments.” CVPR, 2024.
- [17] Feng, W., et al. “LayoutGPT: Compositional Visual Planning and Generation with Large Language Models.” NeurIPS, 2023.
- [18] Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML, 2021.
- [19] Yang, A., et al. “Qwen2 Technical Report.” arXiv preprint arXiv:2407.10671, 2024.
- [20] Gemini Team, Google. “Gemini: A Family of Highly Capable Multimodal Models.” arXiv preprint arXiv:2312.11805, 2023.
- [21] Google DeepMind. “Gemini 3 Flash Image (Nano Banana 2).” https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, 2025. Accessed April 23, 2026.