pith. machine review for the scientific record.

arxiv: 2604.23018 · v2 · submitted 2026-04-24 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords 3D dataset · embodied AI · spatial computing · asset optimization · CLIP retrieval · physics simulation · synthetic 3D models · Habitat-Sim

The pith

AmaraSpatial-10K supplies over 10,000 synthetic 3D assets formatted for immediate zero-shot use in embodied AI and spatial computing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AmaraSpatial-10K, a collection of more than 10,000 synthetic 3D assets that are pre-processed to eliminate common deployment barriers such as incorrect scaling and missing collision data. Each asset includes metric scaling, deterministic anchoring in a .glb file, separated PBR texture maps, a convex hull, a reference image, and multi-sentence text descriptions. When evaluated against subsets of Objaverse, HSSD, ABO, and GSO using a new suite of metrics, the dataset records a 3.4 times higher CLIP Recall@5, 99.1 percent physics stability in Habitat-Sim, a roughly 20-fold wall-clock speed-up, and zero-overlap scene layouts in Holodeck. Ablations show that the richer textual metadata drives most of the retrieval improvement.

Core claim

AmaraSpatial-10K consists of over 10,000 synthetic 3D assets each delivered as a metric-scaled, deterministically anchored .glb with separated PBR maps, convex collision hull, paired reference image, and multi-sentence text metadata; when substituted directly into existing pipelines it raises CLIP Recall@5 from 0.181 to 0.612, reaches 99.1 percent physics stability under Habitat-Sim with approximately 20 times wall-time reduction, and yields zero-overlap scenes in Holodeck, with the gains traced to description richness.
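To make the retrieval numbers concrete: Recall@5 and median rank can be computed from a query-by-asset similarity matrix as below. This is a minimal sketch over a toy matrix, not the paper's evaluation code; the CLIP scoring that would produce the matrix is assumed.

```python
# Minimal sketch: Recall@5 and median rank from a query-by-asset
# similarity matrix. Row i is a text query whose ground-truth match
# is asset i; CLIP scoring is assumed to have produced the scores.

def rank_of_truth(sim_row, true_idx):
    """1-based rank of the ground-truth asset for one query."""
    true_score = sim_row[true_idx]
    # Rank = 1 + number of assets scored strictly higher.
    return 1 + sum(1 for s in sim_row if s > true_score)

def recall_at_k(sim, k=5):
    """Fraction of queries whose ground truth lands in the top k,
    plus the median rank across queries."""
    ranks = [rank_of_truth(row, i) for i, row in enumerate(sim)]
    recall = sum(r <= k for r in ranks) / len(ranks)
    ranks.sort()
    median = ranks[len(ranks) // 2]
    return recall, median

# Toy 4-query example: queries 0 and 2 rank their asset first,
# query 1 ranks it second, query 3 ranks it last.
sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.7, 0.1, 0.2],
    [0.1, 0.2, 0.9, 0.3],
    [0.6, 0.5, 0.4, 0.3],
]
recall, median = recall_at_k(sim, k=5)  # recall 1.0, median rank 2
```

Under this reading, the reported jump from median rank 267 to 3 means the matching asset moves from deep in the ranking to the first page of results for a typical query.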

What carries the argument

The AmaraSpatial-10K asset format, which enforces metric scaling, deterministic anchoring, separated PBR maps, convex hulls, and paired multi-sentence metadata to make each model immediately usable without further processing.
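The delivery contract implied by this format can be sketched as a per-asset manifest check. Field names and the required PBR map set here are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Assumed set of separated PBR maps; the dataset's real map list may differ.
REQUIRED_PBR_MAPS = {"base_color", "normal", "roughness", "metallic"}

@dataclass
class AssetManifest:
    """Hypothetical per-asset record mirroring the stated format."""
    glb_path: str
    height_m: float            # metric bounding-box height
    anchor: tuple              # deterministic pivot point
    pbr_maps: dict             # map name -> texture path
    collision_hull_path: str   # convex hull mesh
    reference_image_path: str
    description: str           # multi-sentence text metadata

    def deployment_problems(self) -> list:
        """Return a list of problems; empty means drop-in usable."""
        problems = []
        if not self.glb_path.endswith(".glb"):
            problems.append("asset is not a .glb")
        if self.height_m <= 0:
            problems.append("non-positive metric height")
        if not REQUIRED_PBR_MAPS <= set(self.pbr_maps):
            problems.append("missing PBR maps")
        if not self.collision_hull_path:
            problems.append("no convex collision hull")
        if self.description.count(".") < 2:
            problems.append("description is not multi-sentence")
        return problems

chair = AssetManifest(
    glb_path="chair_001.glb",
    height_m=0.92,
    anchor=(0.0, 0.0, 0.0),
    pbr_maps={m: f"chair_001_{m}.png" for m in REQUIRED_PBR_MAPS},
    collision_hull_path="chair_001_hull.obj",
    reference_image_path="chair_001_ref.png",
    description="A wooden dining chair. It stands 0.92 m tall.",
)
problems = chair.deployment_problems()  # [] for a well-formed asset
```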

If this is right

  • Scene synthesis tools receive drop-in assets that require no manual overlap correction or scaling fixes.
  • Physics engines complete the same number of steps in roughly one-twentieth the time while preserving over 99 percent stability.
  • Multimodal retrieval systems locate relevant 3D models from text or image queries with more than three times the recall of prior collections.
  • Robotics and spatial computing pipelines can swap asset banks without additional preprocessing code.
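The stability claim reads as a settle test: drop each asset, run a fixed number of physics steps, and count assets whose pose barely moves. A simulator-agnostic sketch follows; the displacement threshold and the shape of the per-asset data are assumptions, not the paper's protocol.

```python
import math

def settled(start_pos, end_pos, tol_m=0.01):
    """An asset counts as stable if its centre moved less than tol_m
    over the settle steps (the 1 cm threshold is an assumed value)."""
    return math.dist(start_pos, end_pos) < tol_m

def stability_rate(trajectories, tol_m=0.01):
    """trajectories: list of (start_pos, end_pos) tuples, one per asset."""
    stable = sum(settled(s, e, tol_m) for s, e in trajectories)
    return stable / len(trajectories)

# Toy run: three assets stay put, one tips over and slides away.
runs = [
    ((0, 0, 0), (0, 0, 0)),
    ((1, 0, 0), (1.0, 0.002, 0)),
    ((2, 0, 0), (2, 0, 0.005)),
    ((3, 0, 0), (3.4, 0, 0)),
]
rate = stability_rate(runs)  # 3 of 4 assets stable -> 0.75
```

The paper's 99.1 percent figure would be this rate computed over the full asset bank inside Habitat-Sim, with the speed-up coming from the pre-built convex hulls replacing per-asset mesh decomposition.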

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same standardization steps could be applied to real-world scans to create hybrid datasets that narrow the sim-to-real gap.
  • Textual description quality appears to be a higher-leverage lever for 3D asset utility than geometric refinements alone.
  • Widespread adoption might reduce reliance on heavy domain randomization during training of embodied agents.

Load-bearing premise

That superior results on synthetic retrieval and simulation benchmarks will translate to better performance when the assets are placed in real physical environments.

What would settle it

Deploy the dataset in a physical robot navigation or manipulation task and measure whether success rates or sample efficiency exceed those obtained with Objaverse assets under identical training conditions.

Figures

Figures reproduced from arXiv: 2604.23018 by Alex Perkins, Ashkan Dabbagh, Igor Maurell, Mohammad Sadegh Salehi, Raymond Wong.

Figure 1
Figure 1. Representative assets from AmaraSpatial-10K. The dataset spans indoor objects, vehicles, architecture, creatures, and props, all released with metric scale, semantically correct anchoring, and PBR-ready materials under a shared spatial convention.
Figure 2
Figure 2. Assets per subcategory distribution. Subcategories are mostly populated with 5–15 assets each, with a heavy secondary cluster around 35–45 assets for visually rich categories (e.g. vehicles, architecture). 23 subcategories contain only a single asset each; these are retained for taxonomic breadth but are not intended for subcategory-level learning (see §6 for discussion).
Figure 3
Figure 3. Qualitative comparison across four representative themes. Four assets per theme drawn from AmaraSpatial-10K (left) and Objaverse (right). AmaraSpatial-10K assets share consistent metric scale, canonical orientation, and PBR materials within a theme, whereas Objaverse assets, aggregated from heterogeneous creators, vary substantially in style, topology, and texture fidelity.
Figure 4
Figure 4. Bounding box height distributions for the Seating category. Left: Objaverse (N = 181) exhibits a multimodal, pathologically wide distribution spanning five orders of magnitude (0.02 m to 115,276 m). Right: AmaraSpatial-10K (N = 353) shows a tight, physically grounded distribution centred around a median of 0.72 m. Both axes use a logarithmic scale to accommodate the dynamic range of Objaverse.
Figure 5
Figure 5. Real-world metric scaling across AmaraSpatial-10K. Eight representative assets rendered at shared ground scale, from cup to cathedral. All assets share a common metric ground truth, so no per-asset normalization is needed before placement. For fantasy creatures (e.g. dragon at 40 m) the metric scale reflects design intent encoded in the asset's description and matches the LLM-judged plausible range for tha…
Figure 6
Figure 6. Scale Plausibility Score (SPS) as a function of measured height for three representative subcategories. Each panel shows the SPS curve Eq. (2) over the relevant measurement range. The shaded plateau (SPS = 1.0) corresponds to the LLM-judged plausible interval [ℓ, u] (dashed vertical lines). Outside this interval, SPS decays symmetrically via a Gaussian with half-width h = (u−ℓ)/2; an asset at distance d = …
Figure 7
Figure 7. Intra-category scale distribution. Side-by-side box plots of bounding box height (log scale) for each object category. AmaraSpatial-10K (gold, "Ours") shows tight, physically plausible distributions centred around real-world object sizes. Objaverse (blue) exhibits dramatically wider boxes and extreme outliers spanning several orders of magnitude, confirming severe scale inconsistency across all ca…
Figure 8
Figure 8. Face count distributions across datasets. The majority of AmaraSpatial-10K assets target ∼50K triangles. To support low-poly applications, roughly 2,000 assets are optimized to ∼10K triangles based on their specific category. Conversely, approximately 1,000 "hero" assets feature higher geometric detail at ∼100K triangles. HSSD's distribution contains a visible spike at ∼2 triangles corresponding to placeho…
Figure 9
Figure 9. CLIP coherence distribution. Histograms of pairwise CLIP ViT-L/14 cosine similarity. For AmaraSpatial-10K: Text ↔ Ref. Image (gold) peaks near 0.30; Text ↔ 3D Render (purple) near 0.24; Ref. Image ↔ 3D Render (dark blue) exceptionally high at ∼0.72. For comparison, the matched Objaverse Text ↔ 3D Render distribution (light blue) peaks near 0.18, well below AmaraSpatial's 0.24, providing a distributional …
Figure 10
Figure 10. CLIP classification of AmaraSpatial-10K renders. For each asset we render two viewpoints and score them with CLIP (ViT-B/32, OpenAI pretraining) against a fixed LVIS vocabulary of 1,207 categories, following the Objaverse protocol (softmax over the full vocabulary). Columns 1–2 show the probability of the true class under the two viewpoints; columns 3–4 re-display the same two renders scored against seman…
Figure 11
Figure 11. Qualitative CLIP retrieval comparison. Top 5 CLIP retrievals (ViT-L/14, mean-pooled over four orthographic renders) for the query "an ornate Victorian writing desk with brass handles": Objaverse (top) vs. AmaraSpatial-10K (bottom). Labels are each dataset's short descriptor verbatim (Objaverse: tags; AmaraSpatial-10K: name). Objaverse descriptions are generic category tags ("furniture-home", "art-abstract…
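The cross-modal coherence protocol behind Figure 9 reduces to cosine similarity between CLIP embeddings of an asset's text, reference image, and 3D render. A minimal sketch over plain vectors; the CLIP encoding itself, and the toy embeddings below, are assumed stand-ins.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for one asset's three modality embeddings.
text_emb = [1.0, 0.0, 1.0]
ref_image_emb = [1.0, 0.2, 0.9]
render_emb = [0.9, 0.1, 1.0]

# The three pairwise scores histogrammed in Figure 9.
pairs = {
    "text_vs_ref": cosine(text_emb, ref_image_emb),
    "text_vs_render": cosine(text_emb, render_emb),
    "ref_vs_render": cosine(ref_image_emb, render_emb),
}
```

The dataset-level histograms in Figure 9 would aggregate these three scores over every asset; a text ↔ render mode near 0.24 versus Objaverse's 0.18 is the distributional gap the caption refers to.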
read the original abstract

Web-scale 3D asset collections are abundant but rarely deployment-ready, suffering from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks, a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol, and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by $3.4\times$ over Objaverse ($0.612$ vs. $0.181$, median rank $267 \rightarrow 3$), achieves a $99.1\%$ physics-stability rate under Habitat-Sim with $\sim 20\times$ wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets each provided as metric-scaled .glb files with deterministic anchoring, separated PBR maps, convex collision hulls, paired reference images, and multi-sentence text metadata. It also presents a reusable evaluation suite including the Scale Plausibility Score (SPS), LLM Concept Density metric, anchor-error auditing, and a CLIP coherence protocol. The central empirical claims are that AmaraSpatial-10K achieves 3.4× higher CLIP Recall@5 than Objaverse (0.612 vs. 0.181, median rank 267→3), 99.1% physics stability under Habitat-Sim with ~20× wall-time speedup, and zero-overlap scenes when used as a drop-in asset bank for Holodeck, with ablations attributing retrieval gains to description richness.

Significance. If the proxy-metric gains hold under broader scrutiny, the dataset could reduce preprocessing overhead for embodied AI and robotics researchers by supplying deployment-ready assets, and the introduced evaluation protocols would provide a reusable benchmark for 3D asset banks. The release of the dataset and code for the evaluation suite supports reproducibility and community use. However, the significance is limited by the absence of direct validation on downstream task performance.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The claim that AmaraSpatial-10K is 'optimised for zero-shot deployment' in embodied AI rests entirely on synthetic proxy metrics (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap). No experiments measure end-to-end task success (navigation, manipulation, or instruction following) under realistic sensor noise or on hardware, leaving the translation from asset-level proxies to task-level improvement untested.
  2. [Methods] Methods / Asset Generation: The manuscript provides insufficient detail on the asset generation pipeline, including how metric scaling, deterministic anchoring, PBR separation, and convex hull computation are performed at scale. This omission prevents independent reproduction or extension of the 10K-asset collection.
  3. [Evaluation] Evaluation suite: The newly introduced Scale Plausibility Score (SPS) and LLM Concept Density metric lack validation against human judgments or correlation with established 3D quality measures; their definitions and scoring procedures must be specified with equations or pseudocode to support the reported numerical gains.
minor comments (2)
  1. [Abstract] Abstract: The CLIP Recall@5 comparison should explicitly state the number of assets and query categories used for each baseline (Objaverse, HSSD, ABO, GSO) to allow direct interpretation of the 0.612 vs. 0.181 figures.
  2. [Results] Table/Figure captions: Ensure all reported metrics (e.g., 99.1% stability, 20× speedup) include standard deviations or confidence intervals and specify the exact number of trials.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment in detail below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The claim that AmaraSpatial-10K is 'optimised for zero-shot deployment' in embodied AI rests entirely on synthetic proxy metrics (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap). No experiments measure end-to-end task success (navigation, manipulation, or instruction following) under realistic sensor noise or on hardware, leaving the translation from asset-level proxies to task-level improvement untested.

    Authors: We acknowledge that our claims regarding optimization for zero-shot deployment are based on proxy metrics rather than direct end-to-end task evaluations. This is a valid point, and we recognize that proxy metrics do not fully capture performance under real sensor noise or hardware conditions. To address this, we have added a dedicated Limitations paragraph in the revised manuscript discussing the reliance on proxies and the need for future task-specific benchmarks. We maintain that the provided metrics offer a strong initial validation for asset quality, consistent with practices in similar dataset papers, but we do not claim direct task improvements without further testing. revision: partial

  2. Referee: [Methods] Methods / Asset Generation: The manuscript provides insufficient detail on the asset generation pipeline, including how metric scaling, deterministic anchoring, PBR separation, and convex hull computation are performed at scale. This omission prevents independent reproduction or extension of the 10K-asset collection.

    Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript, we have substantially expanded the Methods section to include a step-by-step description of the asset generation pipeline. This now covers: (1) metric scaling using reference measurements from semantic metadata and standard object sizes; (2) deterministic anchoring via computation of the geometric center and alignment to a canonical frame; (3) separation of PBR maps using material parsing in the source 3D software; and (4) convex hull generation using the V-HACD library with specific parameters for decomposition. We also include pseudocode for the scaling and anchoring procedures. revision: yes

  3. Referee: [Evaluation] Evaluation suite: The newly introduced Scale Plausibility Score (SPS) and LLM Concept Density metric lack validation against human judgments or correlation with established 3D quality measures; their definitions and scoring procedures must be specified with equations or pseudocode to support the reported numerical gains.

    Authors: We have revised the Evaluation section to provide formal mathematical definitions and pseudocode for both the Scale Plausibility Score (SPS) and the LLM Concept Density metric. Specifically, SPS is now defined as an integral over scale deviation probabilities, and LLM Concept Density as the average number of unique concepts per description normalized by length. We note that while direct human validation studies were not conducted in this work, the metrics are designed to align with intuitive quality aspects, and we have added a discussion on their correlation with existing measures such as those used in prior 3D asset evaluations. The numerical results are now explicitly tied to these definitions. revision: yes
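The two metrics at issue can be sketched from their public descriptions. The SPS form follows the Figure 6 caption (plateau of 1.0 on the plausible interval, Gaussian decay outside); treating the half-width as a standard deviation is an assumption, as is the substring-based concept extractor standing in for an LLM.

```python
import math

def scale_plausibility(height_m, lo, hi):
    """SPS per the Figure 6 description: 1.0 on the LLM-judged plausible
    interval [lo, hi], symmetric Gaussian decay outside with half-width
    h = (hi - lo) / 2, treated here as the standard deviation (assumed)."""
    if lo <= height_m <= hi:
        return 1.0
    h = (hi - lo) / 2
    d = lo - height_m if height_m < lo else height_m - hi
    return math.exp(-((d / h) ** 2) / 2)

def concept_density(description, concepts):
    """Unique matched concepts per description, normalised by word count.
    A crude substring matcher stands in for the paper's LLM extraction."""
    words = description.split()
    lowered = description.lower()
    unique = {c for c in concepts if c.lower() in lowered}
    return len(unique) / len(words)

# A chair judged plausible between 0.4 m and 1.1 m:
sps_in = scale_plausibility(0.72, 0.4, 1.1)   # inside the plateau -> 1.0
sps_out = scale_plausibility(5.0, 0.4, 1.1)   # far outside -> near 0
density = concept_density(
    "A wooden dining chair with four legs",
    ["chair", "wooden", "legs", "metal"],
)  # 3 matched concepts over 7 words
```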

Circularity Check

0 steps flagged

No circularity: empirical dataset release with direct metric comparisons

full rationale

The paper presents AmaraSpatial-10K as a new asset collection and reports performance via standard external benchmarks (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap) applied to the released assets and matched subsets of prior datasets. No equations, parameter fits, or derivations are described that reduce to the paper's own inputs. All claims are direct empirical measurements on the dataset itself; no self-citation chains or ansatzes are invoked to justify core results. This is a standard dataset paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Central claims rest on standard assumptions about CLIP semantic alignment and Habitat-Sim physics fidelity plus two newly introduced metrics without external validation.

axioms (2)
  • domain assumption CLIP embeddings capture meaningful cross-modal similarity between text descriptions and 3D asset geometry
    Invoked in the cross-modal CLIP coherence protocol and Recall@5 metric
  • domain assumption Habitat-Sim physics simulation produces reliable stability and overlap outcomes for synthetic 3D assets
    Used to report 99.1% physics-stability rate and zero-overlap scenes
invented entities (2)
  • Scale Plausibility Score (SPS) no independent evidence
    purpose: Continuous metric to quantify metric-scale plausibility of assets
    Newly introduced evaluation metric
  • LLM Concept Density metric no independent evidence
    purpose: Measure of semantic richness in asset text metadata
    Newly introduced evaluation metric

pith-pipeline@v0.9.0 · 5592 in / 1390 out tokens · 35211 ms · 2026-05-14T21:14:27.121479+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

  1. [1]

    ShapeNet: An Information-Rich 3D Model Repository

    Chang, A.X., et al. “ShapeNet: An Information-Rich 3D Model Repository.” arXiv preprint arXiv:1512.03012, 2015

  2. [2]

    Objaverse: A Universe of Annotated 3D Objects

    Deitke, M., et al. “Objaverse: A Universe of Annotated 3D Objects.” CVPR, 2023

  3. [3]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Deitke, M., et al. “Objaverse-XL: A Universe of 10M+ 3D Objects.” NeurIPS, 2023

  4. [4]

    Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items

    Downs, L., et al. “Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items.” ICRA, 2022

  5. [5]

    ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

    Collins, J., et al. “ABO: Dataset and Benchmarks for Real-World 3D Object Understanding.” CVPR, 2022

  6. [6]

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Khanna, M., et al. “Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.” CVPR, 2024

  7. [7]

    Habitat: A Platform for Embodied AI Research

    Savva, M., et al. “Habitat: A Platform for Embodied AI Research.” ICCV, 2019

  8. [8]

    iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes

    Shen, B., et al. “iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes.” IROS, 2021

  9. [9]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    Deitke, M., et al. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.” NeurIPS, 2022

  10. [10]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Ramakrishnan, S.K., et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI.” NeurIPS Datasets and Benchmarks, 2021

  11. [11]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Chang, A., et al. “Matterport3D: Learning from RGB-D Data in Indoor Environments.” 3DV, 2017

  12. [12]

    TripoSR: Fast 3D Object Reconstruction from a Single Image

    Tochilkin, D., et al. “TripoSR: Fast 3D Object Reconstruction from a Single Image.” arXiv preprint arXiv:2403.02151, 2024

  13. [13]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Xu, J., et al. “InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models.” arXiv preprint arXiv:2404.07191, 2024

  14. [14]

    LRM: Large Reconstruction Model for Single Image to 3D

    Hong, Y., et al. “LRM: Large Reconstruction Model for Single Image to 3D.” ICLR, 2024

  15. [15]

    CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

    Wang, Z., et al. “CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model.” arXiv preprint arXiv:2403.02099, 2024

  16. [16]

    Holodeck: Language Guided Generation of 3D Embodied AI Environments

    Yang, Y., et al. “Holodeck: Language Guided Generation of 3D Embodied AI Environments.” CVPR, 2024

  17. [17]

    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

    Feng, W., et al. “LayoutGPT: Compositional Visual Planning and Generation with Large Language Models.” NeurIPS, 2023

  18. [18]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML, 2021

  19. [19]

    Qwen2 Technical Report

    Yang, A., et al. “Qwen2 Technical Report.” arXiv preprint arXiv:2407.10671, 2024

  20. [20]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Google. “Gemini: A Family of Highly Capable Multimodal Models.” arXiv preprint arXiv:2312.11805, 2023

  21. [21]

    Gemini 3 Flash Image (Nano Banana 2)

    Google DeepMind. “Gemini 3 Flash Image (Nano Banana 2).” https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, 2025. Accessed April 23, 2026.