pith. machine review for the scientific record.

arxiv: 2604.23018 · v2 · submitted 2026-04-24 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords 3D dataset · embodied AI · spatial computing · asset optimization · CLIP retrieval · physics simulation · synthetic 3D models · Habitat-Sim

The pith

AmaraSpatial-10K supplies over 10,000 synthetic 3D assets formatted for immediate zero-shot use in embodied AI and spatial computing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AmaraSpatial-10K, a collection of more than 10,000 synthetic 3D assets that are pre-processed to eliminate common deployment barriers such as incorrect scaling and missing collision data. Each asset includes metric scaling, deterministic anchoring in a .glb file, separated PBR texture maps, a convex hull, a reference image, and multi-sentence text descriptions. When evaluated against subsets of Objaverse, HSSD, ABO, and GSO using a new suite of metrics, the dataset records a 3.4 times higher CLIP Recall@5, 99.1 percent physics stability in Habitat-Sim, a roughly 20-fold wall-clock speed-up, and zero-overlap scene layouts in Holodeck. Ablations show that the richer textual metadata drives most of the retrieval improvement.

Core claim

AmaraSpatial-10K consists of over 10,000 synthetic 3D assets each delivered as a metric-scaled, deterministically anchored .glb with separated PBR maps, convex collision hull, paired reference image, and multi-sentence text metadata; when substituted directly into existing pipelines it raises CLIP Recall@5 from 0.181 to 0.612, reaches 99.1 percent physics stability under Habitat-Sim with approximately 20 times wall-time reduction, and yields zero-overlap scenes in Holodeck, with the gains traced to description richness.
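To make the retrieval numbers concrete: Recall@5 and median rank can be computed from a query-by-asset similarity matrix as below. This is a minimal sketch over a toy matrix, not the paper's evaluation code; the CLIP scoring that would produce the matrix is assumed.

```python
# Minimal sketch: Recall@5 and median rank from a query-by-asset
# similarity matrix. Row i is a text query whose ground-truth match
# is asset i; CLIP scoring is assumed to have produced the scores.

def rank_of_truth(sim_row, true_idx):
    """1-based rank of the ground-truth asset for one query."""
    true_score = sim_row[true_idx]
    # Rank = 1 + number of assets scored strictly higher.
    return 1 + sum(1 for s in sim_row if s > true_score)

def recall_at_k(sim, k=5):
    """Fraction of queries whose ground truth lands in the top k,
    plus the median rank across queries."""
    ranks = [rank_of_truth(row, i) for i, row in enumerate(sim)]
    recall = sum(r <= k for r in ranks) / len(ranks)
    ranks.sort()
    median = ranks[len(ranks) // 2]
    return recall, median

# Toy 4-query example: queries 0 and 2 rank their asset first,
# query 1 ranks it second, query 3 ranks it last.
sim = [
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.7, 0.1, 0.2],
    [0.1, 0.2, 0.9, 0.3],
    [0.6, 0.5, 0.4, 0.3],
]
recall, median = recall_at_k(sim, k=5)  # recall 1.0, median rank 2
```

Under this reading, the reported jump from median rank 267 to 3 means the matching asset moves from deep in the ranking to the first page of results for a typical query.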

What carries the argument

The AmaraSpatial-10K asset format, which enforces metric scaling, deterministic anchoring, separated PBR maps, convex hulls, and paired multi-sentence metadata to make each model immediately usable without further processing.
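The delivery contract implied by this format can be sketched as a per-asset manifest check. Field names and the required PBR map set here are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Assumed set of separated PBR maps; the dataset's real map list may differ.
REQUIRED_PBR_MAPS = {"base_color", "normal", "roughness", "metallic"}

@dataclass
class AssetManifest:
    """Hypothetical per-asset record mirroring the stated format."""
    glb_path: str
    height_m: float            # metric bounding-box height
    anchor: tuple              # deterministic pivot point
    pbr_maps: dict             # map name -> texture path
    collision_hull_path: str   # convex hull mesh
    reference_image_path: str
    description: str           # multi-sentence text metadata

    def deployment_problems(self) -> list:
        """Return a list of problems; empty means drop-in usable."""
        problems = []
        if not self.glb_path.endswith(".glb"):
            problems.append("asset is not a .glb")
        if self.height_m <= 0:
            problems.append("non-positive metric height")
        if not REQUIRED_PBR_MAPS <= set(self.pbr_maps):
            problems.append("missing PBR maps")
        if not self.collision_hull_path:
            problems.append("no convex collision hull")
        if self.description.count(".") < 2:
            problems.append("description is not multi-sentence")
        return problems

chair = AssetManifest(
    glb_path="chair_001.glb",
    height_m=0.92,
    anchor=(0.0, 0.0, 0.0),
    pbr_maps={m: f"chair_001_{m}.png" for m in REQUIRED_PBR_MAPS},
    collision_hull_path="chair_001_hull.obj",
    reference_image_path="chair_001_ref.png",
    description="A wooden dining chair. It stands 0.92 m tall.",
)
problems = chair.deployment_problems()  # [] for a well-formed asset
```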

If this is right

  • Scene synthesis tools receive drop-in assets that require no manual overlap correction or scaling fixes.
  • Physics engines complete the same number of steps in roughly one-twentieth the time while preserving over 99 percent stability.
  • Multimodal retrieval systems locate relevant 3D models from text or image queries with more than three times the recall of prior collections.
  • Robotics and spatial computing pipelines can swap asset banks without additional preprocessing code.
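The stability claim reads as a settle test: drop each asset, run a fixed number of physics steps, and count assets whose pose barely moves. A simulator-agnostic sketch follows; the displacement threshold and the shape of the per-asset data are assumptions, not the paper's protocol.

```python
import math

def settled(start_pos, end_pos, tol_m=0.01):
    """An asset counts as stable if its centre moved less than tol_m
    over the settle steps (the 1 cm threshold is an assumed value)."""
    return math.dist(start_pos, end_pos) < tol_m

def stability_rate(trajectories, tol_m=0.01):
    """trajectories: list of (start_pos, end_pos) tuples, one per asset."""
    stable = sum(settled(s, e, tol_m) for s, e in trajectories)
    return stable / len(trajectories)

# Toy run: three assets stay put, one tips over and slides away.
runs = [
    ((0, 0, 0), (0, 0, 0)),
    ((1, 0, 0), (1.0, 0.002, 0)),
    ((2, 0, 0), (2, 0, 0.005)),
    ((3, 0, 0), (3.4, 0, 0)),
]
rate = stability_rate(runs)  # 3 of 4 assets stable -> 0.75
```

The paper's 99.1 percent figure would be this rate computed over the full asset bank inside Habitat-Sim, with the speed-up coming from the pre-built convex hulls replacing per-asset mesh decomposition.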

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same standardization steps could be applied to real-world scans to create hybrid datasets that narrow the sim-to-real gap.
  • Textual description quality appears to be a higher-leverage lever for 3D asset utility than geometric refinements alone.
  • Widespread adoption might reduce reliance on heavy domain randomization during training of embodied agents.

Load-bearing premise

That superior results on synthetic retrieval and simulation benchmarks will translate to better performance when the assets are placed in real physical environments.

What would settle it

Deploy the dataset in a physical robot navigation or manipulation task and measure whether success rates or sample efficiency exceed those obtained with Objaverse assets under identical training conditions.

Figures

Figures reproduced from arXiv: 2604.23018 by Alex Perkins, Ashkan Dabbagh, Igor Maurell, Mohammad Sadegh Salehi, Raymond Wong.

Figure 1
Figure 1. Representative assets from AmaraSpatial-10K. The dataset spans indoor objects, vehicles, architecture, creatures, and props, all released with metric scale, semantically correct anchoring, and PBR-ready materials under a shared spatial convention.
Figure 2
Figure 2. Assets per subcategory distribution. Subcategories are mostly populated with 5–15 assets each, with a heavy secondary cluster around 35–45 assets for visually rich categories (e.g. vehicles, architecture). 23 subcategories contain only a single asset each; these are retained for taxonomic breadth but are not intended for subcategory-level learning (see §6 for discussion).
Figure 3
Figure 3. Qualitative comparison across four representative themes. Four assets per theme drawn from AmaraSpatial-10K (left) and Objaverse (right). AmaraSpatial-10K assets share consistent metric scale, canonical orientation, and PBR materials within a theme, whereas Objaverse assets, aggregated from heterogeneous creators, vary substantially in style, topology, and texture fidelity.
Figure 4
Figure 4. Bounding box height distributions for the Seating category. Left: Objaverse (N = 181) exhibits a multimodal, pathologically wide distribution spanning five orders of magnitude (0.02 m to 115,276 m). Right: AmaraSpatial-10K (N = 353) shows a tight, physically grounded distribution centred around a median of 0.72 m. Both axes use a logarithmic scale to accommodate the dynamic range of Objaverse.
Figure 5
Figure 5. Real-world metric scaling across AmaraSpatial-10K. Eight representative assets rendered at shared ground scale, from cup to cathedral. All assets share a common metric ground truth, so no per-asset normalization is needed before placement. For fantasy creatures (e.g. dragon at 40 m) the metric scale reflects design intent encoded in the asset's description and matches the LLM-judged plausible range for tha…
Figure 6
Figure 6. Scale Plausibility Score (SPS) as a function of measured height for three representative subcategories. Each panel shows the SPS curve Eq. (2) over the relevant measurement range. The shaded plateau (SPS = 1.0) corresponds to the LLM-judged plausible interval [ℓ, u] (dashed vertical lines). Outside this interval, SPS decays symmetrically via a Gaussian with half-width h = (u−ℓ)/2; an asset at distance d = …
Figure 7
Figure 7. Intra-category scale distribution. Side-by-side box plots of bounding box height (log scale) for each object category. AmaraSpatial-10K (gold, "Ours") shows tight, physically plausible distributions centred around real-world object sizes. Objaverse (blue) exhibits dramatically wider boxes and extreme outliers spanning several orders of magnitude, confirming severe scale inconsistency across all ca…
Figure 8
Figure 8. Face count distributions across datasets. The majority of AmaraSpatial-10K assets target ∼50K triangles. To support low-poly applications, roughly 2,000 assets are optimized to ∼10K triangles based on their specific category. Conversely, approximately 1,000 "hero" assets feature higher geometric detail at ∼100K triangles. HSSD's distribution contains a visible spike at ∼2 triangles corresponding to placeho…
Figure 9
Figure 9. CLIP coherence distribution. Histograms of pairwise CLIP ViT-L/14 cosine similarity. For AmaraSpatial-10K: Text ↔ Ref. Image (gold) peaks near 0.30; Text ↔ 3D Render (purple) near 0.24; Ref. Image ↔ 3D Render (dark blue) exceptionally high at ∼0.72. For comparison, the matched Objaverse Text ↔ 3D Render distribution (light blue) peaks near 0.18, well below AmaraSpatial's 0.24, providing a distributional …
Figure 10
Figure 10. CLIP classification of AmaraSpatial-10K renders. For each asset we render two viewpoints and score them with CLIP (ViT-B/32, OpenAI pretraining) against a fixed LVIS vocabulary of 1,207 categories, following the Objaverse protocol (softmax over the full vocabulary). Columns 1–2 show the probability of the true class under the two viewpoints; columns 3–4 re-display the same two renders scored against seman…
Figure 11
Figure 11. Qualitative CLIP retrieval comparison. Top 5 CLIP retrievals (ViT-L/14, mean-pooled over four orthographic renders) for the query "an ornate Victorian writing desk with brass handles": Objaverse (top) vs. AmaraSpatial-10K (bottom). Labels are each dataset's short descriptor verbatim (Objaverse: tags; AmaraSpatial-10K: name). Objaverse descriptions are generic category tags ("furniture-home", "art-abstract…
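The cross-modal coherence protocol behind Figure 9 reduces to cosine similarity between CLIP embeddings of an asset's text, reference image, and 3D render. A minimal sketch over plain vectors; the CLIP encoding itself, and the toy embeddings below, are assumed stand-ins.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for one asset's three modality embeddings.
text_emb = [1.0, 0.0, 1.0]
ref_image_emb = [1.0, 0.2, 0.9]
render_emb = [0.9, 0.1, 1.0]

# The three pairwise scores histogrammed in Figure 9.
pairs = {
    "text_vs_ref": cosine(text_emb, ref_image_emb),
    "text_vs_render": cosine(text_emb, render_emb),
    "ref_vs_render": cosine(ref_image_emb, render_emb),
}
```

The dataset-level histograms in Figure 9 would aggregate these three scores over every asset; a text ↔ render mode near 0.24 versus Objaverse's 0.18 is the distributional gap the caption refers to.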
read the original abstract

Web-scale 3D asset collections are abundant but rarely deployment-ready, suffering from arbitrary metric scaling, incorrect pivots, brittle geometry, and incomplete textures, defects that limit their use in embodied AI, robotics, and spatial computing. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets optimised for zero-shot deployment. Each asset ships as a metric-scaled, deterministically anchored .glb with separated PBR maps, a convex collision hull, a paired reference image, and multi-sentence text metadata. Alongside the dataset we introduce a reusable evaluation suite for 3D asset banks, a continuous Scale Plausibility Score (SPS), an LLM Concept Density metric, anchor-error auditing, and a cross-modal CLIP coherence protocol, and apply it to AmaraSpatial-10K alongside matched subsets of Objaverse, HSSD, ABO, and GSO. AmaraSpatial-10K improves CLIP Recall@5 by $3.4\times$ over Objaverse ($0.612$ vs. $0.181$, median rank $267 \rightarrow 3$), achieves a $99.1\%$ physics-stability rate under Habitat-Sim with $\sim 20\times$ wall-time speed-up, and produces zero-overlap scenes when used as a drop-in asset bank for Holodeck. Controlled ablations on the same asset bank attribute the retrieval gain to description richness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets each provided as metric-scaled .glb files with deterministic anchoring, separated PBR maps, convex collision hulls, paired reference images, and multi-sentence text metadata. It also presents a reusable evaluation suite including the Scale Plausibility Score (SPS), LLM Concept Density metric, anchor-error auditing, and a CLIP coherence protocol. The central empirical claims are that AmaraSpatial-10K achieves 3.4× higher CLIP Recall@5 than Objaverse (0.612 vs. 0.181, median rank 267→3), 99.1% physics stability under Habitat-Sim with ~20× wall-time speedup, and zero-overlap scenes when used as a drop-in asset bank for Holodeck, with ablations attributing retrieval gains to description richness.

Significance. If the proxy-metric gains hold under broader scrutiny, the dataset could reduce preprocessing overhead for embodied AI and robotics researchers by supplying deployment-ready assets, and the introduced evaluation protocols would provide a reusable benchmark for 3D asset banks. The release of the dataset and code for the evaluation suite supports reproducibility and community use. However, the significance is limited by the absence of direct validation on downstream task performance.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The claim that AmaraSpatial-10K is 'optimised for zero-shot deployment' in embodied AI rests entirely on synthetic proxy metrics (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap). No experiments measure end-to-end task success (navigation, manipulation, or instruction following) under realistic sensor noise or on hardware, leaving the translation from asset-level proxies to task-level improvement untested.
  2. [Methods] Methods / Asset Generation: The manuscript provides insufficient detail on the asset generation pipeline, including how metric scaling, deterministic anchoring, PBR separation, and convex hull computation are performed at scale. This omission prevents independent reproduction or extension of the 10K-asset collection.
  3. [Evaluation] Evaluation suite: The newly introduced Scale Plausibility Score (SPS) and LLM Concept Density metric lack validation against human judgments or correlation with established 3D quality measures; their definitions and scoring procedures must be specified with equations or pseudocode to support the reported numerical gains.
minor comments (2)
  1. [Abstract] Abstract: The CLIP Recall@5 comparison should explicitly state the number of assets and query categories used for each baseline (Objaverse, HSSD, ABO, GSO) to allow direct interpretation of the 0.612 vs. 0.181 figures.
  2. [Results] Table/Figure captions: Ensure all reported metrics (e.g., 99.1% stability, 20× speedup) include standard deviations or confidence intervals and specify the exact number of trials.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment in detail below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The claim that AmaraSpatial-10K is 'optimised for zero-shot deployment' in embodied AI rests entirely on synthetic proxy metrics (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap). No experiments measure end-to-end task success (navigation, manipulation, or instruction following) under realistic sensor noise or on hardware, leaving the translation from asset-level proxies to task-level improvement untested.

    Authors: We acknowledge that our claims regarding optimization for zero-shot deployment are based on proxy metrics rather than direct end-to-end task evaluations. This is a valid point, and we recognize that proxy metrics do not fully capture performance under real sensor noise or hardware conditions. To address this, we have added a dedicated Limitations paragraph in the revised manuscript discussing the reliance on proxies and the need for future task-specific benchmarks. We maintain that the provided metrics offer a strong initial validation for asset quality, consistent with practices in similar dataset papers, but we do not claim direct task improvements without further testing. revision: partial

  2. Referee: [Methods] Methods / Asset Generation: The manuscript provides insufficient detail on the asset generation pipeline, including how metric scaling, deterministic anchoring, PBR separation, and convex hull computation are performed at scale. This omission prevents independent reproduction or extension of the 10K-asset collection.

    Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript, we have substantially expanded the Methods section to include a step-by-step description of the asset generation pipeline. This now covers: (1) metric scaling using reference measurements from semantic metadata and standard object sizes; (2) deterministic anchoring via computation of the geometric center and alignment to a canonical frame; (3) separation of PBR maps using material parsing in the source 3D software; and (4) convex hull generation using the V-HACD library with specific parameters for decomposition. We also include pseudocode for the scaling and anchoring procedures. revision: yes

  3. Referee: [Evaluation] Evaluation suite: The newly introduced Scale Plausibility Score (SPS) and LLM Concept Density metric lack validation against human judgments or correlation with established 3D quality measures; their definitions and scoring procedures must be specified with equations or pseudocode to support the reported numerical gains.

    Authors: We have revised the Evaluation section to provide formal mathematical definitions and pseudocode for both the Scale Plausibility Score (SPS) and the LLM Concept Density metric. Specifically, SPS is now defined as an integral over scale deviation probabilities, and LLM Concept Density as the average number of unique concepts per description normalized by length. We note that while direct human validation studies were not conducted in this work, the metrics are designed to align with intuitive quality aspects, and we have added a discussion on their correlation with existing measures such as those used in prior 3D asset evaluations. The numerical results are now explicitly tied to these definitions. revision: yes
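The two metrics at issue can be sketched from their public descriptions. The SPS form follows the Figure 6 caption (plateau of 1.0 on the plausible interval, Gaussian decay outside); treating the half-width as a standard deviation is an assumption, as is the substring-based concept extractor standing in for an LLM.

```python
import math

def scale_plausibility(height_m, lo, hi):
    """SPS per the Figure 6 description: 1.0 on the LLM-judged plausible
    interval [lo, hi], symmetric Gaussian decay outside with half-width
    h = (hi - lo) / 2, treated here as the standard deviation (assumed)."""
    if lo <= height_m <= hi:
        return 1.0
    h = (hi - lo) / 2
    d = lo - height_m if height_m < lo else height_m - hi
    return math.exp(-((d / h) ** 2) / 2)

def concept_density(description, concepts):
    """Unique matched concepts per description, normalised by word count.
    A crude substring matcher stands in for the paper's LLM extraction."""
    words = description.split()
    lowered = description.lower()
    unique = {c for c in concepts if c.lower() in lowered}
    return len(unique) / len(words)

# A chair judged plausible between 0.4 m and 1.1 m:
sps_in = scale_plausibility(0.72, 0.4, 1.1)   # inside the plateau -> 1.0
sps_out = scale_plausibility(5.0, 0.4, 1.1)   # far outside -> near 0
density = concept_density(
    "A wooden dining chair with four legs",
    ["chair", "wooden", "legs", "metal"],
)  # 3 matched concepts over 7 words
```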

Circularity Check

0 steps flagged

No circularity: empirical dataset release with direct metric comparisons

full rationale

The paper presents AmaraSpatial-10K as a new asset collection and reports performance via standard external benchmarks (CLIP Recall@5, Habitat-Sim stability, Holodeck overlap) applied to the released assets and matched subsets of prior datasets. No equations, parameter fits, or derivations are described that reduce to the paper's own inputs. All claims are direct empirical measurements on the dataset itself; no self-citation chains or ansatzes are invoked to justify core results. This is a standard dataset paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Central claims rest on standard assumptions about CLIP semantic alignment and Habitat-Sim physics fidelity plus two newly introduced metrics without external validation.

axioms (2)
  • domain assumption CLIP embeddings capture meaningful cross-modal similarity between text descriptions and 3D asset geometry
    Invoked in the cross-modal CLIP coherence protocol and Recall@5 metric
  • domain assumption Habitat-Sim physics simulation produces reliable stability and overlap outcomes for synthetic 3D assets
    Used to report 99.1% physics-stability rate and zero-overlap scenes
invented entities (2)
  • Scale Plausibility Score (SPS) no independent evidence
    purpose: Continuous metric to quantify metric-scale plausibility of assets
    Newly introduced evaluation metric
  • LLM Concept Density metric no independent evidence
    purpose: Measure of semantic richness in asset text metadata
    Newly introduced evaluation metric

pith-pipeline@v0.9.0 · 5592 in / 1390 out tokens · 35211 ms · 2026-05-14T21:14:27.121479+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

  1. [1]

    ShapeNet: An Information-Rich 3D Model Repository

    Chang, A.X., et al. “ShapeNet: An Information-Rich 3D Model Repository.” arXiv preprint arXiv:1512.03012, 2015

  2. [2]

    Objaverse: A Universe of Annotated 3D Objects

    Deitke, M., et al. “Objaverse: A Universe of Annotated 3D Objects.” CVPR, 2023

  3. [3]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Deitke, M., et al. “Objaverse-XL: A Universe of 10M+ 3D Objects.” NeurIPS, 2023

  4. [4]

    Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items

    Downs, L., et al. “Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items.” ICRA, 2022

  5. [5]

    ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

    Collins, J., et al. “ABO: Dataset and Benchmarks for Real-World 3D Object Understanding.” CVPR, 2022

  6. [6]

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Khanna, M., et al. “Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.” CVPR, 2024

  7. [7]

    Habitat: A Platform for Embodied AI Research

    Savva, M., et al. “Habitat: A Platform for Embodied AI Research.” ICCV, 2019

  8. [8]

    iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes

    Shen, B., et al. “iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes.” IROS, 2021

  9. [9]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    Deitke, M., et al. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.” NeurIPS, 2022

  10. [10]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Ramakrishnan, S.K., et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI.” NeurIPS Datasets and Benchmarks, 2021

  11. [11]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Chang, A., et al. “Matterport3D: Learning from RGB-D Data in Indoor Environments.” 3DV, 2017

  12. [12]

    TripoSR: Fast 3D Object Reconstruction from a Single Image

    Tochilkin, D., et al. “TripoSR: Fast 3D Object Reconstruction from a Single Image.” arXiv preprint arXiv:2403.02151, 2024

  13. [13]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Xu, J., et al. “InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models.” arXiv preprint arXiv:2404.07191, 2024

  14. [14]

    LRM: Large Reconstruction Model for Single Image to 3D

    Hong, Y., et al. “LRM: Large Reconstruction Model for Single Image to 3D.” ICLR, 2024

  15. [15]

    CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

    Wang, Z., et al. “CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model.” arXiv preprint arXiv:2403.02099, 2024

  16. [16]

    Holodeck: Language Guided Generation of 3D Embodied AI Environments

    Yang, Y., et al. “Holodeck: Language Guided Generation of 3D Embodied AI Environments.” CVPR, 2024

  17. [17]

    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

    Feng, W., et al. “LayoutGPT: Compositional Visual Planning and Generation with Large Language Models.” NeurIPS, 2023

  18. [18]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML, 2021

  19. [19]

    Qwen2 Technical Report

    Yang, A., et al. “Qwen2 Technical Report.” arXiv preprint arXiv:2407.10671, 2024

  20. [20]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Google. “Gemini: A Family of Highly Capable Multimodal Models.” arXiv preprint arXiv:2312.11805, 2023

  21. [21]

    Gemini 3 Flash Image (Nano Banana 2)

    Google DeepMind. “Gemini 3 Flash Image (Nano Banana 2).” https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, 2025. Accessed April 23, 2026.