ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
arXiv preprint arXiv:2505.02836 (2025)
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 9representative citing papers
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
An intent-driven Real2Sim framework uses VLMs for semantic task decomposition to identify missing physical parameters and generates reactive behavior trees to acquire them via contact-rich robotic interactions on a Franka Panda arm.
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
citing papers explorer
-
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.
-
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
-
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
-
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.
-
Repurposing 3D Generative Model for Autoregressive Layout Generation
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
-
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
-
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
-
Real2Sim via Active Perception with Behavior Trees Automatically Generated by VLMs
An intent-driven Real2Sim framework uses VLMs for semantic task decomposition to identify missing physical parameters and generates reactive behavior trees to acquire them via contact-rich robotic interactions on a Franka Panda arm.
-
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.