ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
How to enable llm with 3d capacity? a survey of spatial reasoning in llm
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
CAD-Coder generates valid CadQuery scripts from text via supervised fine-tuning followed by reinforcement learning with geometric Chamfer Distance rewards and chain-of-thought planning.
PAR3D is a part-aware 3D-MLLM framework with ScenePart dataset, Part-Aware 3D Representation Learning, and Hierarchical Segmentation Query Generation to improve part-level 3D scene understanding.
Introduces Layout-as-Policy (LaP) to turn 3D layout estimation into an iterative policy-learning refinement process for better physical coherence.
OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.
citing papers explorer
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.