The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

Chen Gao; Heng Dong; Jianjie Fang; Kaiyuan Li; Ruiying Peng; Weichen Zhang; Wei Li; Xinlei Chen; Xin Wang; Xin Zeng

arxiv: 2504.04540 · v2 · pith:IDC3K2XE · submitted 2025-04-06 · cs.CV · cs.AI

Weichen Zhang, Ruiying Peng, Xin Zeng, Jianjie Fang, Ziyou Wang, Kaiyuan Li, Heng Dong, Wei Li, Chen Gao, Xin Wang, Xinlei Chen, Yong Li

keywords: llms, spatial, point, reasoning, modalities, cloud, clouds, models

3D Large Language Models (LLMs) that leverage spatial information in point clouds for 3D spatial reasoning have attracted great attention. Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs (MLLMs) to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities. We then evaluate text, 2D, and 3D LLMs on the benchmark to compare how effectively the different modalities support the understanding of spatial concepts. Furthermore, we analyze the reasoning mechanisms behind 3D LLMs that use point clouds. Our findings reveal that: 1) binary spatial reasoning remains challenging for current 3D LLMs, 2) MLLMs based on point cloud and visual modalities demonstrate stronger spatial reasoning capabilities than text-only LLMs, and 3) 3D LLMs exhibit an attention sink phenomenon similar to that in 2D LLMs, which impairs spatial reasoning. We believe these conclusions can inform the next steps in developing 3D LLMs and also offer insights for foundation models in other modalities. We release the dataset and code on the project page: https://github.com/EmbodiedCity/ScanReQA.code.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
cs.CV 2026-05 unverdicted novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward
cs.GR 2025-05 unverdicted novelty 7.0

CAD-Coder generates valid CadQuery scripts from text via supervised fine-tuning followed by reinforcement learning with geometric Chamfer Distance rewards and chain-of-thought planning.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
cs.CV 2026-05 unverdicted novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on the nuanced spatio-temporal reasoning required for furniture assembly videos.

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
cs.CV 2026-03 unverdicted novelty 6.0

GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.