arXiv preprint arXiv:2506.09935 (2025) Egomotion-Aware Video Representation for 3D Scene Understanding
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
citing papers explorer
- Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors
  Stream3D-VLM adds autoregressive streaming control, VSFI geometry integration, GAVC compression, and a 1M-pair benchmark to enable real-time 3D VLM performance that beats prior models on 29 online and offline tasks.
- PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding
  PAR3D is a part-aware 3D-MLLM framework with ScenePart dataset, Part-Aware 3D Representation Learning, and Hierarchical Segmentation Query Generation to improve part-level 3D scene understanding.
- Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
  Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D VQA, grounding, and spatial benchmarks with shorter sequences.
- Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
  Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at higher speed.