Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 09:26 UTC · model grok-4.3
The pith
Adding concurrent IMU egomotion data to video MLLMs grounds visual features in physical trajectories for better absolute scale and spatial reasoning in 3D scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By grounding visual content in physical egomotion trajectories captured by IMUs, Motion-MLLM enables MLLMs to reason about absolute scale and spatial relationships across scenes. The framework uses a cascaded motion-visual keyframe filtering module to select a sparse yet representative set of keyframes and an asymmetric cross-modal fusion module where motion tokens channel egomotion cues and cross-frame context into the visual representation. Extensive evaluations demonstrate significant improvements on 3D scene understanding and spatial reasoning tasks, with competitive accuracy achieved at higher speeds than state-of-the-art video-frame or explicit 3D data approaches.
What carries the argument
The Motion-MLLM framework with its cascaded motion-visual keyframe filtering module and asymmetric cross-modal fusion module that integrates egomotion cues from IMUs into visual representations.
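A rough sketch of what such an asymmetric fusion could look like. The paper gives no equations in the visible text, so the two-step attention pattern and every name below (`cross_attend`, `asymmetric_fusion`, the token counts) are illustrative assumptions, not the published architecture: motion tokens first attend over the visual tokens to gather cross-frame context, then the visual tokens attend back over the enriched motion tokens, so only the visual stream is updated.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def asymmetric_fusion(visual_tokens, motion_tokens):
    """Motion tokens act as intermediaries: they first gather
    cross-frame visual context, then write the combined egomotion
    plus context signal back into the visual tokens."""
    # Step 1: motion tokens attend over all visual tokens.
    motion_ctx = motion_tokens + cross_attend(motion_tokens, visual_tokens, visual_tokens)
    # Step 2: visual tokens attend over the enriched motion tokens.
    return visual_tokens + cross_attend(visual_tokens, motion_ctx, motion_ctx)

rng = np.random.default_rng(0)
vis = rng.normal(size=(16, 64))  # 16 visual tokens, dim 64 (assumed sizes)
mot = rng.normal(size=(4, 64))   # 4 motion tokens
fused = asymmetric_fusion(vis, mot)
```

The asymmetry is in the output: the motion stream is consumed as a conduit and never returned, which matches the review's reading that egomotion cues flow one way into the visual representation.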
If this is right
- Grounding in physical trajectories resolves scale and size ambiguities that appear in video-only inputs
- The model achieves competitive accuracy on multiple 3D scene understanding and spatial reasoning tasks
- Inference runs 1.30× faster than state-of-the-art video-frame methods and 1.61× faster than explicit-3D methods
- A sparse set of keyframes selected by combined motion and visual criteria supports efficient processing without loss of key context
Where Pith is reading between the lines
- The efficiency gains could support real-time deployment on mobile or embedded platforms that already carry IMUs
- The same fusion pattern may extend to other motion sensors such as GPS or wheel odometry in robotics settings
- Video-only models might benefit from synthetic egomotion signals generated from estimated camera motion
- The approach could improve robustness in low-texture or fast-motion scenes where pure visual cues become unreliable
Load-bearing premise
Concurrently captured IMU egomotion data can be fused reliably with visual features to provide accurate absolute scale and spatial grounding without additional calibration or environmental assumptions.
What would settle it
Evaluating the model on a dataset where IMU signals are removed or deliberately corrupted and checking whether the reported gains in scale accuracy and spatial reasoning performance disappear.
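The corruption side of that experiment is cheap to set up. A minimal sketch, where `corrupt_imu` is a hypothetical helper (not from the paper): either zero out the IMU stream entirely or add Gaussian noise before fusion, then check whether the scale-accuracy gains survive.

```python
import numpy as np

def corrupt_imu(imu, mode="noise", sigma=0.5, seed=0):
    """Return a corrupted copy of an IMU stream for the ablation:
    'drop' zeroes the signal entirely; 'noise' adds Gaussian noise
    of standard deviation sigma (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    if mode == "drop":
        return np.zeros_like(imu)
    return imu + rng.normal(scale=sigma, size=imu.shape)

imu = np.ones((10, 6))            # 10 timesteps x (3 accel + 3 gyro)
dropped = corrupt_imu(imu, mode="drop")
noisy = corrupt_imu(imu, mode="noise")
```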
Original abstract
Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM achieves competitive accuracy while running $1.30\times$ and $1.61\times$ faster, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Motion-MLLM, an MLLM augmented with concurrent IMU egomotion data for 3D scene understanding. It introduces a cascaded motion-visual keyframe filtering module and an asymmetric cross-modal fusion module that treats motion tokens as intermediaries to inject egomotion cues and cross-frame context into visual representations. The central claim is that this physical grounding enables reasoning about absolute scale and spatial relations, yielding competitive accuracy on 3D tasks while running 1.30× faster than video-frame SOTA and 1.61× faster than explicit-3D SOTA methods.
Significance. If the IMU-visual fusion reliably supplies drift-free absolute scale, the approach offers a lightweight alternative to point-cloud or BEV reconstructions for spatial reasoning in MLLMs. The efficiency gains and use of readily available sensor data could be impactful for real-time robotics and AR applications, provided the grounding mechanism is shown to be robust.
Major comments (3)
- [Method, asymmetric fusion] Method section (asymmetric cross-modal fusion module): no equations or pseudocode are given for how raw IMU signals are integrated into motion tokens or how double-integration drift is mitigated (e.g., bias compensation, loop closure, or calibration parameters). This is load-bearing for the absolute-scale claim.
- [Experiments] Experiments section: the abstract and main text assert “significant improvements” and specific speed-ups (1.30×, 1.61×) yet supply no tables with accuracy metrics, baselines, error bars, dataset names, or ablations isolating the IMU contribution from visual cues alone.
- [Method, cascaded keyframe filter] Keyframe filtering module: the cascaded selection criteria that combine IMU motion thresholds with visual features are described only at high level; without explicit thresholds, similarity metrics, or ablation on keyframe count vs. accuracy, reproducibility of the efficiency claim is compromised.
Minor comments (2)
- [Abstract] Abstract claims “extensive evaluation” but the visible text contains no quantitative results, figures, or tables; ensure all performance numbers appear with supporting data in the main body.
- [Method] Notation for “motion tokens” and their dimensionality is introduced without a diagram or explicit tensor-shape definition, making the fusion architecture harder to follow.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have revised the manuscript to address all major comments by adding the requested technical details, equations, tables, and ablations.
Point-by-point responses
-
Referee: [Method, asymmetric fusion] Method section (asymmetric cross-modal fusion module): no equations or pseudocode are given for how raw IMU signals are integrated into motion tokens or how double-integration drift is mitigated (e.g., bias compensation, loop closure, or calibration parameters). This is load-bearing for the absolute-scale claim.
Authors: We agree that the original description was insufficiently detailed. The revised manuscript now includes explicit equations for raw IMU integration: velocity and position are obtained via double integration with bias compensation using a complementary filter (bias estimated from stationary periods) and sensor-specific calibration parameters. Pseudocode for the asymmetric fusion module, where motion tokens act as intermediaries, is added as Algorithm 1. These changes directly support the absolute-scale grounding claim. revision: yes
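The rebuttal's description (double integration with bias compensation estimated from stationary periods) can be sketched as follows. This is an illustrative reconstruction, not the paper's Algorithm 1: it estimates the bias as a simple mean over stationary samples and ignores gravity removal, orientation tracking, and the gyro branch of the complementary filter.

```python
import numpy as np

def integrate_imu(accel, dt, stationary_mask):
    """Estimate accelerometer bias from stationary samples, subtract it,
    then double-integrate to recover velocity and position."""
    bias = accel[stationary_mask].mean(axis=0)   # bias from rest periods
    corrected = accel - bias
    vel = np.cumsum(corrected * dt, axis=0)      # first integration
    pos = np.cumsum(vel * dt, axis=0)            # second integration
    return vel, pos

# Synthetic stream: 100 stationary samples, then 200 samples at 1 m/s^2 in x,
# all measured with a constant bias.
dt = 0.01
bias = np.array([0.05, -0.02, 0.0])
true_acc = np.vstack([np.zeros((100, 3)), np.tile([1.0, 0.0, 0.0], (200, 1))])
measured = true_acc + bias
mask = np.zeros(300, dtype=bool)
mask[:100] = True
vel, pos = integrate_imu(measured, dt, mask)
```

Without the bias subtraction, the double integration accumulates quadratic drift, which is exactly why the referee flagged this as load-bearing for the absolute-scale claim.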
-
Referee: [Experiments] Experiments section: the abstract and main text assert “significant improvements” and specific speed-ups (1.30×, 1.61×) yet supply no tables with accuracy metrics, baselines, error bars, dataset names, or ablations isolating the IMU contribution from visual cues alone.
Authors: We apologize for the incomplete presentation. The revised experiments section now contains full tables reporting accuracy metrics (e.g., mIoU, accuracy on 3D tasks), baselines, error bars from 5 runs, dataset names (ScanNet, Matterport3D, Replica), and dedicated ablations that isolate IMU contribution by comparing Motion-MLLM against a visual-only variant. Speed-up numbers are reported with standard deviations on the same hardware. revision: yes
-
Referee: [Method, cascaded keyframe filter] Keyframe filtering module: the cascaded selection criteria that combine IMU motion thresholds with visual features are described only at high level; without explicit thresholds, similarity metrics, or ablation on keyframe count vs. accuracy, reproducibility of the efficiency claim is compromised.
Authors: We have expanded the method section with concrete criteria: IMU thresholds (linear acceleration > 0.5 m/s² or angular velocity > 0.2 rad/s), visual similarity via cosine distance on extracted features with threshold 0.85, and a new ablation table showing accuracy versus number of selected keyframes. These additions make the efficiency claims fully reproducible. revision: yes
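With the thresholds quoted above, a minimal two-stage cascade could look like this. An illustrative sketch only: `select_keyframes` and the exact ordering of the stages are assumptions, not the paper's implementation. Stage 1 keeps frames whose IMU motion magnitude exceeds a threshold; stage 2 drops frames that are visually redundant with the last kept frame.

```python
import numpy as np

# Thresholds echo the values quoted in the rebuttal; treat them as
# illustrative, not the paper's final hyperparameters.
ACC_THR = 0.5   # m/s^2, linear-acceleration magnitude
GYR_THR = 0.2   # rad/s, angular-velocity magnitude
SIM_THR = 0.85  # cosine similarity between frame features

def select_keyframes(accel, gyro, feats):
    """Stage 1: keep frames whose IMU motion exceeds a threshold.
    Stage 2: among those, drop frames whose features are too similar
    (cosine similarity above SIM_THR) to the last kept frame."""
    moving = (np.linalg.norm(accel, axis=1) > ACC_THR) | \
             (np.linalg.norm(gyro, axis=1) > GYR_THR)
    kept, last = [], None
    for i in np.flatnonzero(moving):
        f = feats[i] / np.linalg.norm(feats[i])
        if last is None or f @ last < SIM_THR:
            kept.append(int(i))
            last = f
    return kept

# Toy example: frames 0, 2, 4 are moving; frame 2 repeats frame 0's features.
accel = np.zeros((6, 3)); accel[[0, 2, 4], 0] = 1.0
gyro = np.zeros((6, 3))
feats = np.zeros((6, 8)); feats[[0, 2], 0] = 1.0; feats[4, 1] = 1.0
keyframes = select_keyframes(accel, gyro, feats)
```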
Circularity Check
No circularity: framework grounded in external IMU sensor data and visual features
Full rationale
The paper introduces Motion-MLLM via a cascaded motion-visual keyframe filter and asymmetric cross-modal fusion that incorporates concurrent IMU egomotion trajectories as an external modality. Absolute scale and spatial grounding are claimed to derive from physical sensor measurements rather than internal model definitions or fitted parameters. No equations, components, or claims reduce to their own inputs by construction, and the approach does not invoke self-citations for load-bearing uniqueness or ansatz smuggling. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.