pith. sign in

arxiv: 2605.22819 · v1 · pith:XKMQ5MDUnew · submitted 2026-05-21 · 💻 cs.CV

Cambrian-P: Pose-Grounded Video Understanding

Pith reviewed 2026-05-22 05:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords camera posevideo MLLMspatial reasoningpose regressionvideo QAstreaming pose estimationCambrian-P
0
0 comments X

The pith

Adding per-frame camera pose tokens to video MLLMs yields 4.5-6.5% gains on spatial reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that camera pose provides a shared spatial coordinate frame that relates observations across video frames, a signal largely missing from current multimodal LLMs that treat frames as isolated 2D snapshots. Cambrian-P adds per-frame learnable camera tokens and a pose regression head to ground the model in this persistent 3D structure. With a carefully designed sampling scheme, the model records 4.5-6.5% gains on spatial reasoning benchmarks such as VSI-Bench and generalizes to eight other spatial and general video QA tasks. It also reaches state-of-the-art streaming pose estimation on ScanNet as a byproduct. Training on pseudo-annotated poses from in-the-wild video further lifts general video QA performance, indicating that pose helps beyond narrow spatial reasoning.

Core claim

Camera pose matters because the position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Cambrian-P augments a video MLLM with per-frame learnable camera tokens and a pose regression head. Using a carefully designed sampling scheme, the model achieves 4.5-6.5% gains on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, reaches state-of-the-art streaming pose estimation on ScanNet, and improves general video QA when trained on pseudo-annotated poses from in-the-wild video.

What carries the argument

Per-frame learnable camera tokens and a pose regression head that supply explicit 3D grounding across video frames.

If this is right

  • Substantial gains on spatial reasoning benchmarks such as VSI-Bench.
  • Generalization across eight additional spatial and general video QA benchmarks.
  • State-of-the-art streaming pose estimation on ScanNet as a byproduct.
  • Further improvement on general video QA benchmarks from training on pseudo-annotated in-the-wild poses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pose grounding may act as a lightweight regularizer that improves physical scene consistency even in non-spatial video tasks.
  • Real-time integration of estimated poses could support more stable long-horizon video reasoning systems.
  • Models might eventually learn equivalent 3D relations implicitly if given enough scale and diverse video data.

Load-bearing premise

The observed gains on both spatial and general video QA benchmarks are primarily caused by the addition of pose tokens and the regression head rather than by other unstated changes in training data, architecture, or optimization.

What would settle it

A controlled ablation that trains the identical base model, data, and optimizer with and without the pose tokens and regression head to test whether the benchmark improvements remain.

read the original abstract

Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state of the art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Cambrian-P, a video multimodal LLM augmented with per-frame learnable camera tokens and a pose regression head to incorporate camera pose as a supervisory signal. It claims that a carefully designed sampling scheme yields 4.5-6.5% gains on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, achieves state-of-the-art streaming pose estimation on ScanNet as a byproduct, and that training on pseudo-annotated in-the-wild poses further improves general video QA, indicating pose helps beyond spatial reasoning.

Significance. If the reported gains can be rigorously attributed to the added camera tokens and regression head rather than the sampling scheme or other unablated factors, the work would usefully demonstrate camera pose as a lightweight signal for improving spatial and physical-world reasoning in video MLLMs. The SOTA streaming pose result is a clear secondary contribution, and the in-the-wild pseudo-pose experiment suggests broader applicability. The absence of detailed controls currently limits the strength of these conclusions.

major comments (2)
  1. Abstract: The central claim attributes 4.5-6.5% gains on VSI-Bench and generalization to eight other benchmarks to the per-frame camera tokens plus regression head, yet the abstract explicitly flags a 'carefully designed sampling scheme' without stating that baselines were retrained with matched data volume, frame selection, optimization schedule, and architecture. This is load-bearing for the attribution of improvements to pose.
  2. Experiments (in-the-wild section): The result that pseudo-annotated poses from in-the-wild video improve general video QA benchmarks simultaneously introduces new training data and the pose signal, leaving open whether the gains arise from the pose tokens, the regression objective, or simply the additional data volume.
minor comments (2)
  1. Method: The integration of per-frame learnable camera tokens into the transformer layers would benefit from an explicit equation or diagram showing how they are added to the visual token sequence.
  2. Abstract: The exact per-benchmark deltas within the 4.5-6.5% range and the identities of the eight additional benchmarks should be stated for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments correctly identify areas where stronger experimental controls would improve the attribution of results. We have revised the manuscript to address these points directly.

read point-by-point responses
  1. Referee: Abstract: The central claim attributes 4.5-6.5% gains on VSI-Bench and generalization to eight other benchmarks to the per-frame camera tokens plus regression head, yet the abstract explicitly flags a 'carefully designed sampling scheme' without stating that baselines were retrained with matched data volume, frame selection, optimization schedule, and architecture. This is load-bearing for the attribution of improvements to pose.

    Authors: We agree that matched training conditions are essential for attributing gains specifically to the camera tokens and regression head. All baselines were retrained with identical data volume, frame selection, optimization schedule, and architecture; the only differences are the added per-frame camera tokens and pose regression head. We have revised the abstract to state that baselines were trained under these matched conditions and added a new ablation subsection in the Experiments section that explicitly details the controls and isolates the contribution of the pose components. revision: yes

  2. Referee: Experiments (in-the-wild section): The result that pseudo-annotated poses from in-the-wild video improve general video QA benchmarks simultaneously introduces new training data and the pose signal, leaving open whether the gains arise from the pose tokens, the regression objective, or simply the additional data volume.

    Authors: We acknowledge this potential confound between additional data volume and the pose signal. In the revised manuscript we include a control experiment that trains on the identical set of in-the-wild videos but omits the camera tokens and regression head. The results show that the pose components provide gains beyond those from data volume alone. We have updated the in-the-wild section to present this control and discuss its implications. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical addition of external pose signal

full rationale

The paper augments a video MLLM with per-frame camera tokens and a regression head as an external supervisory signal, reporting empirical gains on VSI-Bench and other QA benchmarks after a sampling scheme. No derivation chain, equations, or first-principles predictions are present that reduce to fitted inputs by construction, self-definitions, or load-bearing self-citations. Claims rest on experimental results against external benchmarks rather than internal redefinitions of performance metrics, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the premise that camera pose supplies a useful shared spatial frame and that pseudo-annotations from in-the-wild video are sufficiently accurate to drive measurable gains.

axioms (1)
  • domain assumption Camera pose defines a shared spatial coordinate frame that relates observations across video frames.
    Explicitly stated as the opening motivation in the abstract.
invented entities (2)
  • per-frame learnable camera tokens no independent evidence
    purpose: To inject pose information into the multimodal LLM
    New component added to the model architecture.
  • pose regression head no independent evidence
    purpose: To enable streaming pose estimation as a byproduct
    New output head attached to the model.

pith-pipeline@v0.9.0 · 5726 in / 1404 out tokens · 64901 ms · 2026-05-22T05:43:14.633198+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. ... interleaved training strategy ... random-jitter frame sampling ... L = L_NTP + λ_pose · L_pose

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 30 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InICCV, 2019

  3. [3]

    Mapillary planet-scale depth dataset

    Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulo, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. InECCV, 2020

  4. [4]

    3d semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. InCVPR, 2016

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  7. [7]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  8. [8]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  9. [9]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InICCV, 2021

  10. [10]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. InNeurIPS, 2021

  11. [11]

    Sims-v: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

    Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, and Saining Xie. SIMS-V: Simulated instruction-tuning for spatial video understanding.arXiv preprint arXiv:2511.04668, 2025

  12. [12]

    train on the test set

    Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, and Saining Xie. Benchmark designers should “train on the test set” to expose exploitable non-visual shortcuts.arXiv preprint arXiv:2511.04655, 2025

  13. [13]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, 2020

  14. [14]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. InECCV, 2012

  15. [15]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. InCVPR, 2015. 17

  16. [16]

    Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age.T-RO, 2017

    Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age.T-RO, 2017

  17. [17]

    Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

  18. [18]

    A Short Note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600.arXiv preprint arXiv:1808.01340, 2018

  19. [19]

    A short note on the kinetics-700 human action dataset.arXiv preprint arXiv:1907.06987, 2019

    Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset.arXiv preprint arXiv:1907.06987, 2019

  20. [20]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024

  21. [21]

    Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, et al. Gui-world: A video benchmark and dataset for multimodal gui-oriented understanding. InICLR, 2025

  22. [22]

    Eagle 2.5: Boosting long-context post-training for frontier vision-language models

    Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models. InNeurIPS, 2025

  23. [23]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

  24. [24]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024

  25. [25]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  26. [26]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InCVPR, 2017

  27. [27]

    Scaling egocentric vision: The EPIC-KITCHENS dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. InECCV, 2018

  28. [28]

    Procthor: Large-scale embodied ai using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation. InNeurIPS, 2022

  29. [29]

    Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction. InCVPR, 2026

  30. [30]

    Accurate, dense, and robust multiview stereopsis.TP AMI, 2009

    Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis.TP AMI, 2009

  31. [31]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fründ, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The “something something” video database for learning and evaluating visual common sense. InICCV, 2017. 18

  32. [32]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  33. [33]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

  34. [34]

    Cambridge university press, 2003

    Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision. Cambridge university press, 2003

  35. [35]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, 2022

  36. [36]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826, 2025

  37. [37]

    G 2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, 2025

    Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G 2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, 2025

  38. [38]

    Sail-vos 3d: A synthetic dataset and baselines for object detection and 3d mesh reconstruction from video data

    Yuan-Ting Hu, Jiahong Wang, Raymond A Yeh, and Alexander G Schwing. Sail-vos 3d: A synthetic dataset and baselines for object detection and 3d mesh reconstruction from video data. InCVPR, 2021

  39. [39]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tian- chang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025

  40. [40]

    Deepmvs: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. InCVPR, 2018

  41. [41]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  42. [42]

    Dynamicstereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. InCVPR, 2023

  43. [43]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

  44. [44]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

  45. [45]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 2023

  46. [46]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language- action model.arXiv preprint arXiv:2406.09246, 2024

  47. [47]

    Llava-onevision: Easy visual task transfer.TMLR, 2025

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.TMLR, 2025

  48. [48]

    Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

    Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, and Xiaodan Liang. Thinking with geometry: Active geometry integration for spatial reasoning.arXiv preprint arXiv:2602.06037, 2026. 19

  49. [49]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, 2023

  50. [50]

    MVbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024

  51. [51]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. InCVPR, 2018

  52. [52]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

  53. [53]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  54. [54]

    Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

    Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, et al. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence.arXiv preprint arXiv:2512.10863, 2025

  55. [55]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, 2024

  56. [56]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR, 2024

  57. [57]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023

  58. [58]

    Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

    Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, and Jiaqi Wang. Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

  59. [59]

    A computer algorithm for reconstructing a scene from two projections.Nature, 1981

    H Christopher Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections.Nature, 1981

  60. [60]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InICLR, 2024

  61. [61]

    Springer Science & Business Media, 2005

    Yi Ma, Stefano Soatto, Jana Kosecka, and S Shankar Sastry.An Invitation to 3-D Vision: From Images to Geometric Models. Springer Science & Business Media, 2005

  62. [62]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. InNeurIPS, 2023

  63. [63]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. InCVPR, 2023

  64. [64]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 2021

  65. [65]

    Orb-slam: A versatile and accurate monocular slam system.T-RO, 2015

    Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system.T-RO, 2015

  66. [66]

    Mast3r-slam: Real-time dense slam with 3d reconstruction priors

    Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. InCVPR, 2025. 20

  67. [67]

    Visual odometry

    David Nistér, Oleg Naroditsky, and James Bergen. Visual odometry. InCVPR, 2004

  68. [68]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  69. [69]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

  70. [70]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. InNeurIPS, 2023

  71. [71]

    Vins-mono: A robust and versatile monocular visual- inertial state estimator.T-RO, 2018

    Tong Qin, Peiliang Li, and Shaojie Shen. Vins-mono: A robust and versatile monocular visual- inertial state estimator.T-RO, 2018

  72. [72]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021

  73. [73]

    Does spatial cognition emerge in frontier models? InICLR, 2025

    Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, and Vladlen Koltun. Does spatial cognition emerge in frontier models? InICLR, 2025

  74. [74]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InCVPR, 2024

  75. [75]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InICCV, 2021

  76. [76]

    A dataset for movie descrip- tion

    Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie descrip- tion. InCVPR, 2015

  77. [77]

    Scienceqa: A novel resource for question answering on scholarly articles.IJDL, 2022

    Tanik Saikh, Tirthankar Ghosal, Amish Mittal, Asif Ekbal, and Pushpak Bhattacharyya. Scienceqa: A novel resource for question answering on scholarly articles.IJDL, 2022

  78. [78]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016

  79. [79]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InCVPR, 2017

  80. [80]

    A comparison and evaluation of multi-view stereo reconstruction algorithms

    Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. InCVPR, 2006

Showing first 80 references.