pith. sign in

arxiv: 2602.00181 · v3 · submitted 2026-01-30 · 💻 cs.CV · cs.AI

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Pith reviewed 2026-05-16 09:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords camera movement understandingstructured spatial reasoningobservation-thinking-answerreinforcement learningvideo spatial intelligencemultimodal modelslogical alignment
0
0 comments X

The pith

CamReasoner reframes camera movement understanding as an explicit Observation-Thinking-Answer process reinforced by RL to ground inferences in geometric structure rather than visual patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that multimodal models misclassify camera motions when they rely on superficial visual patterns instead of true spatial geometry. CamReasoner introduces the Observation-Thinking-Answer paradigm that requires the model to first state spatio-temporal observations, then reason through motion patterns in a dedicated block, and finally produce the answer. Training uses a custom suite of 18k supervised reasoning chains and 38k RL feedback samples to enforce this structure on a vision-language backbone. If the approach holds, models produce motion inferences that follow cinematic logic instead of contextual shortcuts, which matters for any video task that needs reliable spatial awareness.

Core claim

CamReasoner reformulates camera movement understanding as a structured inference process using the Observation-Thinking-Answer paradigm. It builds a Large-scale Inference Trajectory Suite containing 18k SFT reasoning chains and 38k RL feedback samples. The method applies RL for logical alignment so that motion inferences rest on explicit visual reasoning rather than guesswork. When applied to Qwen2.5-VL-7B, the resulting model shows higher accuracy on binary classification and VQA benchmarks for camera dynamics.

What carries the argument

The Observation-Thinking-Answer (O-T-A) paradigm, which inserts an explicit reasoning block between observation and answer and is reinforced through RL on structured trajectories.

If this is right

  • The model can separate physically distinct motions that produce similar-looking image sequences.
  • Motion inferences become traceable to explicit spatio-temporal observations instead of unstated priors.
  • Performance gains appear consistently across both classification and open-ended VQA tasks for camera dynamics.
  • The same trajectory construction and RL alignment can be reused on other video understanding problems that require spatial logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit reasoning blocks could allow downstream systems to inspect or correct the model's geometric assumptions before using the answer.
  • The method points toward using RL to enforce logical constraints across a wider range of multimodal spatial tasks.
  • Pairing the generated reasoning chains with 3D reconstruction algorithms would provide an independent check on whether the stated geometry matches the actual scene.

Load-bearing premise

That RL feedback on the reasoning chains teaches genuine geometric understanding of camera motions rather than teaching the model to output text that matches the training distribution.

What would settle it

Measure accuracy on a held-out set of camera sequences whose geometric properties, such as novel pan-tilt-roll combinations, lie outside the patterns present in the 18k training trajectories; if accuracy falls to the original backbone level, the claim fails.

read the original abstract

Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present \textbf{CamReasoner}, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to articulate spatio-temporal observations and reason about motion patterns within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. To the best of our knowledge, \textbf{we are the first to employ RL for logical alignment in camera movement understanding}, ensuring motion inferences are grounded in structured visual reasoning rather than contextual guesswork. Built upon Qwen2.5-VL-7B, CamReasoner-7B improves binary classification accuracy from 73.8\% to 78.4\% and VQA accuracy from 60.9\% to 74.5\% over its backbone, consistently outperforming both proprietary and open-source baselines across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CamReasoner, a framework that reformulates camera movement understanding as a structured O-T-A (Observation-Thinking-Answer) inference process. It constructs a dataset of 18k SFT reasoning chains and 38k RL feedback samples, applies RL for logical alignment on the Qwen2.5-VL-7B backbone, and reports accuracy gains from 73.8% to 78.4% on binary classification and 60.9% to 74.5% on VQA, claiming to be the first to use RL to ground motion inferences in geometric reasoning rather than superficial patterns.

Significance. If the central claim holds and RL is shown to enforce geometric constraints rather than distributional matching, the work would advance video spatial intelligence by moving multimodal models beyond black-box classification toward explicit spatio-temporal reasoning. The scale of the constructed trajectory suite and the reported gains over both open and proprietary baselines would be notable contributions, but the current evidence does not yet isolate the mechanism.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: The claim that RL on the 38k feedback samples produces geometrically grounded inferences (rather than improved linguistic match to O-T-A chains) is load-bearing for the central contribution, yet no ablation is reported that trains the identical backbone on the 18k SFT chains alone and measures reduction in specific geometric errors (e.g., confusing pure translation with rotation).
  2. [Methods] Methods section: No quantitative verification is provided that the RL reward or feedback penalizes trajectories inconsistent with 3D camera models; the reported accuracy lifts are consistent with either geometric enforcement or better surface-form matching to the training distribution.
  3. [Results] Results section: The accuracy improvements (73.8%→78.4% binary, 60.9%→74.5% VQA) are presented without error bars, multiple random seeds, or statistical significance tests, and without breakdown by motion type, making it impossible to assess robustness or whether gains concentrate on geometrically distinct cases.
minor comments (2)
  1. [Introduction] The O-T-A paradigm is introduced without a formal definition or pseudocode for the reasoning block structure.
  2. [Dataset] Dataset construction details (how the 38k RL samples were generated and filtered) are referenced but lack explicit statistics on geometric validity rates before/after filtering.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our experimental design and evidence presentation. We address each major comment below and commit to revisions that strengthen the isolation of RL's contribution to geometric reasoning.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that RL on the 38k feedback samples produces geometrically grounded inferences (rather than improved linguistic match to O-T-A chains) is load-bearing for the central contribution, yet no ablation is reported that trains the identical backbone on the 18k SFT chains alone and measures reduction in specific geometric errors (e.g., confusing pure translation with rotation).

    Authors: We agree that an explicit SFT-only ablation is necessary to isolate RL's role in enforcing geometric constraints beyond surface-form matching to the O-T-A format. In the revised manuscript we will add this ablation: the Qwen2.5-VL-7B backbone will be trained solely on the 18k SFT reasoning chains, then evaluated on the same test set with a breakdown of error types (pure translation vs. rotation confusion, incorrect depth ordering, etc.). This will quantify the incremental reduction in geometrically inconsistent predictions attributable to the RL stage. revision: yes

  2. Referee: [Methods] Methods section: No quantitative verification is provided that the RL reward or feedback penalizes trajectories inconsistent with 3D camera models; the reported accuracy lifts are consistent with either geometric enforcement or better surface-form matching to the training distribution.

    Authors: The RL feedback is generated by comparing generated O-T-A trajectories against reference chains that encode explicit geometric relations (e.g., optical-flow direction, vanishing-point shifts, and parallax cues). We will add a quantitative verification subsection that measures the fraction of post-RL outputs violating basic 3D camera-model constraints (e.g., inconsistent epipolar geometry or impossible motion vectors) on a held-out set of 500 samples, comparing pre- and post-RL rates to demonstrate that the reward indeed penalizes geometrically invalid reasoning rather than merely improving linguistic fidelity. revision: yes

  3. Referee: [Results] Results section: The accuracy improvements (73.8%→78.4% binary, 60.9%→74.5% VQA) are presented without error bars, multiple random seeds, or statistical significance tests, and without breakdown by motion type, making it impossible to assess robustness or whether gains concentrate on geometrically distinct cases.

    Authors: We will revise the Results section to report means and standard deviations over three independent random seeds, include error bars on all bar plots, apply statistical significance tests (paired t-test and McNemar's test), and provide per-motion-type breakdowns (translation, rotation, zoom, combined motions). This will allow readers to verify that gains are concentrated on geometrically distinct cases rather than uniform distributional improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: new data and RL application remain independent of inputs

full rationale

The derivation chain consists of constructing an external 18k+38k trajectory dataset, applying the O-T-A format, and running RL on the public Qwen2.5-VL-7B backbone. All reported accuracy gains are presented as measured outcomes on separate benchmarks rather than quantities that reduce by definition or self-citation to the training chains themselves. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text that would collapse the central claim into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that human-written reasoning chains capture geometric truth and that RL can align model outputs to them without introducing new artifacts.

pith-pipeline@v0.9.0 · 5539 in / 1069 out tokens · 29676 ms · 2026-05-16T09:59:24.095993+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  2. EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.

  3. VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

    cs.CV 2026-04 unverdicted novelty 6.0

    VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 3 Pith papers · 19 internal anchors

  1. [1]

    The anatomy of video editing: A dataset and benchmark suite for ai-assisted video editing, 2022

    Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, and In So Kweon. The anatomy of video editing: A dataset and benchmark suite for ai-assisted video editing, 2022. URLhttps://arxiv.org/abs/2207.09812

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

  4. [4]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancingmultimodalreasoning: Fromoptimizedcoldstarttostagedreinforcementlearning.arXivpreprintarXiv:2506.04207, 2025

  5. [5]

    Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

    Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

  7. [7]

    Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

  8. [8]

    IEEE TPAMI29(6), 1052–1067 (2007).https://doi.org/10

    Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. Monoslam: Real-time single camera slam.IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007. doi: 10.1109/TPAMI.2007.1049

  9. [9]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024

  10. [10]

    Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images.arXiv preprint arXiv:2510.11718, 2025

    Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, et al. Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images.arXiv preprint arXiv:2510.11718, 2025

  11. [11]

    LSD-SLAM:Large-ScaleDirectMonocularSLAM

    JakobEngel,ThomasSchöps,andDanielCremers. LSD-SLAM:Large-ScaleDirectMonocularSLAM. InEuropeanConference on Computer Vision (ECCV), volume 8690 ofLecture Notes in Computer Science, pages 834–849, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10604-5. doi: 10.1007/978-3-319-10605-2

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  13. [14]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

  14. [15]

    Geometry-guided camera motion understanding in videollms.arXiv preprint arXiv:2603.13119, 2026

    Haoan Feng, Sri Harsha Musunuri, and Guan-Ming Su. Geometry-guided camera motion understanding in videollms.arXiv preprint arXiv:2603.13119, 2026. 12

  15. [16]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  16. [17]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  17. [18]

    Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

    Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

  18. [19]

    MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models, 2025. URL https://arxiv.org/abs/2501.02955

  19. [20]

    Movienet: A holistic dataset for movie under- standing, 2020

    Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding, 2020. URLhttps://arxiv.org/abs/2007.10937

  20. [21]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  21. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  22. [23]

    Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint arXiv:2412.09621, 2024

    Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos, 2025. URLhttps://arxiv.org/abs/2412.09621

  23. [24]

    Veu-bench: Towards comprehensive understanding of video editing

    Bozheng Li, Yongliang Wu, Yi Lu, Jiashuo Yu, Licheng Tang, Jiawang Cao, Wenqing Zhu, Yuyang Sun, Jay Wu, and Wenbo Zhu. Veu-bench: Towards comprehensive understanding of video editing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13671–13680, 2025

  24. [25]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models.arXiv preprint arXiv:2407.07895, 2024

  25. [26]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025

  26. [27]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

  27. [28]

    Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

    Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

  28. [29]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

  29. [30]

    arXiv preprint arXiv:2504.15376 , year=

    Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, et al. Towards understanding camera motions in any video.arXiv preprint arXiv:2504.15376, 2025

  30. [31]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conf...

  31. [32]

    Shotbench: Expert-level cinematic understanding in vision-language models, 2025

    Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, and Ziwei Liu. Shotbench: Expert-level cinematic understanding in vision-language models, 2025. URLhttps://arxiv.org/abs/2506.21356

  32. [33]

    Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

    Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

  33. [34]

    a1: Steep test-time scaling law via environment augmented generation.arXiv preprint arXiv:2504.14597, 2025

    Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, and Xueqi Cheng. a1: Steep test-time scaling law via environment augmented generation.arXiv preprint arXiv:2504.14597, 2025

  34. [35]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URLhttps://arxiv.org/abs/2507.13334. 13

  35. [36]

    Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

  36. [37]

    A unified framework for shot type classification based on subject centric lens, 2020

    Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A unified framework for shot type classification based on subject centric lens, 2020. URLhttps://arxiv.org/abs/2008.03548

  37. [38]

    Schonberger and Jan-Michael Frahm

    Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  38. [39]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  39. [40]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  40. [41]

    Reinforcement fine-tuning powers reasoning ca- pability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

    Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

  41. [42]

    Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025

    Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, et al. Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025

  42. [43]

    Vggsfm: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21686–21697, 2024

  43. [44]

    Tarsier: Recipes for training and evaluating large video description models

    JiaweiWang,LipingYuan,YuchenZhang,andHaomiaoSun. Tarsier: Recipesfortrainingandevaluatinglargevideodescription models, 2024. URLhttps://arxiv.org/abs/2407.00634

  44. [45]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025

  45. [46]

    Refineshot: Rethinking cinematography understanding with foundational skill evaluation.arXiv preprint arXiv:2510.02423, 2025

    Hang Wu, Yujun Cai, Haonan Ge, Hongkai Chen, Ming-Hsuan Yang, and Yiwei Wang. Refineshot: Rethinking cinematography understanding with foundational skill evaluation.arXiv preprint arXiv:2510.02423, 2025

  46. [47]

    Thinking with sound: Audio chain-of-thought enables multimodal reasoning in large audio-language models,

    Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, and Yiwei Wang. Thinking with sound: Audio chain-of-thought enables multimodal reasoning in large audio-language models.arXiv preprint arXiv:2509.21749, 2025

  47. [48]

    Seg-r1: Segmentation can be surprisingly simple with reinforcement 33 ConceptSeg-R1 learning

    Zuyao You and Zuxuan Wu. Seg-r1: Segmentation can be surprisingly simple with reinforcement learning.arXiv preprint arXiv:2506.22624, 2025

  48. [49]

    arXiv preprint arXiv:2504.07954 , year =

    En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

  49. [50]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data, 2025. URLhttps://arxiv.org/abs/2410.02713

  50. [51]

    Reinforced mllm: A survey on rl-based reasoning in multimodal large language models.arXiv preprint arXiv:2504.21277, 2025

    Guanghao Zhou, Panjia Qiu, Cen Chen, Jie Wang, Zheming Yang, Jian Xu, and Minghui Qiu. Reinforced mllm: A survey on rl-based reasoning in multimodal large language models.arXiv preprint arXiv:2504.21277, 2025

  51. [52]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images, 2018. URLhttps://arxiv.org/abs/1805.09817

  52. [53]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. URL https://arxiv.org/abs/2504.10479. 14 Figure 5 Distribution of camera movement categories in CamReasoning-SFT-18k.The dataset encompasses a diverse range of cinematographic motions, with a primary focus on dynamic rotations and stable...