CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Bowen Sun; Hang Wu; Haonan Ge; Junsong Yuan; Yiwei Wang; Yujun Cai; Zehao Li

arxiv: 2602.00181 · v3 · submitted 2026-01-30 · 💻 cs.CV · cs.AI

Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, Yiwei Wang

Pith reviewed 2026-05-16 09:59 UTC · model grok-4.3

keywords: camera movement understanding · structured spatial reasoning · observation-thinking-answer · reinforcement learning · video spatial intelligence · multimodal models · logical alignment

The pith

CamReasoner reframes camera movement understanding as an explicit Observation-Thinking-Answer process reinforced by RL to ground inferences in geometric structure rather than visual patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that multimodal models misclassify camera motions when they rely on superficial visual patterns instead of true spatial geometry. CamReasoner introduces the Observation-Thinking-Answer paradigm that requires the model to first state spatio-temporal observations, then reason through motion patterns in a dedicated block, and finally produce the answer. Training uses a custom suite of 18k supervised reasoning chains and 38k RL feedback samples to enforce this structure on a vision-language backbone. If the approach holds, models produce motion inferences that follow cinematic logic instead of contextual shortcuts, which matters for any video task that needs reliable spatial awareness.

Core claim

CamReasoner reformulates camera movement understanding as a structured inference process using the Observation-Thinking-Answer paradigm. It builds a Large-scale Inference Trajectory Suite containing 18k SFT reasoning chains and 38k RL feedback samples. The method applies RL for logical alignment so that motion inferences rest on explicit visual reasoning rather than guesswork. When applied to Qwen2.5-VL-7B, the resulting model shows higher accuracy on binary classification and VQA benchmarks for camera dynamics.

What carries the argument

The Observation-Thinking-Answer (O-T-A) paradigm, which inserts an explicit reasoning block between observation and answer and is reinforced through RL on structured trajectories.
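
One way to make the O-T-A structure concrete is a small parser that checks whether a model response consists of the three blocks in order. This is a minimal sketch: the tag names `<observation>`, `<think>`, and `<answer>` and the names `parse_ota` and `OTAResponse` are assumptions for illustration, since the abstract does not say how the blocks are delimited.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical tag names; the paper's actual delimiters are not given in the abstract.
OTA_PATTERN = re.compile(
    r"\s*<observation>(?P<observation>.*?)</observation>"
    r"\s*<think>(?P<thinking>.*?)</think>"
    r"\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


class OTAResponse(NamedTuple):
    observation: str
    thinking: str
    answer: str


def parse_ota(text: str) -> Optional[OTAResponse]:
    """Return the blocks if the text is observation, thinking, answer in that order, else None."""
    match = OTA_PATTERN.fullmatch(text)
    if match is None:
        return None
    names = ("observation", "thinking", "answer")
    return OTAResponse(*(match.group(name).strip() for name in names))
```

A response that skips the observation or reorders the blocks returns `None`, which is the kind of format check a structure-enforcing reward could build on.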

If this is right

The model can separate physically distinct motions that produce similar-looking image sequences.
Motion inferences become traceable to explicit spatio-temporal observations instead of unstated priors.
Performance gains appear consistently across both classification and open-ended VQA tasks for camera dynamics.
The same trajectory construction and RL alignment can be reused on other video understanding problems that require spatial logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit reasoning blocks could allow downstream systems to inspect or correct the model's geometric assumptions before using the answer.
The method points toward using RL to enforce logical constraints across a wider range of multimodal spatial tasks.
Pairing the generated reasoning chains with 3D reconstruction algorithms would provide an independent check on whether the stated geometry matches the actual scene.

Load-bearing premise

That RL feedback on the reasoning chains teaches genuine geometric understanding of camera motions rather than teaching the model to output text that matches the training distribution.

What would settle it

Measure accuracy on a held-out set of camera sequences whose geometric properties, such as novel pan-tilt-roll combinations, lie outside the patterns present in the 18k training trajectories; if accuracy falls to the original backbone level, the claim fails.
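
The test above can be sketched as a comparison of held-out accuracy between the tuned model and its backbone. Everything in this sketch is hypothetical: the function names, the `margin` tolerance, and the label strings are illustrative, and the paper does not describe such an evaluation.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that equal their label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def gain_survives_held_out(
    model_preds: Sequence[str],
    backbone_preds: Sequence[str],
    labels: Sequence[str],
    margin: float = 0.0,
) -> bool:
    """True if the tuned model beats the backbone on held-out data by more than `margin`."""
    return accuracy(model_preds, labels) - accuracy(backbone_preds, labels) > margin
```

A result of `False` on sequences outside the training patterns would correspond to the failure case described above.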

Original abstract

Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to articulate spatio-temporal observations and reason about motion patterns within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. To the best of our knowledge, we are the first to employ RL for logical alignment in camera movement understanding, ensuring motion inferences are grounded in structured visual reasoning rather than contextual guesswork. Built upon Qwen2.5-VL-7B, CamReasoner-7B improves binary classification accuracy from 73.8% to 78.4% and VQA accuracy from 60.9% to 74.5% over its backbone, consistently outperforming both proprietary and open-source baselines across multiple benchmarks.
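
For scale, the reported accuracies can be restated as absolute gains and as relative reductions in error rate. This is only arithmetic on the figures quoted in the abstract; the helper name `gain_summary` is illustrative.

```python
from typing import Tuple


def gain_summary(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative error reduction in percent)."""
    absolute = after - before
    # Error rate before tuning is 100 - before; the gain is the share of it removed.
    relative_error_reduction = 100.0 * absolute / (100.0 - before)
    return round(absolute, 1), round(relative_error_reduction, 1)


binary = gain_summary(73.8, 78.4)  # binary classification accuracy, in percent
vqa = gain_summary(60.9, 74.5)     # VQA accuracy, in percent
print(binary, vqa)
```

On these numbers the VQA gain is 13.6 points (about a 35% relative cut in error) against 4.6 points (about 18%) for binary classification.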

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CamReasoner adds an O-T-A reasoning format and RL on new camera trajectory data to improve motion understanding over Qwen2.5-VL, but the abstract leaves open whether the gains come from geometric grounding or just better text matching.

The paper's main move is to treat camera movement as an explicit Observation-Thinking-Answer process instead of a direct classification. They build 18k SFT reasoning chains and 38k RL feedback samples into a 56k trajectory suite, then fine-tune Qwen2.5-VL-7B with RL for logical alignment. That produces reported lifts to 78.4% binary accuracy and 74.5% VQA accuracy, beating the backbone and several baselines. The data construction and the shift away from black-box outputs are the concrete pieces of work here, and the claim of being first to apply RL this way for camera dynamics is stated plainly. The approach is straightforward and targets a real gap in video spatial tasks.

The soft spot is the missing evidence that RL is doing geometric work rather than teaching the model to output the expected reasoning style. The abstract shows no ablations that isolate the RL stage from the SFT chains alone, no error bars, and no direct checks against 3D camera models to confirm that mistakes like confusing translation with rotation actually drop. Without those, the central claim stays plausible but unproven from what is given.

This is for people working on multimodal video models who want structured reasoning in spatial tasks. Readers focused on camera dynamics or cinematic understanding will find the paradigm and data useful to test or extend. It has enough of a clear idea and measurable results to deserve a serious referee, mainly to examine the RL setup and any geometric validation in the full methods. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces CamReasoner, a framework that reformulates camera movement understanding as a structured O-T-A (Observation-Thinking-Answer) inference process. It constructs a dataset of 18k SFT reasoning chains and 38k RL feedback samples, applies RL for logical alignment on the Qwen2.5-VL-7B backbone, and reports accuracy gains from 73.8% to 78.4% on binary classification and 60.9% to 74.5% on VQA, claiming to be the first to use RL to ground motion inferences in geometric reasoning rather than superficial patterns.

Significance. If the central claim holds and RL is shown to enforce geometric constraints rather than distributional matching, the work would advance video spatial intelligence by moving multimodal models beyond black-box classification toward explicit spatio-temporal reasoning. The scale of the constructed trajectory suite and the reported gains over both open and proprietary baselines would be notable contributions, but the current evidence does not yet isolate the mechanism.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: The claim that RL on the 38k feedback samples produces geometrically grounded inferences (rather than improved linguistic match to O-T-A chains) is load-bearing for the central contribution, yet no ablation is reported that trains the identical backbone on the 18k SFT chains alone and measures reduction in specific geometric errors (e.g., confusing pure translation with rotation).
[Methods] Methods section: No quantitative verification is provided that the RL reward or feedback penalizes trajectories inconsistent with 3D camera models; the reported accuracy lifts are consistent with either geometric enforcement or better surface-form matching to the training distribution.
[Results] Results section: The accuracy improvements (73.8%→78.4% binary, 60.9%→74.5% VQA) are presented without error bars, multiple random seeds, or statistical significance tests, and without breakdown by motion type, making it impossible to assess robustness or whether gains concentrate on geometrically distinct cases.
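
The reporting requested in the third comment could look like the sketch below: accuracy per motion type, aggregated as mean and sample standard deviation across seeds. The record layout `(seed, motion_type, correct)` and the function name are assumptions for illustration, not data or code from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple


def per_type_stats(
    records: Iterable[Tuple[int, str, bool]],
) -> Dict[str, Tuple[float, float]]:
    """Map motion type -> (mean accuracy, sample std) across seeds."""
    # hits[motion_type][seed] collects the per-example correctness flags.
    hits = defaultdict(lambda: defaultdict(list))
    for seed, motion_type, correct in records:
        hits[motion_type][seed].append(1.0 if correct else 0.0)

    stats = {}
    for motion_type, by_seed in hits.items():
        seed_accuracies = [mean(flags) for flags in by_seed.values()]
        # Sample std needs at least two seeds; report 0.0 otherwise.
        spread = stdev(seed_accuracies) if len(seed_accuracies) > 1 else 0.0
        stats[motion_type] = (mean(seed_accuracies), spread)
    return stats
```

A breakdown of this shape would show whether gains concentrate on motion types that are geometrically easy to confuse or are spread evenly.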

minor comments (2)

[Introduction] The O-T-A paradigm is introduced without a formal definition or pseudocode for the reasoning block structure.
[Dataset] Dataset construction details (how the 38k RL samples were generated and filtered) are referenced but lack explicit statistics on geometric validity rates before/after filtering.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our experimental design and evidence presentation. We address each major comment below and commit to revisions that strengthen the isolation of RL's contribution to geometric reasoning.

Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that RL on the 38k feedback samples produces geometrically grounded inferences (rather than improved linguistic match to O-T-A chains) is load-bearing for the central contribution, yet no ablation is reported that trains the identical backbone on the 18k SFT chains alone and measures reduction in specific geometric errors (e.g., confusing pure translation with rotation).

Authors: We agree that an explicit SFT-only ablation is necessary to isolate RL's role in enforcing geometric constraints beyond surface-form matching to the O-T-A format. In the revised manuscript we will add this ablation: the Qwen2.5-VL-7B backbone will be trained solely on the 18k SFT reasoning chains, then evaluated on the same test set with a breakdown of error types (pure translation vs. rotation confusion, incorrect depth ordering, etc.). This will quantify the incremental reduction in geometrically inconsistent predictions attributable to the RL stage.

revision: yes

Referee: [Methods] Methods section: No quantitative verification is provided that the RL reward or feedback penalizes trajectories inconsistent with 3D camera models; the reported accuracy lifts are consistent with either geometric enforcement or better surface-form matching to the training distribution.

Authors: The RL feedback is generated by comparing generated O-T-A trajectories against reference chains that encode explicit geometric relations (e.g., optical-flow direction, vanishing-point shifts, and parallax cues). We will add a quantitative verification subsection that measures the fraction of post-RL outputs violating basic 3D camera-model constraints (e.g., inconsistent epipolar geometry or impossible motion vectors) on a held-out set of 500 samples, comparing pre- and post-RL rates to demonstrate that the reward indeed penalizes geometrically invalid reasoning rather than merely improving linguistic fidelity.

revision: yes

Referee: [Results] Results section: The accuracy improvements (73.8%→78.4% binary, 60.9%→74.5% VQA) are presented without error bars, multiple random seeds, or statistical significance tests, and without breakdown by motion type, making it impossible to assess robustness or whether gains concentrate on geometrically distinct cases.

Authors: We will revise the Results section to report means and standard deviations over three independent random seeds, include error bars on all bar plots, apply statistical significance tests (paired t-test and McNemar's test), and provide per-motion-type breakdowns (translation, rotation, zoom, combined motions). This will allow readers to verify that gains are concentrated on geometrically distinct cases rather than uniform distributional improvements.

revision: yes
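
McNemar's test, one of the tests named in the last response, compares two models on the same test items using only the items where they disagree. Below is a minimal exact (binomial) version built on the standard library; the function name and argument layout are illustrative, and the rebuttal does not say which variant of the test would be used.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items the backbone got right and the tuned model got wrong.
    c: items the backbone got wrong and the tuned model got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Items both models get right or both get wrong do not enter the statistic, so the test speaks directly to whether the tuned model fixes more errors than it introduces.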

Circularity Check

0 steps flagged

No circularity: new data and RL application remain independent of inputs

The derivation chain consists of constructing an external 18k+38k trajectory dataset, applying the O-T-A format, and running RL on the public Qwen2.5-VL-7B backbone. All reported accuracy gains are presented as measured outcomes on separate benchmarks rather than quantities that reduce by definition or self-citation to the training chains themselves. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text that would collapse the central claim into its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that human-written reasoning chains capture geometric truth and that RL can align model outputs to them without introducing new artifacts.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

O-T-A paradigm... RL for logical alignment in camera movement understanding... 18k SFT reasoning chains and 38k RL feedback samples

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
cs.CV · 2026-05 · unverdicted · novelty 7.0
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
cs.CV · 2026-04 · unverdicted · novelty 6.0
EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
cs.CV · 2026-04 · unverdicted · novelty 6.0
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 3 Pith papers · 19 internal anchors

[1]

The anatomy of video editing: A dataset and benchmark suite for ai-assisted video editing, 2022

Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, and In So Kweon. The anatomy of video editing: A dataset and benchmark suite for ai-assisted video editing, 2022. URLhttps://arxiv.org/abs/2207.09812

work page arXiv 2022
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025

Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancingmultimodalreasoning: Fromoptimizedcoldstarttostagedreinforcementlearning.arXivpreprintarXiv:2506.04207, 2025

work page arXiv 2025
[5]

Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping.arXiv preprint arXiv:2510.08457, 2025

work page arXiv 2025
[6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

work page 2023
[8]

IEEE TPAMI29(6), 1052–1067 (2007).https://doi.org/10

Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. Monoslam: Real-time single camera slam.IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007. doi: 10.1109/TPAMI.2007.1049

work page doi:10.1109/tpami.2007.1049 2007
[9]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024

work page internal anchor Pith review arXiv 2024
[10]

Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images.arXiv preprint arXiv:2510.11718, 2025

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, et al. Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images.arXiv preprint arXiv:2510.11718, 2025

work page arXiv 2025
[11]

LSD-SLAM:Large-ScaleDirectMonocularSLAM

JakobEngel,ThomasSchöps,andDanielCremers. LSD-SLAM:Large-ScaleDirectMonocularSLAM. InEuropeanConference on Computer Vision (ECCV), volume 8690 ofLecture Notes in Computer Science, pages 834–849, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10604-5. doi: 10.1007/978-3-319-10605-2

work page doi:10.1007/978-3-319-10605-2 2014
[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

work page arXiv 2025
[15]

Geometry-guided camera motion understanding in videollms.arXiv preprint arXiv:2603.13119, 2026

Haoan Feng, Sri Harsha Musunuri, and Guan-Ming Su. Geometry-guided camera motion understanding in videollms.arXiv preprint arXiv:2603.13119, 2026. 12

work page arXiv 2026
[16]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

work page arXiv 2025
[19]

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models, 2025. URL https://arxiv.org/abs/2501.02955

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Movienet: A holistic dataset for movie under- standing, 2020

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding, 2020. URLhttps://arxiv.org/abs/2007.10937

work page arXiv 2020
[21]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint arXiv:2412.09621, 2024

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos, 2025. URLhttps://arxiv.org/abs/2412.09621

work page arXiv 2025
[24]

Veu-bench: Towards comprehensive understanding of video editing

Bozheng Li, Yongliang Wu, Yi Lu, Jiashuo Yu, Licheng Tang, Jiawang Cao, Wenqing Zhu, Yuyang Sun, Jay Wu, and Wenbo Zhu. Veu-bench: Towards comprehensive understanding of video editing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13671–13680, 2025

work page 2025
[25]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models.arXiv preprint arXiv:2407.07895, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024

work page 2024
[28]

Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

work page arXiv 2025
[29]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

work page 2024
[30]

arXiv preprint arXiv:2504.15376 , year=

Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, et al. Towards understanding camera motions in any video.arXiv preprint arXiv:2504.15376, 2025

work page arXiv 2025
[31]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conf...

work page 2024
[32]

Shotbench: Expert-level cinematic understanding in vision-language models, 2025

Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, and Ziwei Liu. Shotbench: Expert-level cinematic understanding in vision-language models, 2025. URLhttps://arxiv.org/abs/2506.21356

work page arXiv 2025
[33]

Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

work page arXiv 2024
[34]

a1: Steep test-time scaling law via environment augmented generation.arXiv preprint arXiv:2504.14597, 2025

Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, and Xueqi Cheng. a1: Steep test-time scaling law via environment augmented generation.arXiv preprint arXiv:2504.14597, 2025

work page arXiv 2025
[35]

A Survey of Context Engineering for Large Language Models

Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URLhttps://arxiv.org/abs/2507.13334. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

work page arXiv 2025
[37]

A unified framework for shot type classification based on subject centric lens, 2020

Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A unified framework for shot type classification based on subject centric lens, 2020. URLhttps://arxiv.org/abs/2008.03548

work page arXiv 2020
[38]

Schonberger and Jan-Michael Frahm

Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

work page 2016
[39]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Reinforcement fine-tuning powers reasoning ca- pability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

work page arXiv 2025
[42]

Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, et al. Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025

work page arXiv 2025
[43]

Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21686–21697, 2024

work page 2024
[44]

Tarsier: Recipes for training and evaluating large video description models

JiaweiWang,LipingYuan,YuchenZhang,andHaomiaoSun. Tarsier: Recipesfortrainingandevaluatinglargevideodescription models, 2024. URLhttps://arxiv.org/abs/2407.00634

[45]

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025

[46]

Refineshot: Rethinking cinematography understanding with foundational skill evaluation

Hang Wu, Yujun Cai, Haonan Ge, Hongkai Chen, Ming-Hsuan Yang, and Yiwei Wang. Refineshot: Rethinking cinematography understanding with foundational skill evaluation. arXiv preprint arXiv:2510.02423, 2025

[47]

Thinking with sound: Audio chain-of-thought enables multimodal reasoning in large audio-language models

Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, and Yiwei Wang. Thinking with sound: Audio chain-of-thought enables multimodal reasoning in large audio-language models. arXiv preprint arXiv:2509.21749, 2025

[48]

Seg-r1: Segmentation can be surprisingly simple with reinforcement learning

Zuyao You and Zuxuan Wu. Seg-r1: Segmentation can be surprisingly simple with reinforcement learning. arXiv preprint arXiv:2506.22624, 2025

[49]

Perception-r1: Pioneering perception policy with reinforcement learning

En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954, 2025

[50]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data, 2025. URL https://arxiv.org/abs/2410.02713

[51]

Reinforced mllm: A survey on rl-based reasoning in multimodal large language models

Guanghao Zhou, Panjia Qiu, Cen Chen, Jie Wang, Zheming Yang, Jian Xu, and Minghui Qiu. Reinforced mllm: A survey on rl-based reasoning in multimodal large language models. arXiv preprint arXiv:2504.21277, 2025

[52]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images, 2018. URL https://arxiv.org/abs/1805.09817

[53]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. URL https://arxiv.org/abs/2504.10479.

Figure 5. Distribution of camera movement categories in CamReasoning-SFT-18k. The dataset encompasses a diverse range of cinematographic motions, with a primary focus on dynamic rotations and stable...


[1]

The anatomy of video editing: A dataset and benchmark suite for AI-assisted video editing

Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, and In So Kweon. The anatomy of video editing: A dataset and benchmark suite for ai-assisted video editing, 2022. URL https://arxiv.org/abs/2207.09812

[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...


[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...


[4]

Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning

Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025

[5]

Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping

Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping. arXiv preprint arXiv:2510.08457, 2025

[6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yiming Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Ji...


[7]

Instructblip: Towards general-purpose vision-language models with instruction tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems, 36:49250–49267, 2023

[8]

Monoslam: Real-time single camera slam

Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007. doi: 10.1109/TPAMI.2007.1049

[9]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024

[10]

Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, et al. Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images. arXiv preprint arXiv:2510.11718, 2025

[11]

LSD-SLAM: Large-Scale Direct Monocular SLAM

Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-Scale Direct Monocular SLAM. In European Conference on Computer Vision (ECCV), volume 8690 of Lecture Notes in Computer Science, pages 834–849, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10604-5. doi: 10.1007/978-3-319-10605-2

[12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

[14]

Sophiavl-r1: Reinforcing mllms reasoning with thinking reward

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward. arXiv preprint arXiv:2505.17018, 2025

[15]

Geometry-guided camera motion understanding in videollms

Haoan Feng, Sri Harsha Musunuri, and Guan-Ming Su. Geometry-guided camera motion understanding in videollms. arXiv preprint arXiv:2603.13119, 2026

[16]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

[17]

OneThinker: All-in-one Reasoning Model for Image and Video

Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video. arXiv preprint arXiv:2512.03043, 2025

[18]

Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved video reasoning via reinforcement learning. arXiv preprint arXiv:2509.24008, 2025

[19]

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models, 2025. URL https://arxiv.org/abs/2501.02955

[20]

Movienet: A holistic dataset for movie understanding

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding, 2020. URL https://arxiv.org/abs/2007.10937

[21]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

[22]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

[23]

Stereo4d: Learning how things move in 3d from internet stereo videos

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos, 2025. URL https://arxiv.org/abs/2412.09621

[24]

Veu-bench: Towards comprehensive understanding of video editing

Bozheng Li, Yongliang Wu, Yi Lu, Jiashuo Yu, Licheng Tang, Jiawang Cao, Wenqing Zhu, Yuyang Sun, Jay Wu, and Wenbo Zhu. Veu-bench: Towards comprehensive understanding of video editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13671–13680, 2025

[25]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

[26]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025

[27]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024

[28]

Star-r1: Spatial transformation reasoning by reinforcing multimodal llms

Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms. arXiv preprint arXiv:2505.15804, 2025

[29]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26689–26699, 2024

[30]

Towards understanding camera motions in any video

Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, et al. Towards understanding camera motions in any video. arXiv preprint arXiv:2504.15376, 2025

[31]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conf...

[32]

Shotbench: Expert-level cinematic understanding in vision-language models

Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, and Ziwei Liu. Shotbench: Expert-level cinematic understanding in vision-language models, 2025. URL https://arxiv.org/abs/2506.21356

[33]

Ovis: Structural embedding alignment for multimodal large language model

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024

[34]

a1: Steep test-time scaling law via environment augmented generation

Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Yuyao Ge, Jun Wan, Yurong Wu, and Xueqi Cheng. a1: Steep test-time scaling law via environment augmented generation. arXiv preprint arXiv:2504.14597, 2025

[35]

A Survey of Context Engineering for Large Language Models

Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URL https://arxiv.org/abs/2507.13334

[36]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579, 2025

[37]

A unified framework for shot type classification based on subject centric lens

Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A unified framework for shot type classification based on subject centric lens, 2020. URL https://arxiv.org/abs/2008.03548

[38]

Structure-from-motion revisited

Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
