CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
Pith reviewed 2026-05-16 09:59 UTC · model grok-4.3
The pith
CamReasoner reframes camera movement understanding as an explicit Observation-Thinking-Answer process reinforced by RL to ground inferences in geometric structure rather than visual patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CamReasoner reformulates camera movement understanding as a structured inference process using the Observation-Thinking-Answer paradigm. It builds a Large-scale Inference Trajectory Suite containing 18k SFT reasoning chains and 38k RL feedback samples. The method applies RL for logical alignment so that motion inferences rest on explicit visual reasoning rather than guesswork. When applied to Qwen2.5-VL-7B, the resulting model shows higher accuracy on binary classification and VQA benchmarks for camera dynamics.
What carries the argument
The Observation-Thinking-Answer (O-T-A) paradigm, which inserts an explicit reasoning block between observation and answer and is reinforced through RL on structured trajectories.
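To make the idea of an explicit reasoning block concrete, here is a minimal sketch of how an O-T-A response could be parsed and checked for structure. The tag names `<observation>`, `<think>`, and `<answer>`, along with `parse_ota` and `format_score`, are assumptions made for illustration; this page does not state the markup the paper actually uses or how its reward checks format.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical tag names; the source does not specify the real markup.
_OTA_PATTERN = re.compile(
    r"\s*<observation>(?P<observation>.*?)</observation>\s*"
    r"<think>(?P<thinking>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


class OTA(NamedTuple):
    observation: str
    thinking: str
    answer: str


def parse_ota(response: str) -> Optional[OTA]:
    """Return the three blocks if each appears exactly once and in order."""
    for tag in ("observation", "think", "answer"):
        if response.count(f"<{tag}>") != 1:
            return None
    match = _OTA_PATTERN.fullmatch(response)
    if match is None:
        return None
    return OTA(*(match.group(name).strip() for name in OTA._fields))


def format_score(response: str) -> float:
    """Hypothetical structure check: 1.0 if well-formed with a non-empty answer."""
    parsed = parse_ota(response)
    return 1.0 if parsed is not None and parsed.answer else 0.0
```

A check like this only verifies structure. It says nothing about whether the observation or the reasoning is geometrically correct, which is the question the rest of this review turns on.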
If this is right
- The model can separate physically distinct motions that produce similar-looking image sequences.
- Motion inferences become traceable to explicit spatio-temporal observations instead of unstated priors.
- Performance gains appear consistently across both classification and open-ended VQA tasks for camera dynamics.
- The same trajectory construction and RL alignment can be reused on other video understanding problems that require spatial logic.
Where Pith is reading between the lines
- Explicit reasoning blocks could allow downstream systems to inspect or correct the model's geometric assumptions before using the answer.
- The method points toward using RL to enforce logical constraints across a wider range of multimodal spatial tasks.
- Pairing the generated reasoning chains with 3D reconstruction algorithms would provide an independent check on whether the stated geometry matches the actual scene.
Load-bearing premise
That RL feedback on the reasoning chains teaches genuine geometric understanding of camera motions rather than teaching the model to output text that matches the training distribution.
What would settle it
Measure accuracy on a held-out set of camera sequences whose geometric properties, such as novel pan-tilt-roll combinations, lie outside the patterns present in the 18k training trajectories; if accuracy falls to the original backbone level, the claim fails.
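One way to set up this test is sketched below: keep only evaluation clips whose combination of motion primitives never appears in training, then compute accuracy on that subset for both the fine-tuned model and the backbone. The record layout, the `motions` field, and both function names are hypothetical; the source does not describe an evaluation harness.

```python
from typing import Dict, FrozenSet, List, Set

Combo = FrozenSet[str]


def held_out_subset(train_combos: Set[Combo], eval_records: List[dict]) -> List[dict]:
    """Keep records whose motion combination is absent from the training set."""
    return [r for r in eval_records if frozenset(r["motions"]) not in train_combos]


def accuracy(records: List[dict], predictions: Dict[str, str]) -> float:
    """Fraction of records whose predicted label matches the ground truth."""
    if not records:
        raise ValueError("no records to score")
    correct = sum(predictions[r["id"]] == r["label"] for r in records)
    return correct / len(records)
```

Running `accuracy` on the same held-out subset with predictions from the backbone and from the fine-tuned model gives the two numbers the test compares.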
Original abstract
Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to articulate spatio-temporal observations and reason about motion patterns within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. To the best of our knowledge, we are the first to employ RL for logical alignment in camera movement understanding, ensuring motion inferences are grounded in structured visual reasoning rather than contextual guesswork. Built upon Qwen2.5-VL-7B, CamReasoner-7B improves binary classification accuracy from 73.8% to 78.4% and VQA accuracy from 60.9% to 74.5% over its backbone, consistently outperforming both proprietary and open-source baselines across multiple benchmarks.
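For reference, the reported gains can be restated as absolute (percentage-point) and relative improvements. The snippet below uses only the numbers in the abstract; the helper name `gain` is illustrative.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


print(gain(73.8, 78.4))  # binary classification -> (4.6, 6.2)
print(gain(60.9, 74.5))  # VQA -> (13.6, 22.3)
```

The VQA gain is about three times the binary-classification gain in absolute terms, so the two should not be read as equally strong evidence.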
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CamReasoner, a framework that reformulates camera movement understanding as a structured O-T-A (Observation-Thinking-Answer) inference process. It constructs a dataset of 18k SFT reasoning chains and 38k RL feedback samples, applies RL for logical alignment on the Qwen2.5-VL-7B backbone, and reports accuracy gains from 73.8% to 78.4% on binary classification and 60.9% to 74.5% on VQA, claiming to be the first to use RL to ground motion inferences in geometric reasoning rather than superficial patterns.
Significance. If the central claim holds and RL is shown to enforce geometric constraints rather than distributional matching, the work would advance video spatial intelligence by moving multimodal models beyond black-box classification toward explicit spatio-temporal reasoning. The scale of the constructed trajectory suite and the reported gains over both open and proprietary baselines would be notable contributions, but the current evidence does not yet isolate the mechanism.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: The claim that RL on the 38k feedback samples produces geometrically grounded inferences (rather than improved linguistic match to O-T-A chains) is load-bearing for the central contribution, yet no ablation is reported that trains the identical backbone on the 18k SFT chains alone and measures reduction in specific geometric errors (e.g., confusing pure translation with rotation).
- [Methods] Methods section: No quantitative verification is provided that the RL reward or feedback penalizes trajectories inconsistent with 3D camera models; the reported accuracy lifts are consistent with either geometric enforcement or better surface-form matching to the training distribution.
- [Results] Results section: The accuracy improvements (73.8%→78.4% binary, 60.9%→74.5% VQA) are presented without error bars, multiple random seeds, or statistical significance tests, and without breakdown by motion type, making it impossible to assess robustness or whether gains concentrate on geometrically distinct cases.
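The robustness reporting requested in the last comment could take roughly the following shape: accuracy is computed per seed and per motion type, then summarized as a mean and sample standard deviation across seeds. The input layout (an iterable of `(seed, motion_type, correct)` tuples) and the function name are assumptions, not something the paper describes.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, Tuple


def per_type_summary(
    results: Iterable[Tuple[int, str, bool]],
) -> Dict[str, Tuple[float, float]]:
    """Map each motion type to (mean accuracy, std across seeds)."""
    # hits[motion_type][seed] = [num_correct, num_total]
    hits = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for seed, motion_type, correct in results:
        cell = hits[motion_type][seed]
        cell[0] += int(correct)
        cell[1] += 1

    summary = {}
    for motion_type, by_seed in hits.items():
        accs = [c / n for c, n in by_seed.values()]
        # Sample std needs at least two seeds; report 0.0 otherwise.
        spread = stdev(accs) if len(accs) > 1 else 0.0
        summary[motion_type] = (mean(accs), spread)
    return summary
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so it is best read as a rough indication of spread.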
minor comments (2)
- [Introduction] The O-T-A paradigm is introduced without a formal definition or pseudocode for the reasoning block structure.
- [Dataset] Dataset construction details (how the 38k RL samples were generated and filtered) are referenced but lack explicit statistics on geometric validity rates before/after filtering.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of our experimental design and evidence presentation. We address each major comment below and commit to revisions that strengthen the isolation of RL's contribution to geometric reasoning.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that RL on the 38k feedback samples produces geometrically grounded inferences (rather than improved linguistic match to O-T-A chains) is load-bearing for the central contribution, yet no ablation is reported that trains the identical backbone on the 18k SFT chains alone and measures reduction in specific geometric errors (e.g., confusing pure translation with rotation).
Authors: We agree that an explicit SFT-only ablation is necessary to isolate RL's role in enforcing geometric constraints beyond surface-form matching to the O-T-A format. In the revised manuscript we will add this ablation: the Qwen2.5-VL-7B backbone will be trained solely on the 18k SFT reasoning chains, then evaluated on the same test set with a breakdown of error types (pure translation vs. rotation confusion, incorrect depth ordering, etc.). This will quantify the incremental reduction in geometrically inconsistent predictions attributable to the RL stage.
Revision: yes
-
Referee: [Methods] Methods section: No quantitative verification is provided that the RL reward or feedback penalizes trajectories inconsistent with 3D camera models; the reported accuracy lifts are consistent with either geometric enforcement or better surface-form matching to the training distribution.
Authors: The RL feedback is generated by comparing generated O-T-A trajectories against reference chains that encode explicit geometric relations (e.g., optical-flow direction, vanishing-point shifts, and parallax cues). We will add a quantitative verification subsection that measures, on a held-out set of 500 samples, the fraction of outputs violating basic 3D camera-model constraints (e.g., inconsistent epipolar geometry or impossible motion vectors) before and after RL, to demonstrate that the reward penalizes geometrically invalid reasoning rather than merely improving linguistic fidelity.
Revision: yes
-
Referee: [Results] Results section: The accuracy improvements (73.8%→78.4% binary, 60.9%→74.5% VQA) are presented without error bars, multiple random seeds, or statistical significance tests, and without breakdown by motion type, making it impossible to assess robustness or whether gains concentrate on geometrically distinct cases.
Authors: We will revise the Results section to report means and standard deviations over three independent random seeds, include error bars on all bar plots, apply statistical significance tests (paired t-test and McNemar's test), and provide per-motion-type breakdowns (translation, rotation, zoom, combined motions). This will allow readers to check whether gains are concentrated on geometrically distinct cases rather than spread uniformly across the distribution.
Revision: yes
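Of the tests named in the last response, McNemar's test is the one designed for comparing two models on the same items, because it looks only at the examples on which they disagree. A minimal exact (binomial) version using only the standard library might look like this; the function name and the boolean-sequence inputs are illustrative choices, not taken from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-item correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have equal length")

    # Discordant pairs: exactly one of the two models is right.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree

    k = min(only_a, only_b)
    # Under H0 each discordant pair favours either model with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Here `correct_a` and `correct_b` would hold per-example correctness for the backbone and the fine-tuned model on the same test set. A small p-value indicates that the disagreements favour one model more than chance would explain.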
Circularity Check
No circularity: new data and RL application remain independent of inputs
Full rationale
The derivation chain consists of constructing an external 18k+38k trajectory dataset, applying the O-T-A format, and running RL on the public Qwen2.5-VL-7B backbone. All reported accuracy gains are presented as measured outcomes on separate benchmarks rather than quantities that reduce by definition or self-citation to the training chains themselves. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text that would collapse the central claim into its own inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Tag: unclear (relation between the paper passage and the cited Recognition theorem)
Passage: O-T-A paradigm... RL for logical alignment in camera movement understanding... 18k SFT reasoning chains and 38k RL feedback samples
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
-
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.