Geo-Align: Video Generation Alignment via Metric Geometry Reward

Chunhua Shen; Haoyu Guo; Runzhe Teng; Tong He; Zizun Li

arxiv: 2605.23903 · v1 · pith:OLJDFC2Znew · submitted 2026-05-22 · 💻 cs.CV

Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He

Pith reviewed 2026-05-25 04:14 UTC · model grok-4.3

keywords: video generation, camera control, reinforcement learning, 3D geometry, reward mechanism, video re-rendering, trajectory alignment

The pith

A reinforcement learning framework uses metric 3D trajectory extraction to align camera paths in generated videos without paired real data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning can refine camera-controlled video re-rendering by adding a reward that penalizes mismatches between target and generated camera trajectories. It builds this reward from a metric 3D estimator applied directly to the output videos and pairs real conditioning videos with synthetic target trajectories to avoid needing synchronized multi-view training pairs. A sympathetic reader would care because existing supervised fine-tuning approaches trained on synthetic data generalize poorly to real videos and fail to maintain accurate physical scales and paths.

Core claim

Geo-Align is the first reinforcement learning framework for camera-controlled video re-rendering. Built on a pretrained model, it optimizes via a scale-aware perceptual reward that deploys a metric 3D estimator to extract camera trajectories from generated videos and explicitly penalizes deviations in rotation and translation. A data pipeline that combines real-world conditioning videos with target trajectories from synthetic data removes dependence on paired training examples. Experiments show the resulting model exceeds supervised learning baselines on both precise camera controllability and visual fidelity.

What carries the argument

The metric 3D estimator that pulls precise camera trajectories out of generated videos to form a penalizing reward on rotation and translation errors.
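The abstract gives no formula for this reward, so the following is only a minimal sketch of what a rotation-and-translation penalty could look like. It assumes per-frame poses are already available as rotation matrices and metric positions; the function names, the geodesic-angle and Euclidean-distance error measures, and the equal default weights are illustrative assumptions, not the paper's definition.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major
Vector = Sequence[float]            # 3-vector, assumed to be in metres
Pose = Tuple[Matrix, Vector]        # one camera pose per frame


def rotation_error_deg(r_target: Matrix, r_est: Matrix) -> float:
    """Geodesic angle between two rotations, in degrees."""
    # trace(R_target^T @ R_est) equals the sum of elementwise products.
    trace = sum(r_target[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def translation_error(t_target: Vector, t_est: Vector) -> float:
    """Euclidean distance between camera positions, in the input units."""
    return math.dist(t_target, t_est)


def trajectory_reward(target: List[Pose], estimated: List[Pose],
                      w_rot: float = 1.0, w_trans: float = 1.0) -> float:
    """Negative weighted mean pose error (hypothetical weights); 0 is best."""
    if not target or len(target) != len(estimated):
        raise ValueError("trajectories must be non-empty and equally long")
    n = len(target)
    rot = sum(rotation_error_deg(a[0], b[0]) for a, b in zip(target, estimated)) / n
    trans = sum(translation_error(a[1], b[1]) for a, b in zip(target, estimated)) / n
    return -(w_rot * rot + w_trans * trans)


def rot_z(deg: float) -> List[List[float]]:
    """Rotation about the z-axis, used only to build the toy example."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
target = [(identity, [0.0, 0.0, 0.0]), (rot_z(10.0), [0.0, 0.0, 1.0])]
drifted = [(identity, [0.0, 0.0, 0.0]), (rot_z(14.0), [0.0, 0.0, 1.5])]
print(trajectory_reward(target, target))   # perfect adherence
print(trajectory_reward(target, drifted))  # lower (more negative) reward
```

Because the translation term is measured in metric units, a video that follows the right path shape at the wrong scale is still penalized, which is the sense in which such a reward would be scale-aware. The weights also have to reconcile degrees with metres, so their values matter in practice.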

If this is right

Camera trajectories in generated videos adhere more closely to targets than those from supervised fine-tuning on synthetic data.
Visual quality improves at the same time as geometric accuracy without requiring additional paired real-world training data.
The model handles out-of-distribution real-world conditioning videos more reliably than prior supervised approaches.
The same reward construction can be applied on top of any existing pretrained video generation model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-based reward could be adapted to enforce consistency on other geometric properties such as object sizes or scene layout.
Replacing the reinforcement learning loop with direct regression on the extracted trajectories might simplify training while retaining the geometric signal.
The synthetic-target plus real-conditioning pipeline offers a template for incorporating geometric constraints into other video synthesis tasks.

Load-bearing premise

A metric 3D estimator can reliably extract precise camera trajectories from the generated videos to compute an effective penalizing reward signal.

What would settle it

An independent metric 3D estimator applied to videos produced by the trained model finds no reduction in rotation or translation error relative to supervised baselines.
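One way such a check could be run, sketched under assumptions: score every test case with an independent estimator for both the baseline and the trained model, then bootstrap the paired difference. The function name, the percentile bootstrap, and the toy error values below are hypothetical and are not taken from the paper.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def paired_bootstrap_ci(baseline_err: Sequence[float], model_err: Sequence[float],
                        n_boot: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Mean error reduction (baseline - model) with a percentile bootstrap CI."""
    if not baseline_err or len(baseline_err) != len(model_err):
        raise ValueError("need equally long, non-empty error lists")
    # Positive differences mean the trained model has the lower error.
    diffs = [b - m for b, m in zip(baseline_err, model_err)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample test cases with replacement and record each resampled mean.
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return fmean(diffs), lo, hi


# Toy per-case rotation errors in degrees (made up for illustration only).
baseline = [3.1, 2.8, 4.0, 3.5, 2.9]
model = [2.0, 2.9, 3.1, 2.6, 2.4]
mean_gain, ci_low, ci_high = paired_bootstrap_ci(baseline, model)
print(mean_gain, ci_low, ci_high)
```

If the interval for the error reduction includes zero or lies below it, the controllability claim would not be supported on that test set; an interval clearly above zero would count in the paper's favour.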

Figures

Figures reproduced from arXiv: 2605.23903 by Chunhua Shen, Haoyu Guo, Runzhe Teng, Tong He, Zizun Li.

**Figure 2.** Geo-Align pipeline. Given a conditioning video, we sample a camera trajectory from other camera-annotated data and scale it to a plausible range, with the scaling factor drawn from a truncated Gaussian distribution. After the model generates a set of rollout videos, a metric 3D evaluator assesses the camera trajectory of each sample to compute geometry rewards. Finally, the model is optimized via Group Rel…

**Figure 3.** Qualitative results on the DAVIS [10] dataset. Geo-Align demonstrates superior capabilities in maintaining geometric consistency between the foreground subject and the background, whereas other methods suffer from varying degrees of distortion.

**Figure 4.** More visualization results on the CityWalk [11] dataset. For each example, the top row illustrates the input video, whereas the bottom row visualizes our results following the target trajectory.

4.3 Evaluation protocol: We follow the evaluation protocol of ReDirector [2], using 50 videos from the DAVIS dataset. By applying 10 ReCamMaster [1] camera trajectories per video, we construct 500 test cases with length…

**Figure 5.** More qualitative comparison on the DAVIS [10] dataset. For each example, the top row illustrates the input video, while the second and third rows present the results of ReDirector [2] and our model, respectively.

**Figure 6.** Failure case. The model remains susceptible to failure when faced with excessively fast rotations, large translations, or large foreground objects close to the camera.
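The pipeline in the Figure 2 caption can be sketched roughly as follows. The caption is cut off at "Group Rel…", so the group-wise normalization below is an assumption modeled on GRPO-style methods in the reference list, and it may not match the paper's actual update. The Gaussian mean, standard deviation, truncation bounds, and all function names are hypothetical placeholders.

```python
import random
import statistics
from typing import List, Sequence


def sample_truncated_gaussian(mean: float, std: float, low: float, high: float,
                              rng: random.Random, max_tries: int = 10_000) -> float:
    """Rejection-sample a Gaussian restricted to [low, high]."""
    if not low < high:
        raise ValueError("low must be smaller than high")
    for _ in range(max_tries):
        x = rng.gauss(mean, std)
        if low <= x <= high:
            return x
    raise RuntimeError("no sample fell inside the truncation range")


def scale_translations(translations: Sequence[Sequence[float]],
                       scale: float) -> List[List[float]]:
    """Scale camera positions; rotations are unaffected by a global scale."""
    return [[scale * c for c in t] for t in translations]


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Centre and normalise rewards within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
# Hypothetical distribution parameters; the caption does not state them.
scale = sample_truncated_gaussian(mean=1.0, std=0.5, low=0.5, high=2.0, rng=rng)
scaled_path = scale_translations([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]], scale)
# Made-up geometry rewards for four rollouts of the same conditioning video.
advantages = group_relative_advantages([-2.25, -0.4, -1.1, -3.0])
print(scale, scaled_path, advantages)
```

Normalizing within a group means each rollout is judged only against other rollouts for the same conditioning video and target trajectory, so an intrinsically hard trajectory does not drag down the learning signal for every sample in its group.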

Original abstract

Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Geo-Align tries an RL reward from a metric 3D estimator to fix camera control in video generation, but the abstract supplies zero numbers or validation so the claims stay untested.

The paper's main move is to replace supervised fine-tuning on synthetic pairs with an RL loop that scores generated videos using a metric 3D estimator. The estimator pulls out rotation and translation from the output itself and supplies a scale-aware penalty. They also describe a data pipeline that feeds real conditioning videos paired with synthetic target trajectories, which sidesteps the need for real multi-view pairs.

That pipeline choice is the clearest practical step forward in the abstract. It directly targets the generalization problem the authors flag with out-of-distribution real videos. If the estimator stays accurate on the model's own imperfect outputs, the reward could in principle give better trajectory adherence than pure SFT. The abstract positions this as the first RL treatment of the task, which matches the novelty claim.

The stress-test concern about estimator accuracy on generated content is not addressed in the provided text, and that component is load-bearing. The abstract asserts consistent outperformance on controllability and fidelity yet contains no metrics, no ablation tables, no error analysis, and no statement that the 3D estimator was checked on artifact-heavy model outputs. Without those, the central result cannot be evaluated.

The work is aimed at groups already running camera-controlled video models and looking for RL alignment options. A reader who wants to see whether the 3D-reward idea can be made to work would get value from the full experiments if they exist. The paper deserves a serious referee only if the manuscript supplies the missing quantitative results and estimator validation; on the abstract alone it does not.

Referee Report

2 major / 0 minor

Summary. The paper proposes Geo-Align, the first RL framework for camera-controlled video re-rendering built on a pretrained model. It introduces a scale-aware perceptual reward derived from a metric 3D estimator that extracts camera trajectories (rotation/translation) from the generated videos themselves to penalize deviations, combined with a data pipeline using real-world conditioning videos and synthetic target trajectories to avoid paired data. The central claim is that this yields consistent outperformance over supervised fine-tuning baselines in precise camera controllability and visual fidelity.

Significance. If the result holds with validated reward signals, the approach would address the scarcity of synchronized multi-view real-world data by shifting from SFT to RL with geometric rewards, potentially improving generalization to out-of-distribution inputs in video generation tasks.

major comments (2)

[Abstract] Abstract and Experiments section: The claim that 'Geo-Align consistently outperforms existing supervised learning baselines' is asserted without any quantitative metrics, tables of results, ablation studies, or error analysis, rendering the central empirical claim impossible to evaluate.
[Method] Method and Experiments: The reward mechanism depends on the metric 3D estimator remaining accurate and unbiased on imperfect, artifact-containing generated videos during RL training, yet no validation, robustness tests, or comparison of estimator performance on real vs. generated content is reported; any systematic bias would invalidate the controllability gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to improve the manuscript's clarity and rigor.

Referee: [Abstract] Abstract and Experiments section: The claim that 'Geo-Align consistently outperforms existing supervised learning baselines' is asserted without any quantitative metrics, tables of results, ablation studies, or error analysis, rendering the central empirical claim impossible to evaluate.

Authors: We agree that the abstract states the outperformance claim at a high level without supporting numbers. Although the experiments section presents comparative results, we will revise the abstract to include key quantitative metrics (e.g., trajectory error and fidelity scores) and expand the experiments with an explicit error analysis subsection and ablation tables to make the central claim directly evaluable.

Revision: yes

Referee: [Method] Method and Experiments: The reward mechanism depends on the metric 3D estimator remaining accurate and unbiased on imperfect, artifact-containing generated videos during RL training, yet no validation, robustness tests, or comparison of estimator performance on real vs. generated content is reported; any systematic bias would invalidate the controllability gains.

Authors: This is a substantive concern regarding potential bias in the reward signal. We will add a new subsection in the experiments validating the 3D estimator's accuracy and bias on generated videos (including comparisons to real videos and tests under artifact conditions) to confirm the reward's reliability.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external estimator and experiments

The paper describes an RL optimization loop that applies a pre-existing metric 3D estimator to generated video outputs in order to compute a reward penalizing trajectory deviations. This is a conventional reward-modeling step rather than a self-definitional or fitted-input reduction. No equations or claims in the abstract equate the reported controllability gains to quantities defined by the same fitted parameters or by self-citation chains. The central performance assertions are framed as experimental outcomes, not algebraic identities or renamed inputs. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract alone; no concrete free parameters, axioms, or invented entities can be identified. The metric 3D estimator is referenced but its internals and any assumptions are unknown.

pith-pipeline@v0.9.0 · 5725 in / 966 out tokens · 23368 ms · 2026-05-25T04:14:17.675130+00:00 · methodology

Reference graph

Works this paper leans on

[1] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14834–14844, 2025.

[2] Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung, and Jong Chul Ye. Redirector: Creating any-length video retakes with rotary camera encoding. arXiv preprint arXiv:2511.19827, 2025.

[3] Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 100–111, 2025.

[4] Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Reconstruct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos. arXiv preprint arXiv:2507.12646, 2025.

[5] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision, pages 313–331. Springer, 2024.

[6] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025.

[7] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025.

[8] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.

[9] Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025.

[10] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

[11] Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025.

[12] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22875–22889, 2025.

[13] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.

[14] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

[15] Hyojun Go, Byeongjun Park, Jiho Jang, Jin-Young Kim, Soonwoo Kwon, and Changick Kim. Splatflow: Multi-view rectified flow model for 3d gaussian splatting synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21524–21536, 2025.

[16] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2005–2015, 2025.

[17] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025.

[18] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In European Conference on Computer Vision, pages 18–35. Springer, 2024.

[19] Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20406–20417, 2024.

[21] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[22] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[23] Hyeonho Jeong, Suhyeon Lee, and Jong Chul Ye. Reangle-a-video: 4d video generation as video-to-video translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11164–11175, 2025.

[24] Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, et al. See4d: Pose-free 4d generation via auto-regressive video inpainting. arXiv preprint arXiv:2510.26796, 2025.

[25] David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, and Nataniel Ruiz. Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2050–2062, 2025.

[26] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International Workshop on Vision Algorithms, pages 298–372. Springer, 1999.

[27] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM SIGGRAPH 2006 Papers, pages 835–846, 2006.

[28] Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013, pages 127–134. IEEE, 2013.

[29] Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

[30] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

[31] Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer, 2025. URL https://arxiv.org/abs/2508.10893.

[32] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory, 2024. URL https://arxiv.org/abs/2408.16061.

[33] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer, 2026. URL https://arxiv.org/abs/2507.11539.

[34] Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos, 2025. URL https://arxiv.org/abs/2412.09401.

[35] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.

[36] Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. Wint3r: Window-based streaming reconstruction with camera token pool. arXiv preprint arXiv:2509.05296, 2025.

[37] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views, 2026. URL https://arxiv.org/abs/2502.12138.

[38] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass, 2025. URL https://arxiv.org/abs/2501.13928.

[39] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt's limits on kilometer-scale long rgb sequences, 2026. URL https://arxiv.org/abs/2507.16443.

[40] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024. URL https://arxiv.org/abs/2406.09756.

[41] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. URL https://arxiv.org/abs/2410.03825.

[43] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

[44] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347, 2025.

[45] Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

[46] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[47] Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, and Xipeng Qiu. Srpo: Self-referential policy optimization for vision-language-action models, 2025. URL https://arxiv.org/abs/2511.15605.

[48] Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again, 2025. URL https://arxiv.org/abs/2507.22058.

[49] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. Dancegrpo: Unleashing grpo on visual generation, 2025. URL https://arxiv.org/abs/2505.07818.

[50] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[51] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.

[52] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025.

[53] Haoyang He, Jay Patrikar, Dong-Ki Kim, Max Smith, Daniel McGann, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei, and Sebastian Scherer. Grndctrl: Grounding world models via self-supervised reward alignment. arXiv preprint arXiv:2512.01952, 2025.

[54] Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu, Jia-Wang Bian, and Changhu Wang. Taming camera-controlled video generation with verifiable geometry reward. arXiv preprint arXiv:2512.02870, 2025.

[55] Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, and Ying-Cong Chen. Campilot: Improving camera control in video diffusion model with efficient camera reward feedback. URL https://arxiv.org/abs/2601.16214.

[57]

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y . Chen, Zhiyuan He, Yuqing Yang, and Bohan Zhuang. World-r1: Reinforcing 3d constraints for text-to-video generation, 2026. URLhttps://arxiv.org/abs/2604.24764

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Ic-world: In-context generation for shared world modeling

Fan Wu, Jiacheng Wei, Ruibo Li, Yi Xu, Junyou Li, Deheng Ye, and Guosheng Lin. Ic-world: In-context generation for shared world modeling, 2025. URL https://arxiv.org/abs/2512.02793
[59]

Epipolar geometry improves video generation models

Orest Kupyn, Fabian Manhardt, Federico Tombari, and Christian Rupprecht. Epipolar geometry improves video generation models, 2025. URL https://arxiv.org/abs/2510.21615
[60]

Vigor: Video geometry-oriented reward for temporal generative alignment

Tengjiao Yin, Jinglei Shi, Heng Guo, and Xi Wang. Vigor: Video geometry-oriented reward for temporal generative alignment, 2026. URL https://arxiv.org/abs/2603.16271
[61]

Vggrpo: Towards world-consistent video generation with 4d latent reward

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4d latent reward, 2026. URL https://arxiv.org/abs/2603.26599
[62]

Rlgf: Reinforcement learning with geometric feedback for autonomous driving video generation

Tianyi Yan, Wencheng Han, Xia Zhou, Xueyang Zhang, Kun Zhan, Cheng zhong Xu, and Jianbing Shen. Rlgf: Reinforcement learning with geometric feedback for autonomous driving video generation, 2025. URL https://arxiv.org/abs/2509.16500
[63]

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, and Yue Wang. Videogpa: Distilling geometry priors for 3d-consistent video generation, 2026. URL https://arxiv.org/abs/2601.23286
[64]

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling, 2025. URL https://arxiv.org/abs/2507.07982

[65]

Longcat-video technical report

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report. arXiv preprint arXiv:2510.22200, 2025
[66]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347
[67]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

[68]

ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

[69]

Met3r: Measuring multi-view consistency in generated images

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6034–6044, 2025
[70]

Steerx: Creating any camera-free 3d and 4d scenes with geometric steering

Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, and Changick Kim. Steerx: Creating any camera-free 3d and 4d scenes with geometric steering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27326–27337, 2025
[71]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
