Geo-Align: Video Generation Alignment via Metric Geometry Reward
Pith reviewed 2026-05-25 04:14 UTC · model grok-4.3
The pith
A reinforcement learning framework uses metric 3D trajectory extraction to align camera paths in generated videos without paired real data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geo-Align is the first reinforcement learning framework for camera-controlled video re-rendering. Built on a pretrained model, it optimizes via a scale-aware perceptual reward that deploys a metric 3D estimator to extract camera trajectories from generated videos and explicitly penalizes deviations in rotation and translation. A data pipeline that combines real-world conditioning videos with target trajectories from synthetic data removes dependence on paired training examples. Experiments show the resulting model exceeds supervised learning baselines on both precise camera controllability and visual fidelity.
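As a rough illustration of a reward that penalizes rotation and translation deviations, the sketch below scores an estimated trajectory against a target. The summary does not give the paper's actual reward formula, so the exponential shaping, the weights `w_rot` and `w_trans`, and the helper `rot_z` are all assumptions made for this example. Translation error is measured in the estimator's metric units with no scale alignment, which is one plausible reading of "scale-aware".

```python
import math

def rot_z(deg):
    """Hypothetical helper: rotation about the z axis, as a 3x3 nested list."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_angle_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the elementwise (Frobenius) inner product.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def translation_error(t_a, t_b):
    """Euclidean distance between two camera positions (metric units)."""
    return math.dist(t_a, t_b)

def trajectory_reward(pred, target, w_rot=1.0, w_trans=1.0):
    """pred and target are lists of (R, t) per frame; returns a value in (0, 1].

    The exp(-penalty) form and the weights are assumptions, not the paper's.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and equal length")
    n = len(pred)
    rot = sum(rotation_angle_deg(Rp, Rt) for (Rp, _), (Rt, _) in zip(pred, target)) / n
    trans = sum(translation_error(tp, tt) for (_, tp), (_, tt) in zip(pred, target)) / n
    return math.exp(-(w_rot * math.radians(rot) + w_trans * trans))
```

In this form a trajectory that matches the target in every frame scores close to 1, and the score decays smoothly as either error grows, which keeps the signal bounded for a policy-gradient update.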
What carries the argument
The metric 3D estimator that pulls precise camera trajectories out of generated videos to form a penalizing reward on rotation and translation errors.
If this is right
- Camera trajectories in generated videos adhere more closely to targets than those from supervised fine-tuning on synthetic data.
- Visual quality improves at the same time as geometric accuracy without requiring additional paired real-world training data.
- The model handles out-of-distribution real-world conditioning videos more reliably than prior supervised approaches.
- The same reward construction can be applied on top of any existing pretrained video generation model.
Where Pith is reading between the lines
- The same trajectory-based reward could be adapted to enforce consistency on other geometric properties such as object sizes or scene layout.
- Replacing the reinforcement learning loop with direct regression on the extracted trajectories might simplify training while retaining the geometric signal.
- The synthetic-target plus real-conditioning pipeline offers a template for incorporating geometric constraints into other video synthesis tasks.
Load-bearing premise
A metric 3D estimator can reliably extract precise camera trajectories from the generated videos to compute an effective penalizing reward signal.
What would settle it
An independent metric 3D estimator applied to videos produced by the trained model finds no reduction in rotation or translation error relative to supervised baselines.
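One way to run such a check is a paired comparison of per-video errors, measured by the independent estimator on the same inputs for both models. The sketch below computes the mean error reduction and a percentile bootstrap interval. The function name, the 2000 resamples, and the 95% level are illustrative choices, not anything taken from the paper.

```python
import random
import statistics

def paired_bootstrap_ci(baseline_err, model_err, n_boot=2000, alpha=0.05, seed=0):
    """Mean of (baseline - model) per-video errors with a percentile bootstrap CI.

    Positive values mean the model has lower error than the baseline.
    """
    if not baseline_err or len(baseline_err) != len(model_err):
        raise ValueError("error lists must be non-empty and equal length")
    diffs = [b - m for b, m in zip(baseline_err, model_err)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi
```

If the interval for rotation or translation error sits entirely above zero, the reduction survives the independent estimator. An interval that covers zero would be the outcome described above.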
Original abstract
Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Supervised Fine-Tuning using synthetic datasets. At present, there is an extreme scarcity of synchronized, multi-view real-world video data. Consequently, the prevailing paradigm often exhibits limited generalization when processing out-of-distribution real-world videos, with models struggling to accurately adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first Reinforcement Learning framework specifically designed for camera-controlled video re-rendering. Built upon a pretrained model, we optimize the model through a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from generated videos, explicitly penalizing deviations in rotation and translation. Furthermore, we meticulously designed a data pipeline strategy based on real-world conditioning videos and target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both precise camera controllability and visual fidelity, indicating the effectiveness of our method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Geo-Align, the first RL framework for camera-controlled video re-rendering built on a pretrained model. It introduces a scale-aware perceptual reward derived from a metric 3D estimator that extracts camera trajectories (rotation/translation) from the generated videos themselves to penalize deviations, combined with a data pipeline using real-world conditioning videos and synthetic target trajectories to avoid paired data. The central claim is that this yields consistent outperformance over supervised fine-tuning baselines in precise camera controllability and visual fidelity.
Significance. If the result holds with validated reward signals, the approach would address the scarcity of synchronized multi-view real-world data by shifting from SFT to RL with geometric rewards, potentially improving generalization to out-of-distribution inputs in video generation tasks.
Major comments (2)
- [Abstract] Abstract and Experiments section: The claim that 'Geo-Align consistently outperforms existing supervised learning baselines' is asserted without any quantitative metrics, tables of results, ablation studies, or error analysis, rendering the central empirical claim impossible to evaluate.
- [Method] Method and Experiments: The reward mechanism depends on the metric 3D estimator remaining accurate and unbiased on imperfect, artifact-containing generated videos during RL training, yet no validation, robustness tests, or comparison of estimator performance on real vs. generated content is reported; any systematic bias would invalidate the controllability gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions to improve the manuscript's clarity and rigor.
- Referee: [Abstract] Abstract and Experiments section: The claim that 'Geo-Align consistently outperforms existing supervised learning baselines' is asserted without any quantitative metrics, tables of results, ablation studies, or error analysis, rendering the central empirical claim impossible to evaluate.
  Authors: We agree that the abstract states the outperformance claim at a high level without supporting numbers. Although the experiments section presents comparative results, we will revise the abstract to include key quantitative metrics (e.g., trajectory error and fidelity scores) and expand the experiments with an explicit error analysis subsection and ablation tables to make the central claim directly evaluable.
  Revision: yes
- Referee: [Method] Method and Experiments: The reward mechanism depends on the metric 3D estimator remaining accurate and unbiased on imperfect, artifact-containing generated videos during RL training, yet no validation, robustness tests, or comparison of estimator performance on real vs. generated content is reported; any systematic bias would invalidate the controllability gains.
  Authors: This is a substantive concern regarding potential bias in the reward signal. We will add a new subsection in the experiments validating the 3D estimator's accuracy and bias on generated videos (including comparisons to real videos and tests under artifact conditions) to confirm the reward's reliability.
  Revision: yes
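A simple form such a validation could take is a scale-bias check: compare the length of the estimated camera path with a reference path whose poses are known. The sketch below is only an illustration; the function names are hypothetical. It assumes reference poses exist, for example from rendered synthetic clips. For generated videos, the commanded trajectory is the only available reference, so estimator error and generator error are confounded there.

```python
import math

def path_length(positions):
    """Total length of a camera path given per-frame (x, y, z) positions."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))

def scale_ratio(estimated, reference):
    """Estimated path length over reference path length (1.0 means no scale bias)."""
    ref_len = path_length(reference)
    if ref_len == 0.0:
        raise ValueError("reference path has zero length")
    return path_length(estimated) / ref_len

def mean_scale_bias(clips):
    """clips: list of (estimated_positions, reference_positions) pairs.

    Returns the mean signed deviation of the scale ratio from 1.0.
    """
    ratios = [scale_ratio(est, ref) for est, ref in clips]
    return sum(ratios) / len(ratios) - 1.0
```

Computing `mean_scale_bias` separately on clean clips and on clips with generation artifacts would show whether the estimator drifts systematically on the kind of content it sees during RL training.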
Circularity Check
No significant circularity; empirical claims rest on external estimator and experiments
The paper describes an RL optimization loop that applies a pre-existing metric 3D estimator to generated video outputs in order to compute a reward penalizing trajectory deviations. This is a conventional reward-modeling step rather than a self-definitional or fitted-input reduction. No equations or claims in the abstract equate the reported controllability gains to quantities defined by the same fitted parameters or by self-citation chains. The central performance assertions are framed as experimental outcomes, not algebraic identities or renamed inputs. The derivation therefore remains self-contained against external benchmarks.