
arxiv: 2605.18365 · v1 · pith:FAD6OOFHnew · submitted 2026-05-18 · 💻 cs.CV

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

Pith reviewed 2026-05-20 11:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords geometric consistency · video generation · text-to-video · reinforcement fine-tuning · optical flow · diffusion models · scene geometry · camera motion

The pith

A geometry-consistency reward makes scene geometry an explicit optimization target for text-to-video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-video diffusion models trained on web data generate motion that often breaks physical rules, such as objects stretching or backgrounds warping when the camera moves. The paper claims these failures happen because geometry is learned only as a side effect rather than as a direct goal. To fix this, the authors build a reward that checks whether background motion can be explained by rigid camera movement and whether moving objects keep their visual identity along their paths. The reward is computed by combining optical flow to measure pixel motion, depth and pose estimates to recover 3D structure, and feature matching to track object identity. When this reward is added to reinforcement fine-tuning, geometric consistency becomes something the model actively optimizes rather than a property left to emerge on its own.

Core claim

We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective.
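To make the rigid-flow idea concrete, here is a minimal sketch of how camera-induced flow can be derived from a depth map and a relative pose, and how a residual against estimated optical flow could be formed. It assumes a pinhole camera; the function names, the intrinsics matrix `K`, and the pose convention (`R`, `t` map frame-A camera coordinates into frame B) are illustrative choices, not the paper's implementation.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow induced by camera motion alone, assuming a static scene.

    depth: (H, W) depth of frame A; K: (3, 3) pinhole intrinsics;
    R (3, 3) and t (3,) map frame-A camera coordinates into frame B.
    Returns an (H, W, 2) array of (dx, dy) pixel displacements.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                   # back-project to unit depth
    pts_a = rays * depth[..., None]                   # 3D points in camera A
    pts_b = pts_a @ R.T + t                           # same points in camera B
    proj = pts_b @ K.T                                # project into image B
    uv_b = proj[..., :2] / proj[..., 2:3]
    return uv_b - pix[..., :2]

def flow_residual(optical_flow, depth, K, R, t):
    """Per-pixel magnitude of motion that the camera does not explain."""
    return np.linalg.norm(optical_flow - rigid_flow(depth, K, R, t), axis=-1)
```

As a sanity check on the geometry: with identity rotation, a sideways translation of 0.1 units, a focal length of 100 px, and constant depth 2, every pixel shifts by 100 × 0.1 / 2 = 5 px horizontally. A small residual in background regions would then indicate motion compatible with a rigid scene.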

What carries the argument

The geometry-consistency reward that separates rigid background regions from dynamic objects and scores each for physical plausibility using optical flow, depth-pose predictions, and feature-based correspondence.
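The dynamic-object half of the reward can be pictured as a feature comparison along flow trajectories. The sketch below is one possible reading, not the authors' code: it assumes dense per-pixel features (for example from a pretrained vision backbone), nearest-neighbour lookup along the flow, and a boolean mask of dynamic pixels. The name `appearance_consistency` and all of its arguments are hypothetical.

```python
import numpy as np

def appearance_consistency(feat_a, feat_b, flow, mask):
    """Mean cosine similarity between features matched along the flow.

    feat_a, feat_b: (H, W, C) per-pixel features of two frames;
    flow: (H, W, 2) displacement (dx, dy) from frame A to frame B;
    mask: (H, W) boolean, True where a pixel is treated as dynamic.
    """
    H, W, _ = feat_a.shape
    v, u = np.nonzero(mask)                        # rows and columns of masked pixels
    u_b = np.rint(u + flow[v, u, 0]).astype(int)   # nearest-neighbour target column
    v_b = np.rint(v + flow[v, u, 1]).astype(int)   # nearest-neighbour target row
    inside = (u_b >= 0) & (u_b < W) & (v_b >= 0) & (v_b < H)
    if not inside.any():
        return 0.0                                 # nothing to compare
    fa = feat_a[v[inside], u[inside]]
    fb = feat_b[v_b[inside], u_b[inside]]
    cos = np.sum(fa * fb, axis=-1) / (
        np.linalg.norm(fa, axis=-1) * np.linalg.norm(fb, axis=-1) + 1e-8)
    return float(cos.mean())
```

A score near 1 means the features a pixel carries at the start of its trajectory are still present where the flow says it ended up; identity drift or morphing would pull the score down.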

If this is right

  • Generated videos exhibit fewer temporal geometric artifacts including object stretching and texture drift under camera motion.
  • The method applies to diverse scenes containing both camera movement and independently moving objects.
  • The reward can be added to existing video generators without altering their core architecture.
  • Perceptual quality metrics remain comparable to the original models after fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of rigid and dynamic regions could be reused to create evaluation benchmarks that specifically target 3D consistency rather than just pixel-level similarity.
  • Combining this reward with other explicit signals such as lighting or material consistency might produce videos that obey multiple physical constraints at once.
  • Because the reward is model-agnostic, it could be applied during continued training of future larger-scale video generators to maintain consistency as model capacity grows.

Load-bearing premise

The approach assumes that optical flow, depth-pose predictions, and feature-based correspondence can reliably separate rigid background regions from dynamic objects and accurately evaluate their respective consistency.

What would settle it

Fine-tuning a video model with the reward and then measuring no reduction in geometric artifacts such as object deformation or background warping on held-out test videos would show the reward does not achieve its claimed effect.

Figures

Figures reproduced from arXiv: 2605.18365 by Boyang Deng, Gordon Wetzstein, Jan Ackermann, Shengqu Cai, Songyou Peng, Zhengfei Kuang.

Figure 1: Improving the geometric consistency of generated videos.
Figure 2: GeoFlow method overview. (A) A video diffusion model πθ generates G candidate videos from text prompts using the same initial noise. (B) A monocular depth model predicts depth maps, while an optical flow model estimates the flow between two frames. Rigid flow is derived from the predicted depth and compared with the estimated optical flow. The resulting residual flow is combined with the discrepancy betwee…
Figure 3: Qualitative comparison. Each row-pair shows sampled frames from a generated video for a different prompt, comparing a baseline model (top) with our fine-tuned model (bottom). The baselines exhibit a range of inconsistencies: objects dissolve or vanish (examples 1, 4), scene layouts shift unexpectedly (ex. 3), identities drift over time (ex. 5), and object structures morph beyond recognition (ex. 2, 3). Ou…
Original abstract

Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes, or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.
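The abstract does not say how the reward enters the fine-tuning update. The Figure 2 caption mentions G candidate videos per prompt, so one plausible route is to normalize rewards within each group of candidates, as GRPO-style methods do. The sketch below illustrates only that normalization step; treating it as the paper's update rule is an assumption, and the function name is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of G candidates generated for one prompt.

    Candidates that score above the group mean receive a positive
    advantage; those below it receive a negative one.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

Under this scheme only the relative ranking within a group matters, so the absolute scale of the geometry reward would not need calibrating across prompts.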

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GeoFlow, a geometry-consistency reward for text-to-video diffusion models. The reward operationalizes physical consistency by using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid background motion (explainable by camera pose) from dynamic object motion (appearance-preserving trajectories). This reward is combined with reinforcement fine-tuning to make geometric consistency an explicit optimization target rather than an emergent property. The method is presented as model-agnostic and applicable to dynamic scenes with both camera and object motion. Experiments are claimed to show substantial reductions in temporal geometric artifacts (object deformation, texture drift, non-rigid backgrounds) while preserving perceptual quality, with code and model weights released.

Significance. If the results hold, the work provides a concrete mechanism for injecting geometric priors into generative video models via RL, moving beyond implicit learning from web data. The reliance on off-the-shelf CV modules for the reward is a strength, as it avoids self-referential or fitted parameters and enables falsifiable evaluation. Releasing code and weights supports reproducibility. The approach could influence downstream applications requiring reliable 3D-consistent video, such as simulation or robotics, provided the reward signal proves robust.

major comments (2)
  1. [§3] §3 (Reward Formulation): The central claim requires that the geometry reward accurately scores consistency even when inputs contain the very artifacts it targets. Because the optical flow, depth-pose, and correspondence networks are pre-trained on real footage, their behavior on diffusion outputs with deformations and drift must be validated; otherwise the RL stage may optimize against predictor noise rather than scene geometry. The manuscript should include a controlled study (e.g., injecting known geometric errors and measuring reward correlation) to establish that the proxy remains informative.
  2. [§4] §4 (Experiments): The abstract asserts 'substantial reductions in temporal geometric artifacts' and preservation of perceptual quality, yet the strength of this claim depends on the specific metrics, baselines, and ablations reported. Quantitative results comparing against prior consistency methods, together with ablations isolating the rigid/dynamic separation and the RL reward weighting, are needed to confirm that improvements are attributable to the proposed objective rather than other factors.
minor comments (2)
  1. [§3] Notation for the combined reward (rigid flow term + dynamic trajectory term) should be introduced with an explicit equation early in §3 to aid readability.
  2. [Figures] Figure captions describing qualitative results should explicitly label which artifacts are reduced (e.g., 'non-rigid background warping') for direct comparison with the claims.
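For illustration of the first minor comment, an explicit combined reward could take a form like the one below. This is a hypothetical notation, not the paper's equation: $\mathcal{B}$ and $\mathcal{D}$ denote the rigid (background) and dynamic pixel sets, $\mathbf{f}$ the estimated optical flow, $\mathbf{f}_{\text{rigid}}$ the flow induced by depth and camera pose, $\phi_t$ the per-pixel features of frame $t$, and $\lambda_{\text{rigid}}, \lambda_{\text{dyn}}$ assumed weights.

```latex
% Hypothetical form; symbols and weights are assumptions, not taken from the paper.
R(V) \;=\;
  -\,\lambda_{\text{rigid}}\,
  \frac{1}{|\mathcal{B}|} \sum_{p \in \mathcal{B}}
  \bigl\lVert \mathbf{f}(p) - \mathbf{f}_{\text{rigid}}(p) \bigr\rVert_2
  \;+\;
  \lambda_{\text{dyn}}\,
  \frac{1}{|\mathcal{D}|} \sum_{p \in \mathcal{D}}
  \cos\!\bigl( \phi_t(p),\; \phi_{t+1}(p + \mathbf{f}(p)) \bigr)
```

The first term penalizes background motion that rigid camera flow cannot explain; the second rewards feature similarity along motion trajectories inside dynamic regions.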

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and constructive feedback on the manuscript. We address each major comment below and outline the revisions planned for the next version.

  1. Referee: [§3] §3 (Reward Formulation): The central claim requires that the geometry reward accurately scores consistency even when inputs contain the very artifacts it targets. Because the optical flow, depth-pose, and correspondence networks are pre-trained on real footage, their behavior on diffusion outputs with deformations and drift must be validated; otherwise the RL stage may optimize against predictor noise rather than scene geometry. The manuscript should include a controlled study (e.g., injecting known geometric errors and measuring reward correlation) to establish that the proxy remains informative.

    Authors: We agree that validating the reward's behavior on artifact-containing diffusion outputs is essential to ensure the RL stage optimizes for genuine geometric consistency rather than noise from the pre-trained predictors. In the revised manuscript we will add a controlled study: we will start from real videos, synthetically inject graded geometric errors (controlled object deformations, texture drifts, and non-rigid background motion), and report the Pearson correlation between the resulting reward scores and the magnitude of the injected errors. We will also quantify the accuracy drop of the off-the-shelf optical-flow, depth-pose, and correspondence modules when evaluated directly on generated samples. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts 'substantial reductions in temporal geometric artifacts' and preservation of perceptual quality, yet the strength of this claim depends on the specific metrics, baselines, and ablations reported. Quantitative results comparing against prior consistency methods, together with ablations isolating the rigid/dynamic separation and the RL reward weighting, are needed to confirm that improvements are attributable to the proposed objective rather than other factors.

    Authors: We acknowledge that stronger quantitative support and targeted ablations are needed to substantiate the claims. In the revised experiments section we will add direct comparisons against prior consistency-enhancement methods, reporting numerical scores on geometric-consistency metrics (rigid-region flow error, appearance-preservation along trajectories) and perceptual-quality metrics (CLIP similarity, FID, and a small-scale user study). We will also include ablations that separately disable the rigid/dynamic separation and vary the RL reward weighting, thereby isolating the contribution of the proposed objective. revision: yes
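The controlled study promised in the first response reduces to a small loop: corrupt a clean video at several error magnitudes, score each corrupted copy, and correlate score with magnitude. A minimal sketch follows, in which `reward_fn` and `inject_fn` are caller-supplied placeholders standing in for the actual reward and corruption procedures, neither of which is specified here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

def reward_error_correlation(reward_fn, clean_video, inject_fn, magnitudes):
    """Correlate reward with the magnitude of an injected geometric error.

    inject_fn(video, m) returns a corrupted copy of the video at
    magnitude m; reward_fn(video) returns a scalar score.
    """
    rewards = [reward_fn(inject_fn(clean_video, m)) for m in magnitudes]
    return pearson_r(magnitudes, rewards)
```

If the reward is an informative proxy, the correlation should be strongly negative: larger injected errors should yield lower rewards. A value near zero would support the referee's concern that the signal is dominated by predictor noise.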

Circularity Check

0 steps flagged

No significant circularity; reward uses external CV modules


The paper's central construction defines a geometry-consistency reward by applying off-the-shelf optical flow, depth-pose, and feature correspondence networks to separate rigid background motion from dynamic objects and score appearance preservation along trajectories. These modules are pre-trained external components whose outputs are treated as measurements of physical consistency; the reward is then used as an RL objective. No equation or step reduces the reward value to a fitted parameter of the generator, a self-definition, or a self-citation chain. The derivation therefore remains self-contained against independent computer-vision benchmarks rather than being forced by construction from the video model's own outputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that standard off-the-shelf predictors for optical flow, depth, and pose can be trusted to separate rigid and non-rigid motion in synthetic videos; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: In physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects preserve appearance identity along motion trajectories.
    This is the key insight stated in the abstract that underpins the reward design.

pith-pipeline@v0.9.0 · 5732 in / 1310 out tokens · 43559 ms · 2026-05-20T11:02:23.023813+00:00 · methodology



Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 33 internal anchors

  1. [1]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Asim, M., Wewer, C., Wimmer, T., Schiele, B., Lenssen, J.E.: Met3r: Measuring multi-view consistency in generated images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6034–6044 (2025) 12, S2

  2. [2]

    In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasac- chi, A., Lindell, D.B., Tulyakov, S.: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 22875–22889 (2025) 9

  3. [3]

    arXiv preprint arXiv:2407.12781 (2024) 2, 3

    Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control. arXiv preprint arXiv:2407.12781 (2024) 2, 3

  4. [4]

    arXiv preprint arXiv:2412.07760 (2024) 3

    Bai, J., Xia, M., Wang, X., Yuan, Z., Fu, X., Liu, Z., Hu, H., Wan, P., Zhang, D.: SyncamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints. arXiv preprint arXiv:2412.07760 (2024) 3

  5. [5]

    arXiv preprint arXiv:2512.03453 (2025) 4, 11

    Bai, Y., Fang, S., Yu, C., Wang, F., Huang, Q.: Geovideo: Introducing geometric regularization into video generation model. arXiv preprint arXiv:2512.03453 (2025) 4, 11

  6. [6]

    1 kontext: Flow matching for in-context image generation and editing in latent space

    Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., En- glish, J., English, Z., Esser, P., Kulal, S., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025) 1

  7. [7]

    In: Proceed- ings of the Computer Vision and Pattern Recognition Conference

    Bengtson, J., Nilsson, D., Kahl, F.: Geometric consistency refinement for single im- age novel view synthesis via test-time adaptation of diffusion models. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 6399–6408 (2025) 4

  8. [8]

    Training Diffusion Models with Reinforcement Learning

    Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training Diffusion Models with Reinforcement Learning. arXiv preprint arXiv:2305.13301 (2023) 4, 8, 12

  9. [9]

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators3

  10. [10]

    In: Proceedings of the International Conference on Machine Learning (ICML) (2024) 3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 17

    Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative Interac- tive Environments. In: Proceedings of the International Conference on Machine Learning (ICML) (2024) 3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 17

  11. [11]

    arXiv preprint arXiv:2508.21058 (2025) 3

    Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., et al.: Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058 (2025) 3

  12. [12]

    Cao, C., et al.: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025),https://openaccess.thecvf.com/content/ CVPR2025/papers/Cao_MVGenMaster_Scaling_Multi- View_Generation_from_ Any_Image_via_3D_Priors_CVPR_2025_paper.pdf3

  13. [13]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient Geometry-Aware 3D Generative Adversarial Networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16123–16133 (2022) 3

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Pe- riodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5799–5809 (2021) 3

  15. [15]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Chan,E.R.,Nagano,K.,Chan,M.A.,Bergman,A.W.,Park,J.J.,Levy,A.,Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative Novel View Synthesis with 3D-Aware Diffusion Models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4217–4229 (2023) 3

  16. [16]

    Advances in Neural Information Processing Systems37, 24081–24125 (2024) 3

    Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2024) 3

  17. [17]

    arXiv preprint arXiv:2410.18974 (2024),https://arxiv

    Chen, H., Shen, B., Liu, Y., Shi, R., Zhou, L., Lin, C.Z., Gu, J., Su, H., Wetzstein, G., Guibas, L.: 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High- Quality 3D Generation. arXiv preprint arXiv:2410.18974 (2024),https://arxiv. org/abs/2410.189744

  18. [18]

    97141–97166 (2024) 2

    Chen, S., Chen, X., Pang, A., Zeng, X., Cheng, W., Fu, Y., Yin, F., Wang, B., Yu, J., Yu, G., et al.: MeshXL: Neural Coordinate Field for Generative 3D Foundation Models.In:AdvancesinNeuralInformationProcessingSystems(NeurIPS).vol.37, pp. 97141–97166 (2024) 2

  19. [19]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 370–386. Springer (2024) 3

  20. [20]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3DTopia-XL: Scaling High-Quality 3D Asset Generation via Primitive Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26576–26586 (2025) 2

  21. [21]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3d-consistent videos from unposed internet photos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27934–27945 (2025) 3

  22. [22]

    Du, H., Ye, J., Cong, X., Li, R., Ni, J., Agarwal, A., Zhou, Z., Li, Z., Balestriero, R., Wang, Y.: Videogpa: Distilling geometry priors for 3d-consistent video generation (2026),https://arxiv.org/abs/2601.232864, 11

  23. [23]

    In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2025),https://openaccess.thecvf.com/content/CVPR2025/ 18 J

    Edelstein, Y., Patashnik, O., Cohen-Bar, D., Zelnik-Manor, L.: Sharp-It: A Multi- view to Multi-view Diffusion Model for 3D Synthesis and Manipulation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2025),https://openaccess.thecvf.com/content/CVPR2025/ 18 J. Ackermann et al. papers/Edelstein_Sharp-It_A_Mu...

  24. [24]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Gao,J.,Shen,T.,Wang,Z.,Chen,W.,Yin,K.,Li,D.,Litany,O.,Gojcic,Z.,Fidler, S.: GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 31841–31854 (2022) 3

  25. [25]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3D: Create Anything in 3D with Multi-View Diffusion Models. arXiv preprint arXiv:2405.10314 (2024) 3

  26. [26]

    GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure

    Gu, L., Hur, J., Herrmann, C., Zhan, F., Zickler, T., Sun, D., Pfister, H.: Geco: A differentiable geometric consistency metric for video generation. arXiv preprint arXiv:2512.22274 (2025) 4

  27. [27]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., et al.: CameraCtrl: Enabling Camera Control for Text-to-Video Diffusion Models. arXiv preprint arXiv:2404.02101 (2024),https://arxiv.org/abs/2404. 021012, 3

  28. [28]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretrain- ing for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022) 1, 3

  29. [29]

    LRM: Large Reconstruction Model for Single Image to 3D

    Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023),https://arxiv.org/abs/2311.044003

  30. [30]

    Iclr1(2), 3 (2022) 9, S6

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 9, S6

  31. [31]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024) 11, 12, S7

  32. [32]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 9, S6

  33. [33]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 3

  34. [34]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collabo- rative Video Diffusion: Consistent Multi-Video Generation with Camera Control. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 16240–16271 (2024) 3

  35. [35]

    arXiv preprint arXiv:2510.21615 (2025) 4

    Kupyn,O.,Manhardt,F.,Tombari,F.,Rupprecht,C.:Epipolargeometryimproves video generation models. arXiv preprint arXiv:2510.21615 (2025) 4

  36. [36]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Leroy,V.,Cabon,Y.,Revaud,J.:GroundingImageMatchingin3DwithMAST3R. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 71–

  37. [37]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., Zhong, Z.: Mixgrpo: Unlock- ing flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802 (2025) 4

  38. [38]

    Li, P., et al.: Era3D: High-Resolution Multiview Diffusion Using Efficient Re- arrangement Attention. In: Advances in Neural Information Processing Systems (NeurIPS) (2024),https://proceedings.neurips.cc/paper_files/paper/2024/ file/65a723bf7d8dad838c09178270d30e80-Paper-Conference.pdf3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 19

  39. [39]

    Liang, Y., et al.: Rich Human Feedback for Text-to-Image Generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2024),https://openaccess.thecvf.com/content/CVPR2024/ papers/Liang_Rich_Human_Feedback_for_Text- to- Image_Generation_CVPR_ 2024_paper.pdf4

  40. [40]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025) 5, S1, S7

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22166–22176 (2024) 11, S6, S7

  42. [43]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) S4

  43. [44]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-GRPO: Training Flow Matching Models via Online RL. arXiv preprint arXiv:2505.05470 (2025) 2, 3, 4, 5, 7, 8

  44. [45]

    Zero-1-to-3: Zero-shot one image to 3d object.arXiv preprint arXiv:2303.11328, 2023

    Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot One Image to 3D Object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9298–9309 (2023), https://arxiv.org/abs/2303.113283

  45. [46]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. arXiv preprint arXiv:2209.03003 (2023) 7

  46. [47]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Liu, Y., et al.: SyncDreamer: Generating Multiview-Consistent Images from a Single-View Image. arXiv preprint arXiv:2309.03453 (2023),https://arxiv.org/ abs/2309.034533

  47. [48]

    Interna- tional journal of computer vision60(2), 91–110 (2004) S2

    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional journal of computer vision60(2), 91–110 (2004) S2

  48. [49]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Nan, K., Xie, R., Zhou, P., Fan, T., Zheng, Z., Huang, Z., Li, H., Li, J., Li, J.: OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation. arXiv preprint arXiv:2407.02371 (2024) 11, S6, S7

  49. [50]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 5, S7

  50. [51]

    arXiv preprint arXiv:2512.12080 (2025) 3

    Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025) 3

  51. [52]

    Long-context state-space video world models.ArXiv, abs/2505.20171, 2025

    Po, R., Nitzan, Y., Zhang, R., Chen, B., Dao, T., Shechtman, E., Wetzstein, G., Huang, X.: Long-Context State-Space Video World Models. arXiv preprint arXiv:2505.20171 (2025) 3

  52. [53]

    In: NeurIPS (2023) 4

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Di- rect preference optimization: Your language model is secretly a reward model. In: NeurIPS (2023) 4

  53. [54]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 1

  54. [55]

    very scattered

    Sampson, P.D.: Fitting conic sections to “very scattered” data: An iterative refine- ment of the bookstein algorithm. Computer graphics and image processing18(1), 97–108 (1982) 12, S2 20 J. Ackermann et al

  55. [56]

    Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

    Seo, J., Jang, W., Kwak, M.S., Kim, H., Ko, J., Kim, J., Kim, J.H., Lee, J., Kim, S.: Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. arXiv preprint arXiv:2303.07937 (2023) 4

  56. [57]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. In: arXiv (2024) 4

  57. [58]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., Guo, D.: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 (2024) 7

  58. [59]

    MVDream: Multi-view Diffusion for 3D Generation

    Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: MVDream: Multi-view Diffusion for 3D Generation. arXiv preprint arXiv:2308.16512 (2023) 3

  59. [60]

    Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

    Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: Zero123++: Single Image to Consistent Multi-view Diffusion Base Model. arXiv preprint arXiv:2310.15110 (2023), https://arxiv.org/abs/2310.15110 3

  60. [61]

    3D Neural Field Generation using Triplane Diffusion

    Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3D Neural Field Generation using Triplane Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20875–20886 (2023) 3

  61. [62]

    History-Guided Video Diffusion

    Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History-Guided Video Diffusion. arXiv preprint arXiv:2502.06764 (2025) 3

  62. [63]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020) 8

  63. [64]

    LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

    Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2402.05054 3

  64. [65]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) S6

  65. [66]

    Team, G.: Mochi 1. https://github.com/genmoai/models (2024) 3

  66. [67]

    Raft: Recurrent all-pairs field transforms for optical flow

    Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European conference on computer vision. pp. 402–419. Springer (2020) S1, S3

  67. [68]

    TripoSR: Fast 3D Object Reconstruction from a Single Image

    Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., Liang, D., Laforte, C., Jampani, V., Cao, Y.P.: TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv preprint arXiv:2403.02151 (2024), https://arxiv.org/abs/2403.02151 3

  68. [69]

    Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 313–331. Springer (2024) 3

  69. [70]

    Diffusion Model Alignment Using Direct Preference Optimization

    Wallace, B., et al.: Diffusion Model Alignment Using Direct Preference Optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024), https://openaccess.thecvf.com/content/CVPR2024/papers/Wallace_Diffusion_Model_Alignment_Using_Direct_Preference_Optimization_CVPR_2024_paper.pdf 4

  70. [71]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team: Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314 (2025) 2, 3, 9, S6

  71. [72]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347 (2025) S1

  72. [73]

    Waft: Warping-alone field transforms for optical flow

    Wang, Y., Deng, J.: Waft: Warping-alone field transforms for optical flow. arXiv preprint arXiv:2506.21526 (2025) 5, S1, S7

  73. [74]

    Video models are zero-shot learners and reasoners

    Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025) 3

  74. [75]

    Ic-world: In-context generation for shared world modeling

    Wu, F., Wei, J., Li, R., Xu, Y., Li, J., Ye, D., Lin, G.: Ic-world: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793 (2025) 4

  75. [76]

    Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. arXiv preprint arXiv:2507.07982 (2025) 2, 4

  76. [77]

    Cat4D: Create Anything in 4D with Multi-View Video Diffusion Models

    Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4D: Create Anything in 4D with Multi-View Video Diffusion Models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (2025) 3

  77. [78]

    Video World Models with Long-term Spatial Memory

    Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video World Models with Long-term Spatial Memory. arXiv preprint arXiv:2506.05284 (2025) 3

  78. [79]

    WorldMem: Long-Term Consistent World Simulation with Memory

    Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: WorldMem: Long-Term Consistent World Simulation with Memory. arXiv preprint arXiv:2504.12369 (2025) 3

  79. [80]

    Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

    Xie, D., Li, J., Tan, H., Sun, X., Shu, Z., Zhou, Y., Bi, S., Pirk, S., Kaufman, A.E.: Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6369–6379 (2024), https://openaccess.thecvf.com/content/CV...

  80. [81]

    Moving object segmentation: All you need is sam (and flow)

    Xie, J., Yang, C., Xie, W., Zisserman, A.: Moving object segmentation: All you need is sam (and flow). In: Proceedings of the Asian conference on computer vision. pp. 162–178 (2024) S2

Showing first 80 references.