GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

Boyang Deng; Gordon Wetzstein; Jan Ackermann; Shengqu Cai; Songyou Peng; Zhengfei Kuang

arxiv: 2605.18365 · v1 · pith:FAD6OOFHnew · submitted 2026-05-18 · 💻 cs.CV

Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng, Gordon Wetzstein

Pith reviewed 2026-05-20 11:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords: geometric consistency, video generation, text-to-video, reinforcement fine-tuning, optical flow, diffusion models, scene geometry, camera motion

The pith

A geometry-consistency reward makes scene geometry an explicit optimization target for text-to-video models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-video diffusion models trained on web data generate motion that often breaks physical rules, such as objects stretching or backgrounds warping when the camera moves. The paper claims these failures happen because geometry is learned only as a side effect rather than as a direct goal. To fix this, the authors build a reward that checks whether background motion can be explained by rigid camera movement and whether moving objects keep their visual identity along their paths. The reward is computed by combining optical flow to measure pixel motion, depth and pose estimates to recover 3D structure, and feature matching to track object identity. When this reward is added to reinforcement fine-tuning, geometric consistency becomes something the model actively optimizes for rather than a property left to emerge on its own.

Core claim

We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective.
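The rigid-flow half of this check can be illustrated with a small sketch. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a per-pixel depth `z`, and a relative pose `(R, t)` between two frames. All function and parameter names are hypothetical, and the paper's actual depth, pose, and flow models are not reproduced here. A pixel is back-projected to 3D, moved by the camera pose, and re-projected; its displacement is the camera-induced flow, and the gap to the observed optical flow is the residual.

```python
import math

def rigid_flow_at(u, v, z, fx, fy, cx, cy, R, t):
    """Camera-induced flow at pixel (u, v) with depth z (pinhole model, assumed).

    R is a 3x3 rotation (nested lists) and t a length-3 translation that map
    frame-1 camera coordinates to frame-2 camera coordinates.
    """
    # Back-project the pixel into a 3D point in the first camera frame.
    p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
    # Apply the rigid camera motion.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    # Project into the second frame and subtract the starting pixel.
    u2 = fx * q[0] / q[2] + cx
    v2 = fy * q[1] / q[2] + cy
    return (u2 - u, v2 - v)

def flow_residual(observed, rigid):
    """End-point error between observed optical flow and rigid flow."""
    return math.hypot(observed[0] - rigid[0], observed[1] - rigid[1])

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Illustrative values: points shift by +0.1 along x in the camera frame,
# the point sits at depth 2, and the focal length is 100 pixels.
flow = rigid_flow_at(10.0, 20.0, 2.0, 100.0, 100.0, 2.0, 2.0,
                     IDENTITY, [0.1, 0.0, 0.0])
```

In a consistent video, a background pixel should give a residual near zero. A large residual points to either an independently moving object or a geometric artifact.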

What carries the argument

The geometry-consistency reward that separates rigid background regions from dynamic objects and scores each for physical plausibility using optical flow, depth-pose predictions, and feature-based correspondence.
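One way to picture the rigid/dynamic split is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the threshold `tau`, the use of cosine similarity, and every function name are hypothetical. Pixels whose flow residual exceeds `tau` are treated as dynamic, and appearance identity is scored by comparing feature vectors at the start and end of each motion trajectory.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def dynamic_mask(residuals, tau):
    """Mark pixels whose flow residual exceeds the (hypothetical) threshold tau."""
    return [r > tau for r in residuals]

def appearance_score(src_feats, dst_feats, mask):
    """Mean feature similarity along trajectories, over dynamic pixels only."""
    sims = [cosine_similarity(a, b)
            for a, b, is_dynamic in zip(src_feats, dst_feats, mask) if is_dynamic]
    # With no dynamic pixels there is nothing to penalize (a design choice here).
    return sum(sims) / len(sims) if sims else 1.0
```

A score near 1 means the features carried along the motion paths stay similar; a low score suggests identity drift or morphing in the dynamic region.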

If this is right

Generated videos exhibit fewer temporal geometric artifacts including object stretching and texture drift under camera motion.
The method applies to diverse scenes containing both camera movement and independently moving objects.
The reward can be added to existing video generators without altering their core architecture.
Perceptual quality metrics remain comparable to the original models after fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same separation of rigid and dynamic regions could be reused to create evaluation benchmarks that specifically target 3D consistency rather than just pixel-level similarity.
Combining this reward with other explicit signals such as lighting or material consistency might produce videos that obey multiple physical constraints at once.
Because the reward is model-agnostic, it could be applied during continued training of future larger-scale video generators to maintain consistency as model capacity grows.

Load-bearing premise

The approach assumes that optical flow, depth-pose predictions, and feature-based correspondence can reliably separate rigid background regions from dynamic objects and accurately evaluate their respective consistency.

What would settle it

Fine-tuning a video model with the reward and then measuring no reduction in geometric artifacts such as object deformation or background warping on held-out test videos would show the reward does not achieve its claimed effect.

Figures

Figures reproduced from arXiv: 2605.18365 by Boyang Deng, Gordon Wetzstein, Jan Ackermann, Shengqu Cai, Songyou Peng, Zhengfei Kuang.

**Figure 2.** GeoFlow method overview. (A) A video diffusion model πθ generates G candidate videos from text prompts using the same initial noise. (B) A monocular depth model predicts depth maps, while an optical flow model estimates the flow between two frames. Rigid flow is derived from the predicted depth and compared with the estimated optical flow. The resulting residual flow is combined with the discrepancy betwee…

**Figure 3.** Qualitative comparison. Each row-pair shows sampled frames from a generated video for a different prompt, comparing a baseline model (top) with our fine-tuned model (bottom). The baselines exhibit a range of inconsistencies: objects dissolve or vanish (examples 1, 4), scene layouts shift unexpectedly (ex. 3), identities drift over time (ex. 5), and object structures morph beyond recognition (ex. 2, 3). Ou…
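Panel (A) of Figure 2 describes sampling G candidate videos per prompt. In group-based reinforcement fine-tuning, a common way to turn per-candidate rewards into a training signal is to normalize each reward against its own group. The sketch below assumes that scheme, which the truncated caption does not confirm; the function name and `eps` are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize the rewards of G candidates sampled for one prompt.

    Assumed scheme: subtract the group mean and divide by the group
    standard deviation, so candidates are only compared with each other.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every candidate scores the same.
    return [(r - mean) / (std + eps) for r in rewards]

# Illustrative rewards for G = 3 candidates of a single prompt.
advantages = group_advantages([1.0, 2.0, 3.0])
```

Under this scheme, candidates that are more geometrically consistent than their siblings get positive advantages and the rest get negative ones, regardless of the absolute reward scale.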

Original abstract

Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes, or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GeoFlow turns geometric consistency into an explicit RL reward using off-the-shelf flow, depth, and correspondence, which is a practical step but rests on those predictors staying accurate on the flawed videos it aims to fix.

GeoFlow's core move is to define a reward that scores whether generated video motion matches a coherent 3D scene: background flow should be rigid and camera-driven, while moving objects keep their appearance along trajectories. They compute this with optical flow, depth-pose, and feature matches, then optimize via reinforcement fine-tuning. This makes consistency a direct objective instead of an implicit side effect of training on web data. The method is presented as model-agnostic and usable on scenes with both camera and object motion. Releasing code and weights is a clear positive that lets others verify and build on it. The abstract reports substantial drops in deformation and drift artifacts while holding perceptual quality steady, which would be useful if the numbers and controls hold up in the full paper.

The main soft spot is exactly the one in the stress test. The reward depends on pre-trained CV modules that were fit to real footage. On diffusion outputs that already contain warping, texture drift, or non-rigid backgrounds, those modules can easily give noisy or systematically wrong signals. If that happens, RL fine-tuning risks optimizing against the reward's own errors rather than actual geometry. The paper would be stronger with explicit checks on predictor reliability on its own generations or on controlled failure cases. Without those, the central claim is harder to trust at face value.

This is aimed at researchers and engineers working on video diffusion models who need better 3D fidelity for downstream uses like editing or simulation. Readers who already work with RL fine-tuning or consistency losses will see the most direct value. It has a concrete new objective, empirical claims, and open resources, so it deserves a serious referee to examine the reward details, the balancing with perceptual loss, and the strength of the quantitative results.

Referee Report

2 major / 2 minor

Summary. The paper introduces GeoFlow, a geometry-consistency reward for text-to-video diffusion models. The reward operationalizes physical consistency by using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid background motion (explainable by camera pose) from dynamic object motion (appearance-preserving trajectories). This reward is combined with reinforcement fine-tuning to make geometric consistency an explicit optimization target rather than an emergent property. The method is presented as model-agnostic and applicable to dynamic scenes with both camera and object motion. Experiments are claimed to show substantial reductions in temporal geometric artifacts (object deformation, texture drift, non-rigid backgrounds) while preserving perceptual quality, with code and model weights released.

Significance. If the results hold, the work provides a concrete mechanism for injecting geometric priors into generative video models via RL, moving beyond implicit learning from web data. The reliance on off-the-shelf CV modules for the reward is a strength, as it avoids self-referential or fitted parameters and enables falsifiable evaluation. Releasing code and weights supports reproducibility. The approach could influence downstream applications requiring reliable 3D-consistent video, such as simulation or robotics, provided the reward signal proves robust.

major comments (2)

[§3] §3 (Reward Formulation): The central claim requires that the geometry reward accurately scores consistency even when inputs contain the very artifacts it targets. Because the optical flow, depth-pose, and correspondence networks are pre-trained on real footage, their behavior on diffusion outputs with deformations and drift must be validated; otherwise the RL stage may optimize against predictor noise rather than scene geometry. The manuscript should include a controlled study (e.g., injecting known geometric errors and measuring reward correlation) to establish that the proxy remains informative.
[§4] §4 (Experiments): The abstract asserts 'substantial reductions in temporal geometric artifacts' and preservation of perceptual quality, yet the strength of this claim depends on the specific metrics, baselines, and ablations reported. Quantitative results comparing against prior consistency methods, together with ablations isolating the rigid/dynamic separation and the RL reward weighting, are needed to confirm that improvements are attributable to the proposed objective rather than other factors.

minor comments (2)

[§3] Notation for the combined reward (rigid flow term + dynamic trajectory term) should be introduced with an explicit equation early in §3 to aid readability.
[Figures] Figure captions describing qualitative results should explicitly label which artifacts are reduced (e.g., 'non-rigid background warping') for direct comparison with the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and constructive feedback on the manuscript. We address each major comment below and outline the revisions planned for the next version.

Referee: [§3] §3 (Reward Formulation): The central claim requires that the geometry reward accurately scores consistency even when inputs contain the very artifacts it targets. Because the optical flow, depth-pose, and correspondence networks are pre-trained on real footage, their behavior on diffusion outputs with deformations and drift must be validated; otherwise the RL stage may optimize against predictor noise rather than scene geometry. The manuscript should include a controlled study (e.g., injecting known geometric errors and measuring reward correlation) to establish that the proxy remains informative.

Authors: We agree that validating the reward's behavior on artifact-containing diffusion outputs is essential to ensure the RL stage optimizes for genuine geometric consistency rather than noise from the pre-trained predictors. In the revised manuscript we will add a controlled study: we will start from real videos, synthetically inject graded geometric errors (controlled object deformations, texture drifts, and non-rigid background motion), and report the Pearson correlation between the resulting reward scores and the magnitude of the injected errors. We will also quantify the accuracy drop of the off-the-shelf optical-flow, depth-pose, and correspondence modules when evaluated directly on generated samples. revision: yes
Referee: [§4] §4 (Experiments): The abstract asserts 'substantial reductions in temporal geometric artifacts' and preservation of perceptual quality, yet the strength of this claim depends on the specific metrics, baselines, and ablations reported. Quantitative results comparing against prior consistency methods, together with ablations isolating the rigid/dynamic separation and the RL reward weighting, are needed to confirm that improvements are attributable to the proposed objective rather than other factors.

Authors: We acknowledge that stronger quantitative support and targeted ablations are needed to substantiate the claims. In the revised experiments section we will add direct comparisons against prior consistency-enhancement methods, reporting numerical scores on geometric-consistency metrics (rigid-region flow error, appearance-preservation along trajectories) and perceptual-quality metrics (CLIP similarity, FID, and a small-scale user study). We will also include ablations that separately disable the rigid/dynamic separation and vary the RL reward weighting, thereby isolating the contribution of the proposed objective. revision: yes
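The controlled study proposed in the first response reduces to a correlation between injected error magnitude and reward. A minimal sketch of that computation follows. The function name is hypothetical, and the numbers are placeholders chosen for illustration, not measurements from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Placeholder data: graded error magnitudes and the reward at each level.
error_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
rewards = [0.9, 0.7, 0.5, 0.3, 0.1]
r = pearson(error_levels, rewards)
```

An informative reward should fall as the injected error grows, so a strongly negative `r` would support the proxy. A value near zero would suggest the reward is tracking predictor noise instead.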

Circularity Check

0 steps flagged

No significant circularity; reward uses external CV modules

The paper's central construction defines a geometry-consistency reward by applying off-the-shelf optical flow, depth-pose, and feature correspondence networks to separate rigid background motion from dynamic objects and score appearance preservation along trajectories. These modules are pre-trained external components whose outputs are treated as measurements of physical consistency; the reward is then used as an RL objective. No equation or step reduces the reward value to a fitted parameter of the generator, a self-definition, or a self-citation chain. The derivation therefore remains self-contained against independent computer-vision benchmarks rather than being forced by construction from the video model's own outputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that standard off-the-shelf predictors for optical flow, depth, and pose can be trusted to separate rigid and non-rigid motion in synthetic videos; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: In physically consistent videos, background motion should be explainable by rigid camera-induced flow while independently moving objects preserve appearance identity along motion trajectories.
This is the key insight stated in the abstract that underpins the reward design.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories... using optical flow, depth–pose predictions, and feature-based correspondence to separate rigid and dynamic regions

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · refines

refines: Relation between the paper passage and the cited Recognition theorem.

R_geo = 1/|Ω| Σ Q_geo(u) − 1 ... Q_geo(u) = (1 − min(Ē_epe(u),1)) · (1 − min(E_depth(u),1))
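The per-pixel term Q_geo quoted above can be written out directly. The sketch below covers only the parts of the formula that are visible in the excerpt: the clamped product and its average over the pixel set Ω. The elided portion is not reconstructed, and treating Ē_epe and E_depth as non-negative, pre-normalized error values is an assumption. Function names are hypothetical.

```python
def q_geo(epe, depth_err):
    """Per-pixel quality: (1 - min(epe, 1)) * (1 - min(depth_err, 1)).

    Both errors are assumed non-negative; each is clamped at 1, so either
    error saturating drives the quality of that pixel to zero.
    """
    return (1.0 - min(epe, 1.0)) * (1.0 - min(depth_err, 1.0))

def mean_q_geo(epe_values, depth_err_values):
    """Average of q_geo over a flat list of pixels (the 1/|Omega| sum)."""
    if not epe_values or len(epe_values) != len(depth_err_values):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(q_geo(e, d) for e, d in zip(epe_values, depth_err_values))
    return total / len(epe_values)
```

Because the two factors are multiplied, a pixel scores well only when both its flow error and its depth error are small; a large error in either one cannot be offset by the other.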

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 33 internal anchors

[1]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Asim, M., Wewer, C., Wimmer, T., Schiele, B., Lenssen, J.E.: Met3r: Measuring multi-view consistency in generated images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6034–6044 (2025) 12, S2

work page 2025
[2]

In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)

Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasac- chi, A., Lindell, D.B., Tulyakov, S.: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 22875–22889 (2025) 9

work page 2025
[3]

arXiv preprint arXiv:2407.12781 (2024) 2, 3

Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control. arXiv preprint arXiv:2407.12781 (2024) 2, 3

work page arXiv 2024
[4]

arXiv preprint arXiv:2412.07760 (2024) 3

Bai, J., Xia, M., Wang, X., Yuan, Z., Fu, X., Liu, Z., Hu, H., Wan, P., Zhang, D.: SyncamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints. arXiv preprint arXiv:2412.07760 (2024) 3

work page arXiv 2024
[5]

arXiv preprint arXiv:2512.03453 (2025) 4, 11

Bai, Y., Fang, S., Yu, C., Wang, F., Huang, Q.: Geovideo: Introducing geometric regularization into video generation model. arXiv preprint arXiv:2512.03453 (2025) 4, 11

work page arXiv 2025
[6]

1 kontext: Flow matching for in-context image generation and editing in latent space

Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., En- glish, J., English, Z., Esser, P., Kulal, S., et al.: Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025) 1

work page 2025
[7]

In: Proceed- ings of the Computer Vision and Pattern Recognition Conference

Bengtson, J., Nilsson, D., Kahl, F.: Geometric consistency refinement for single im- age novel view synthesis via test-time adaptation of diffusion models. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 6399–6408 (2025) 4

work page 2025
[8]

Training Diffusion Models with Reinforcement Learning

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training Diffusion Models with Reinforcement Learning. arXiv preprint arXiv:2305.13301 (2023) 4, 8, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators3

work page 2024
[10]

In: Proceedings of the International Conference on Machine Learning (ICML) (2024) 3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 17

Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative Interac- tive Environments. In: Proceedings of the International Conference on Machine Learning (ICML) (2024) 3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 17

work page 2024
[11]

arXiv preprint arXiv:2508.21058 (2025) 3

Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., et al.: Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058 (2025) 3

work page arXiv 2025
[12]

Cao, C., et al.: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025),https://openaccess.thecvf.com/content/ CVPR2025/papers/Cao_MVGenMaster_Scaling_Multi- View_Generation_from_ Any_Image_via_3D_Priors_CVPR_2025_paper.pdf3

work page 2025
[13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient Geometry-Aware 3D Generative Adversarial Networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16123–16133 (2022) 3

work page 2022
[14]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Pe- riodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5799–5809 (2021) 3

work page 2021
[15]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Chan,E.R.,Nagano,K.,Chan,M.A.,Bergman,A.W.,Park,J.J.,Levy,A.,Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative Novel View Synthesis with 3D-Aware Diffusion Models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4217–4229 (2023) 3

work page 2023
[16]

Advances in Neural Information Processing Systems37, 24081–24125 (2024) 3

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2024) 3

work page 2024
[17]

arXiv preprint arXiv:2410.18974 (2024),https://arxiv

Chen, H., Shen, B., Liu, Y., Shi, R., Zhou, L., Lin, C.Z., Gu, J., Su, H., Wetzstein, G., Guibas, L.: 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High- Quality 3D Generation. arXiv preprint arXiv:2410.18974 (2024),https://arxiv. org/abs/2410.189744

work page arXiv 2024
[18]

97141–97166 (2024) 2

Chen, S., Chen, X., Pang, A., Zeng, X., Cheng, W., Fu, Y., Yin, F., Wang, B., Yu, J., Yu, G., et al.: MeshXL: Neural Coordinate Field for Generative 3D Foundation Models.In:AdvancesinNeuralInformationProcessingSystems(NeurIPS).vol.37, pp. 97141–97166 (2024) 2

work page 2024
[19]

In: Proceedings of the European Conference on Computer Vision (ECCV)

Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 370–386. Springer (2024) 3

work page 2024
[20]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3DTopia-XL: Scaling High-Quality 3D Asset Generation via Primitive Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26576–26586 (2025) 2

work page 2025
[21]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3d-consistent videos from unposed internet photos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27934–27945 (2025) 3

work page 2025
[22]

Du, H., Ye, J., Cong, X., Li, R., Ni, J., Agarwal, A., Zhou, Z., Li, Z., Balestriero, R., Wang, Y.: Videogpa: Distilling geometry priors for 3d-consistent video generation (2026),https://arxiv.org/abs/2601.232864, 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2025),https://openaccess.thecvf.com/content/CVPR2025/ 18 J

Edelstein, Y., Patashnik, O., Cohen-Bar, D., Zelnik-Manor, L.: Sharp-It: A Multi- view to Multi-view Diffusion Model for 3D Synthesis and Manipulation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2025),https://openaccess.thecvf.com/content/CVPR2025/ 18 J. Ackermann et al. papers/Edelstein_Sharp-It_A_Mu...

work page 2025
[24]

In: Advances in Neural Information Processing Systems (NeurIPS)

Gao,J.,Shen,T.,Wang,Z.,Chen,W.,Yin,K.,Li,D.,Litany,O.,Gojcic,Z.,Fidler, S.: GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 31841–31854 (2022) 3

work page 2022
[25]

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3D: Create Anything in 3D with Multi-View Diffusion Models. arXiv preprint arXiv:2405.10314 (2024) 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure

Gu, L., Hur, J., Herrmann, C., Zhan, F., Zickler, T., Sun, D., Pfister, H.: Geco: A differentiable geometric consistency metric for video generation. arXiv preprint arXiv:2512.22274 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

He, H., et al.: CameraCtrl: Enabling Camera Control for Text-to-Video Diffusion Models. arXiv preprint arXiv:2404.02101 (2024),https://arxiv.org/abs/2404. 021012, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretrain- ing for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022) 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

LRM: Large Reconstruction Model for Single Image to 3D

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023),https://arxiv.org/abs/2311.044003

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Iclr1(2), 3 (2022) 9, S6

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022) 9, S6

work page 2022
[31]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024) 11, 12, S7

work page 2024
[32]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 9, S6

work page internal anchor Pith review Pith/arXiv arXiv 2014
[33]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

In: Advances in Neural Information Processing Systems (NeurIPS)

Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collabo- rative Video Diffusion: Consistent Multi-Video Generation with Camera Control. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 16240–16271 (2024) 3

work page 2024
[35]

arXiv preprint arXiv:2510.21615 (2025) 4

Kupyn,O.,Manhardt,F.,Tombari,F.,Rupprecht,C.:Epipolargeometryimproves video generation models. arXiv preprint arXiv:2510.21615 (2025) 4

work page arXiv 2025
[36]

In: Proceedings of the European Conference on Computer Vision (ECCV)

Leroy,V.,Cabon,Y.,Revaud,J.:GroundingImageMatchingin3DwithMAST3R. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 71–

work page
[37]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., Zhong, Z.: Mixgrpo: Unlock- ing flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Li, P., et al.: Era3D: High-Resolution Multiview Diffusion Using Efficient Re- arrangement Attention. In: Advances in Neural Information Processing Systems (NeurIPS) (2024),https://proceedings.neurips.cc/paper_files/paper/2024/ file/65a723bf7d8dad838c09178270d30e80-Paper-Conference.pdf3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 19

work page 2024
[39]

Liang, Y., et al.: Rich Human Feedback for Text-to-Image Generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2024),https://openaccess.thecvf.com/content/CVPR2024/ papers/Liang_Rich_Human_Feedback_for_Text- to- Image_Generation_CVPR_ 2024_paper.pdf4

work page 2024
[40]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025) 5, S1, S7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22166–22176 (2024) 11, S6, S7

[43]

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow Matching for Generative Modeling. arXiv preprint arXiv:2210.02747 (2022)

[44]

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-GRPO: Training Flow Matching Models via Online RL. arXiv preprint arXiv:2505.05470 (2025)

[45]

Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot One Image to 3D Object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9298–9309 (2023), https://arxiv.org/abs/2303.11328

[46]

Liu, X., Gong, C., Liu, Q.: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. arXiv preprint arXiv:2209.03003 (2023)

[47]

Liu, Y., et al.: SyncDreamer: Generating Multiview-Consistent Images from a Single-View Image. arXiv preprint arXiv:2309.03453 (2023), https://arxiv.org/abs/2309.03453

[48]

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)

[49]

Nan, K., Xie, R., Zhou, P., Fan, T., Zheng, Z., Huang, Z., Li, H., Li, J., Li, J.: OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation. arXiv preprint arXiv:2407.02371 (2024)

[50]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193 (2023)

[51]

Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025)

[52]

Po, R., Nitzan, Y., Zhang, R., Chen, B., Dao, T., Shechtman, E., Wetzstein, G., Huang, X.: Long-Context State-Space Video World Models. arXiv preprint arXiv:2505.20171 (2025)

[53]

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. In: NeurIPS (2023)

[54]

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

[55]

Sampson, P.D.: Fitting conic sections to “very scattered” data: An iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing 18(1), 97–108 (1982)

[56]

Seo, J., Jang, W., Kwak, M.S., Kim, H., Ko, J., Kim, J., Kim, J.H., Lee, J., Kim, S.: Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. arXiv preprint arXiv:2303.07937 (2023)

[57]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. In: arXiv (2024)

[58]

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., Guo, D.: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 (2024)

[59]

Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: MVDream: Multi-view Diffusion for 3D Generation. arXiv preprint arXiv:2308.16512 (2023)

[60]

Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: Zero123++: A Single Image to Consistent Multi-view Diffusion Base Model. arXiv preprint arXiv:2310.15110 (2023), https://arxiv.org/abs/2310.15110

[61]

Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3D Neural Field Generation using Triplane Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20875–20886 (2023)

[62]

Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History-Guided Video Diffusion. arXiv preprint arXiv:2502.06764 (2025)

[63]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-Based Generative Modeling through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456 (2020)

[64]

Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2402.05054

[65]

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805 (2023)

[66]

Team, G.: Mochi 1. https://github.com/genmoai/models (2024)

[67]

Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision. pp. 402–419. Springer (2020)

[68]

Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., Liang, D., Laforte, C., Jampani, V., Cao, Y.P.: TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv preprint arXiv:2403.02151 (2024), https://arxiv.org/abs/2403.02151

[69]

Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 313–331. Springer (2024)

[70]

Wallace, B., et al.: Diffusion Model Alignment Using Direct Preference Optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024), https://openaccess.thecvf.com/content/CVPR2024/papers/Wallace_Diffusion_Model_Alignment_Using_Direct_Preference_Optimization_CVPR_2024_paper.pdf

[71]

Wan Team: Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314 (2025)

[72]

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347 (2025)

[73]

Wang, Y., Deng, J.: WAFT: Warping-alone field transforms for optical flow. arXiv preprint arXiv:2506.21526 (2025)

[74]

Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025)

[75]

Wu, F., Wei, J., Li, R., Xu, Y., Li, J., Ye, D., Lin, G.: IC-World: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793 (2025)

[76]

Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. arXiv preprint arXiv:2507.07982 (2025)

[77]

Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (2025)

[78]

Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video World Models with Long-term Spatial Memory. arXiv preprint arXiv:2506.05284 (2025)

[79]

Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: WorldMem: Long-Term Consistent World Simulation with Memory. arXiv preprint arXiv:2504.12369 (2025)

[80]

Xie, D., Li, J., Tan, H., Sun, X., Shu, Z., Zhou, Y., Bi, S., Pirk, S., Kaufman, A.E.: Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6369–6379 (2024), https://openaccess.thecvf.com/content/CV...

[81]

Xie, J., Yang, C., Xie, W., Zisserman, A.: Moving object segmentation: All you need is SAM (and flow). In: Proceedings of the Asian Conference on Computer Vision. pp. 162–178 (2024)


[1]

Asim, M., Wewer, C., Wimmer, T., Schiele, B., Lenssen, J.E.: MEt3R: Measuring multi-view consistency in generated images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6034–6044 (2025)

[2]

Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22875–22889 (2025)

[3]

Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control. arXiv preprint arXiv:2407.12781 (2024)

[4]

Bai, J., Xia, M., Wang, X., Yuan, Z., Fu, X., Liu, Z., Hu, H., Wan, P., Zhang, D.: SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints. arXiv preprint arXiv:2412.07760 (2024)

[5]

Bai, Y., Fang, S., Yu, C., Wang, F., Huang, Q.: GeoVideo: Introducing geometric regularization into video generation model. arXiv preprint arXiv:2512.03453 (2025)

[6]

Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025)

[7]

Bengtson, J., Nilsson, D., Kahl, F.: Geometric consistency refinement for single image novel view synthesis via test-time adaptation of diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6399–6408 (2025)

[8]

Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training Diffusion Models with Reinforcement Learning. arXiv preprint arXiv:2305.13301 (2023)

[9]

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators

[10]

Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative Interactive Environments. In: Proceedings of the International Conference on Machine Learning (ICML) (2024)

[11]

Cai, S., Yang, C., Zhang, L., Guo, Y., Xiao, J., Yang, Z., Xu, Y., Yang, Z., Yuille, A., Guibas, L., et al.: Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058 (2025)

[12]

Cao, C., et al.: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025), https://openaccess.thecvf.com/content/CVPR2025/papers/Cao_MVGenMaster_Scaling_Multi-View_Generation_from_Any_Image_via_3D_Priors_CVPR_2025_paper.pdf

[13]

Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient Geometry-Aware 3D Generative Adversarial Networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 16123–16133 (2022)

[14]

Chan, E.R., Monteiro, M., Kellnhofer, P., Wu, J., Wetzstein, G.: pi-GAN: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5799–5809 (2021)

[15]

Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala, M., De Mello, S., Karras, T., Wetzstein, G.: Generative Novel View Synthesis with 3D-Aware Diffusion Models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4217–4229 (2023)

[16]

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024)

[17]

Chen, H., Shen, B., Liu, Y., Shi, R., Zhou, L., Lin, C.Z., Gu, J., Su, H., Wetzstein, G., Guibas, L.: 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation. arXiv preprint arXiv:2410.18974 (2024), https://arxiv.org/abs/2410.18974

[18]

Chen, S., Chen, X., Pang, A., Zeng, X., Cheng, W., Fu, Y., Yin, F., Wang, B., Yu, J., Yu, G., et al.: MeshXL: Neural Coordinate Field for Generative 3D Foundation Models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 97141–97166 (2024)

[19]

Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 370–386. Springer (2024)

[20]

Chen, Z., Tang, J., Dong, Y., Cao, Z., Hong, F., Lan, Y., Wang, T., Xie, H., Wu, T., Saito, S., et al.: 3DTopia-XL: Scaling High-Quality 3D Asset Generation via Primitive Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26576–26586 (2025)

[21]

Chou, G., Zhang, K., Bi, S., Tan, H., Xu, Z., Luan, F., Hariharan, B., Snavely, N.: Generating 3D-consistent videos from unposed internet photos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27934–27945 (2025)

[22]

Du, H., Ye, J., Cong, X., Li, R., Ni, J., Agarwal, A., Zhou, Z., Li, Z., Balestriero, R., Wang, Y.: VideoGPA: Distilling geometry priors for 3D-consistent video generation (2026), https://arxiv.org/abs/2601.23286

[23]

Edelstein, Y., Patashnik, O., Cohen-Bar, D., Zelnik-Manor, L.: Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025), https://openaccess.thecvf.com/content/CVPR2025/papers/Edelstein_Sharp-It_A_Mu...

[24]

Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 35, pp. 31841–31854 (2022)

[25]

Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: CAT3D: Create Anything in 3D with Multi-View Diffusion Models. arXiv preprint arXiv:2405.10314 (2024)

[26]

Gu, L., Hur, J., Herrmann, C., Zhan, F., Zickler, T., Sun, D., Pfister, H.: GeCo: A differentiable geometric consistency metric for video generation. arXiv preprint arXiv:2512.22274 (2025)

[27]

He, H., et al.: CameraCtrl: Enabling Camera Control for Text-to-Video Diffusion Models. arXiv preprint arXiv:2404.02101 (2024), https://arxiv.org/abs/2404.02101

[28]

Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

[29]

Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: LRM: Large Reconstruction Model for Single Image to 3D. arXiv preprint arXiv:2311.04400 (2023), https://arxiv.org/abs/2311.04400

[30]

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[31]

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

[32]

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

[33]

Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

[34]

Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collaborative Video Diffusion: Consistent Multi-Video Generation with Camera Control. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 37, pp. 16240–16271 (2024)

[35] [35]

arXiv preprint arXiv:2510.21615 (2025) 4

Kupyn,O.,Manhardt,F.,Tombari,F.,Rupprecht,C.:Epipolargeometryimproves video generation models. arXiv preprint arXiv:2510.21615 (2025) 4

work page arXiv 2025

[36] [36]

In: Proceedings of the European Conference on Computer Vision (ECCV)

Leroy,V.,Cabon,Y.,Revaud,J.:GroundingImageMatchingin3DwithMAST3R. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 71–

work page

[37] [37]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., Zhong, Z.: Mixgrpo: Unlock- ing flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Li, P., et al.: Era3D: High-Resolution Multiview Diffusion Using Efficient Re- arrangement Attention. In: Advances in Neural Information Processing Systems (NeurIPS) (2024),https://proceedings.neurips.cc/paper_files/paper/2024/ file/65a723bf7d8dad838c09178270d30e80-Paper-Conference.pdf3 GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation 19

work page 2024

[39] [39]

Liang, Y., et al.: Rich Human Feedback for Text-to-Image Generation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR) (2024),https://openaccess.thecvf.com/content/CVPR2024/ papers/Liang_Rich_Human_Feedback_for_Text- to- Image_Generation_CVPR_ 2024_paper.pdf4

work page 2024

[40] [40]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025) 5, S1, S7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22166–22176 (2024) 11, S6, S7

work page 2024

[42] [43]

Flow Matching for Generative Modeling

Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) S4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[43] [44]

Flow-GRPO: Training Flow Matching Models via Online RL

Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-GRPO: Training Flow Matching Models via Online RL. arXiv preprint arXiv:2505.05470 (2025) 2, 3, 4, 5, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [45]

Zero-1-to-3: Zero-shot one image to 3d object.arXiv preprint arXiv:2303.11328, 2023

Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero- 1-to-3: Zero-shot One Image to 3D Object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9298–9309 (2023), https://arxiv.org/abs/2303.113283

work page arXiv 2023

[45] [46]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Liu, X., Gong, C., Liu, Q.: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. arXiv preprint arXiv:2209.03003 (2023) 7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [47]

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Liu, Y., et al.: SyncDreamer: Generating Multiview-Consistent Images from a Single-View Image. arXiv preprint arXiv:2309.03453 (2023),https://arxiv.org/ abs/2309.034533

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [48]

Interna- tional journal of computer vision60(2), 91–110 (2004) S2

Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional journal of computer vision60(2), 91–110 (2004) S2

work page 2004

[48] [49]

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Nan, K., Xie, R., Zhou, P., Fan, T., Zheng, Z., Huang, Z., Li, H., Li, J., Li, J.: OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-Video Generation. arXiv preprint arXiv:2407.02371 (2024) 11, S6, S7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [50]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 5, S7

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [51]

arXiv preprint arXiv:2512.12080 (2025) 3

Po, R., Chan, E.R., Chen, C., Wetzstein, G.: Bagger: Backwards aggregation for mitigating drift in autoregressive video diffusion models. arXiv preprint arXiv:2512.12080 (2025) 3

work page arXiv 2025

[51] [52]

Long-context state-space video world models.ArXiv, abs/2505.20171, 2025

Po, R., Nitzan, Y., Zhang, R., Chen, B., Dao, T., Shechtman, E., Wetzstein, G., Huang, X.: Long-Context State-Space Video World Models. arXiv preprint arXiv:2505.20171 (2025) 3

work page arXiv 2025

[52] [53]

In: NeurIPS (2023) 4

Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Di- rect preference optimization: Your language model is secretly a reward model. In: NeurIPS (2023) 4

work page 2023

[53] [54]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 1

work page 2022

[54] [55]

very scattered

Sampson, P.D.: Fitting conic sections to “very scattered” data: An iterative refine- ment of the bookstein algorithm. Computer graphics and image processing18(1), 97–108 (1982) 12, S2 20 J. Ackermann et al

work page 1982

[55] [56]

arXiv preprint arXiv:2303.07937 (2023) 4

Seo, J., Jang, W., Kwak, M.S., Kim, H., Ko, J., Kim, J., Kim, J.H., Lee, J., Kim, S.: Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation. arXiv preprint arXiv:2303.07937 (2023) 4

work page arXiv 2023

[56] [57]

In: arXiv (2024) 4

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. In: arXiv (2024) 4

work page 2024

[57] [58]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y., Wu, Y., Guo, D.: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300 (2024) 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [59]

MVDream: Multi-view Diffusion for 3D Generation

Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: MVDream: Multi-view Dif- fusion for 3D Generation. arXiv preprint arXiv:2308.16512 (2023) 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [60]

Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: Zero123++: Single Image to Consistent Multi-view Diffusion Base Model. arXiv preprint arXiv:2310.15110 (2023),https://arxiv.org/abs/2310.151103

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [61]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Shue, J.R., Chan, E.R., Po, R., Ankner, Z., Wu, J., Wetzstein, G.: 3D Neural Field Generation using Triplane Diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20875–20886 (2023) 3

work page 2023

[61] [62]

History-Guided Video Diffusion

Song, K., Chen, B., Simchowitz, M., Du, Y., Tedrake, R., Sitzmann, V.: History- Guided Video Diffusion. arXiv preprint arXiv:2502.06764 (2025) 3

[62] [63]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-Based Generative Modeling through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456 (2020) 8

[63] [64]

Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024), https://arxiv.org/abs/2402.05054 3

[64] [65]

Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) S6

[65] [66]

Team, G.: Mochi 1. https://github.com/genmoai/models (2024) 3

[66] [67]

Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European conference on computer vision. pp. 402–419. Springer (2020) S1, S3

[67] [68]

Tochilkin, D., Pankratz, D., Liu, Z., Huang, Z., Letts, A., Li, Y., Liang, D., Laforte, C., Jampani, V., Cao, Y.P.: TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv preprint arXiv:2403.02151 (2024), https://arxiv.org/abs/2403.02151 3

[68] [69]

Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 313–331. Springer (2024) 3

[69] [70]

Wallace, B., et al.: Diffusion Model Alignment Using Direct Preference Optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024), https://openaccess.thecvf.com/content/CVPR2024/papers/Wallace_Diffusion_Model_Alignment_Using_Direct_Preference_Optimization_CVPR_2024_paper.pdf 4

[70] [71]

Wan Team: Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314 (2025) 2, 3, 9, S6

[71] [72]

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: $\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347 (2025) S1

[72] [73]

Wang, Y., Deng, J.: Waft: Warping-alone field transforms for optical flow. arXiv preprint arXiv:2506.21526 (2025) 5, S1, S7

[73] [74]

Wiedemer, T., Li, Y., Vicol, P., Gu, S.S., Matarese, N., Swersky, K., Kim, B., Jaini, P., Geirhos, R.: Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328 (2025) 3

[74] [75]

Wu, F., Wei, J., Li, R., Xu, Y., Li, J., Ye, D., Lin, G.: Ic-world: In-context generation for shared world modeling. arXiv preprint arXiv:2512.02793 (2025) 4

[75] [76]

Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. arXiv preprint arXiv:2507.07982 (2025) 2, 4

[76] [77]

Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4D: Create Anything in 4D with Multi-View Video Diffusion Models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (2025) 3

[77] [78]

Wu, T., Yang, S., Po, R., Xu, Y., Liu, Z., Lin, D., Wetzstein, G.: Video World Models with Long-term Spatial Memory. arXiv preprint arXiv:2506.05284 (2025) 3

[78] [79]

Xiao, Z., Lan, Y., Zhou, Y., Ouyang, W., Yang, S., Zeng, Y., Pan, X.: WorldMem: Long-Term Consistent World Simulation with Memory. arXiv preprint arXiv:2504.12369 (2025) 3

[79] [80]

Xie, D., Li, J., Tan, H., Sun, X., Shu, Z., Zhou, Y., Bi, S., Pirk, S., Kaufman, A.E.: Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6369–6379 (2024), https://openaccess.thecvf.com/content/CV...

[80] [81]

Xie, J., Yang, C., Xie, W., Zisserman, A.: Moving object segmentation: All you need is sam (and flow). In: Proceedings of the Asian conference on computer vision. pp. 162–178 (2024) S2
