Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Diankun Wu; Haoyu Wu; Jiang Bian; Junliang Guo; Tianyu He; Yang Ye; Yueqi Duan

arxiv: 2507.07982 · v2 · submitted 2025-07-10 · 💻 cs.CV · cs.AI


Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian

Pith reviewed 2026-05-19 05:10 UTC · model grok-4.3


keywords: video diffusion, 3D consistency, geometry alignment, world modeling, video generation, feature alignment


The pith

Aligning intermediate features of video diffusion models with geometric representations improves 3D consistency in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Videos are 2D projections of a 3D world, yet video diffusion models trained only on raw video data often fail to capture meaningful geometric structure in their internal representations. Geometry Forcing addresses this by adding two alignment objectives during training: angular alignment that matches directional information via cosine similarity, and scale alignment that preserves magnitude by regressing geometric features from the diffusion model's normalized representations. The method is tested on both camera-view conditioned and action-conditioned video generation. If the approach holds, the resulting models would produce videos with stronger visual quality and better 3D consistency across frames and viewpoints, moving closer to reliable world modeling from 2D data alone.

Core claim

The central claim is that guiding a video diffusion model's intermediate representations to align with features from a geometric foundation model causes the diffusion model to internalize 3D-aware structure. This alignment is implemented through Angular Alignment, which enforces directional consistency with cosine similarity, and Scale Alignment, which regresses geometric features from normalized diffusion representations. When applied to standard video generation tasks, the resulting models show substantially higher visual quality and 3D consistency than baselines trained without these objectives.
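The two objectives described above can be illustrated with a minimal sketch on toy vectors. Everything here is an assumption made for illustration rather than the paper's implementation: the function names, the per-token list layout, the use of one minus cosine similarity as the angular loss, and the linear head (`weight`, `bias`) with mean squared error for the scale term.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def angular_alignment_loss(diff_feats, geo_feats):
    """Mean of (1 - cosine similarity) over tokens (assumed form)."""
    total = 0.0
    for h, y in zip(diff_feats, geo_feats):
        h_hat, y_hat = l2_normalize(h), l2_normalize(y)
        total += 1.0 - sum(a * b for a, b in zip(h_hat, y_hat))
    return total / len(diff_feats)


def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    """MSE between geometric features and a linear map of the
    L2-normalized diffusion features (hypothetical linear head)."""
    total = 0.0
    for h, y in zip(diff_feats, geo_feats):
        h_hat = l2_normalize(h)
        pred = [sum(w * x for w, x in zip(row, h_hat)) + b
                for row, b in zip(weight, bias)]
        total += sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)
    return total / len(diff_feats)


# Toy example: same direction, different magnitude.
h = [[3.0, 4.0]]   # stand-in for a diffusion feature token
y = [[6.0, 8.0]]   # stand-in for a geometric feature token
identity = [[1.0, 0.0], [0.0, 1.0]]
times_ten = [[10.0, 0.0], [0.0, 10.0]]

angular = angular_alignment_loss(h, y)
scale_unfit = scale_alignment_loss(h, y, identity, [0.0, 0.0])
scale_fit = scale_alignment_loss(h, y, times_ten, [0.0, 0.0])
```

In this toy case the diffusion feature points in the same direction as the geometric feature but has a different norm, so the angular term is close to zero while the scale term with an unfitted head is not. That is the kind of magnitude information a cosine-only objective would discard.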

What carries the argument

Geometry Forcing, the training-time alignment of diffusion model intermediate representations with geometric foundation model features using angular and scale objectives.

If this is right

Video generation becomes more consistent across camera viewpoints and time steps.
The improvements apply to both camera-conditioned and action-conditioned generation settings.
Geometric awareness is added without altering the diffusion model's architecture or inference procedure.
Training on raw video alone appears insufficient for high-quality 3D-aware outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment principle might transfer to other generative models that operate on 2D projections of 3D scenes.
Stronger 3D consistency could support downstream uses such as simulation or planning that require reliable spatial understanding.
Longer video sequences might benefit most, as accumulated geometric errors would be reduced.

Load-bearing premise

That aligning diffusion features with geometric features will make the model internalize genuine 3D structure rather than superficial correlations.

What would settle it

Training identical video diffusion models with and without the two alignment objectives and finding no measurable difference in 3D consistency metrics such as multi-view reprojection error or temporal geometry stability.
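One of the metrics named above, multi-view reprojection error, can be sketched as follows. This is a generic pinhole-camera illustration, not the paper's evaluation protocol: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, and the assumption that 3D points and camera poses have already been recovered from the generated frames are all hypothetical.

```python
import math


def project(point, intrinsics, rotation, translation):
    """Project a world-space 3D point to pixel coordinates (pinhole model)."""
    # Camera-space coordinates: X_c = R @ X_w + t
    xc = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
          for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    fx, fy, cx, cy = intrinsics
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def mean_reprojection_error(points, observations, intrinsics, rotation, translation):
    """Mean pixel distance between projected 3D points and observed 2D locations."""
    errors = []
    for p, (u, v) in zip(points, observations):
        pu, pv = project(p, intrinsics, rotation, translation)
        errors.append(math.hypot(pu - u, pv - v))
    return sum(errors) / len(errors)


# Toy example: identity pose, two points, one observation off by (3, 4) pixels.
K = (100.0, 100.0, 50.0, 50.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
observed = [(50.0, 50.0), (103.0, 54.0)]

error = mean_reprojection_error(points, observed, K, R, t)
```

A comparison of the kind described would compute such an error for videos from models trained with and without the alignment objectives, under the same reconstruction pipeline, and check whether the two distributions differ.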

Figures

Figures reproduced from arXiv: 2507.07982 by Diankun Wu, Haoyu Wu, Jiang Bian, Junliang Guo, Tianyu He, Yang Ye, Yueqi Duan.

**Figure 1.** Geometry Forcing equips video diffusion models with 3D awareness. (a) We propose Geometry Forcing (GF), a simple yet effective paradigm to internalize geometric-aware structure into video diffusion models by aligning with features from a pretrained geometric foundation model, i.e., VGGT (Wang et al., 2025). (b) Compared to the baseline method (Song et al., 2025), our method produces more consistent generat…

**Figure 2.** Qualitative comparison of camera view-conditioned video generation under full-circle rotation. Videos are generated from a single input frame and corresponding per-frame camera poses simulating a full 360° rotation. Our method (GF) is compared with DFoT (Song et al., 2025), VideoREPA (Zhang et al., 2025c), and REPA (Zhang et al., 2025c). The results demonstrate that the baseline methods fail to maintain te…

**Figure 3.** Ablation study on alignment depth. We present FVD-256 and FVD-16 results for aligning VGGT to different layers of the diffusion model. The results suggest that mid-level feature alignment is most effective for improving long-term video quality.

**Figure 5.** Qualitative comparisons on camera-conditioned video generation. All videos are generated given the first frame and per-frame camera poses. We comprehensively compare GF (ours) with DFoT (Song et al., 2025), VideoREPA (Zhang et al., 2025c), and REPA (Zhang et al., 2025c). The results demonstrate consistency in long-term video generation for both inside (left) and outside (right) scenes.
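
For readers unfamiliar with the FVD metric reported in Figure 3: FVD is conventionally computed as the Fréchet distance between two Gaussians fitted to features of real and generated videos, with lower values indicating closer distributions. The notation below is chosen here for illustration ($\mu_r, \Sigma_r$ for the mean and covariance of real-video features, $\mu_g, \Sigma_g$ for generated-video features), and the source does not spell out what the -256 and -16 suffixes denote.

```latex
\mathrm{FVD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```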

Original abstract

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Geometry Forcing adds cosine and regression alignment losses to push geometric features into video diffusion intermediates, but the abstract gives no numbers to show whether this changes the actual generation process.


The central idea is to align intermediate features from a video diffusion model with outputs from a geometric foundation model using two losses: angular alignment via cosine similarity on normalized features, and scale alignment via regression. This targets a known problem: standard video diffusion training on 2D data often misses the underlying 3D structure needed for consistent world modeling under new views or actions. The approach is straightforward and applies the alignment during training without substantially changing the core diffusion architecture. That is the main novelty here: earlier representation alignment work exists, but this pairing is aimed specifically at 3D consistency in generative video.

The paper does a reasonable job of laying out the motivation and of testing on both camera-view and action-conditioned tasks, which are relevant settings for checking consistency. If the full experiments include proper baselines, ablations on each loss, and metrics for visual quality plus 3D consistency (such as depth error or novel-view stability), the method could be a practical addition for people building simulation or robotics video models.

The soft spot is that the abstract claims substantial gains without reporting any quantitative results, dataset details, or statistical comparisons, which leaves the central claim hard to evaluate. A second concern also holds: feature alignment at intermediate layers might produce correlated outputs without actually reshaping the denoising trajectory to respect 3D geometry in unseen motions.

This paper is mainly for computer vision researchers working on video generation and 3D-aware models. A reader looking for incremental training tricks in diffusion would find it worth scanning once the numbers are in. It has enough of a concrete proposal and clear motivation to deserve a serious referee rather than a desk reject, though the review would need to focus on whether the results demonstrate real propagation to generation dynamics.

Referee Report

2 major / 1 minor

Summary. The paper proposes Geometry Forcing to address the limitation that video diffusion models trained on raw video data fail to capture meaningful geometric structure. It aligns intermediate diffusion representations with features from a geometric foundation model via two objectives: Angular Alignment (cosine similarity on normalized features for directional consistency) and Scale Alignment (regression to preserve scale information). The approach is evaluated on camera-view conditioned and action-conditioned video generation tasks, claiming substantial improvements in visual quality and 3D consistency over baselines.

Significance. If the alignment objectives can be shown to reshape the learned denoising trajectory such that generated videos respect 3D structure under novel camera motions and actions, the method would offer a lightweight way to inject geometric awareness into video diffusion models. This could advance consistent world modeling for downstream applications in simulation and robotics, provided the gains are not reducible to superficial feature correlations.

major comments (2)

[Abstract] Abstract: The central claim that 'Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods' is unsupported by any reported quantitative metrics, baseline descriptions, dataset details, or statistical significance tests, making it impossible to assess whether the results validate the claim.
[Method] Method section (alignment objectives): Angular Alignment (cosine similarity) and Scale Alignment (regression) are applied to intermediate representations, but no analysis or ablation demonstrates that these losses propagate into the generative sampling process to enforce 3D-consistent dynamics under novel conditions rather than producing superficial correlations with the external geometric model outputs.

minor comments (1)

[Method] The paper would benefit from explicit layer indices or feature extraction details when aligning with the geometric foundation model to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes made to the manuscript.


Referee: [Abstract] Abstract: The central claim that 'Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods' is unsupported by any reported quantitative metrics, baseline descriptions, dataset details, or statistical significance tests, making it impossible to assess whether the results validate the claim.

Authors: We agree that the abstract would benefit from more concrete support for its claims. The full manuscript presents quantitative metrics, baseline comparisons (including standard video diffusion models without geometry alignment), dataset specifications for the camera-view and action-conditioned tasks, and evaluation protocols in the experiments section. To directly address this point, we have revised the abstract to reference key quantitative improvements and the evaluation setup while maintaining its concise nature.
Revision: yes

Referee: [Method] Method section (alignment objectives): Angular Alignment (cosine similarity) and Scale Alignment (regression) are applied to intermediate representations, but no analysis or ablation demonstrates that these losses propagate into the generative sampling process to enforce 3D-consistent dynamics under novel conditions rather than producing superficial correlations with the external geometric model outputs.

Authors: This comment correctly identifies a gap in mechanistic analysis. Our current results demonstrate improved 3D consistency on novel camera motions and actions, which provides indirect evidence that the alignment influences generation beyond superficial correlations. However, we acknowledge that explicit analysis of the denoising trajectory would strengthen the paper. We have added ablations and visualizations in the revised manuscript examining how the objectives affect intermediate sampling steps and feature propagation during generation.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in proposed alignment method


The paper introduces Geometry Forcing as a training technique that aligns intermediate representations of a video diffusion model with features extracted from an external geometric foundation model, using two explicit objectives (Angular Alignment via cosine similarity and Scale Alignment via regression). These objectives are defined independently of the target generative outputs and are evaluated through separate experiments on camera-conditioned and action-conditioned video generation. No derivation step reduces a claimed prediction or 3D consistency result to a quantity defined in terms of the alignment losses themselves, nor does any central claim rely on a self-citation chain or uniqueness theorem imported from prior author work. The method remains self-contained against the external geometric model and experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that geometric foundation model features encode transferable 3D structure that can be injected into diffusion representations via alignment losses. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Aligning diffusion model intermediate representations with geometric foundation model features will produce geometry-aware structure inside the diffusion model.
This is the key insight stated in the abstract that motivates the two alignment objectives.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear


we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geo-Align: Video Generation Alignment via Metric Geometry Reward
cs.CV 2026-05 unverdicted novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R
cs.CV 2026-05 unverdicted novelty 7.0

Trust3R introduces a gated residual refinement plus Normal-Inverse-Wishart evidential head that produces closed-form multivariate Student-t uncertainty for per-point geometry in feed-forward 3D reconstruction and impr...
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
cs.CV 2026-05 unverdicted novelty 7.0

3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
Learning Visual Feature-Based World Models via Residual Latent Action
cs.CV 2026-05 unverdicted novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
TORA: Topological Representation Alignment for 3D Shape Assembly
cs.CV 2026-04 unverdicted novelty 7.0

TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmar...
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Improved Baselines with Representation Autoencoders
cs.CV 2026-05 conditional novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
cs.CV 2026-05 unverdicted novelty 6.0

M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
cs.CV 2026-04 unverdicted novelty 6.0

CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
Lyra 2.0: Explorable Generative 3D Worlds
cs.CV 2026-04 unverdicted novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
cs.CV 2026-04 unverdicted novelty 6.0

Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 5.0

GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

World-R1 applies RL via Flow-GRPO on a new text dataset for world simulation to enforce 3D constraints in video generation while preserving visual quality.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

World-R1 uses Flow-GRPO reinforcement learning and a new text dataset to enforce 3D consistency in text-to-video generation while keeping the original model's visual quality.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
cs.CV 2026-04 unverdicted novelty 4.0

World-R1 uses RL with 3D model feedback and a new text dataset to improve geometric consistency in text-to-video generation while keeping the base model unchanged.

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 15 Pith papers · 21 internal anchors

[2]

Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945, 2025

Aether, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, et al. Aether: Geometric-aware unified world modeling. arXiv preprint arXiv:2503.18945, 2025

work page arXiv 2025
[3]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos J Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757-58791, 2024

work page 2024
[5]

Tc4d: Trajectory-conditioned text-to-4d generation

Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, et al. Tc4d: Trajectory-conditioned text-to-4d generation. In European Conference on Computer Vision, pp. 53-72. Springer, 2024a

work page 2024
[6]

4d-fy: Text-to-4d generation using hybrid score distillation sampling

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7996-8006, 2024b

work page 2024
[7]

Video pretraining (vpt): Learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639-24654, 2022

work page 2022
[8]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22669-22679, 2023

work page 2023
[9]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15791-15801, 2025

work page 2025
[10]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877-1901, 2020

work page 2020
[11]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[12]

Diffusion forcing: Next-token prediction meets full-sequence diffusion

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081-24125, 2024a

work page 2024
[13]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024b

work page arXiv 2024
[15]

Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis

Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, and Chongxuan Li. Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis. arXiv preprint arXiv:2503.13265, 2025

work page arXiv 2025
[16]

Playing with transformer at 30+ fps via next-frame diffusion. arXiv preprint arXiv:2506.01380, 2025

Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, and Jiang Bian. Playing with transformer at 30+ fps via next-frame diffusion. arXiv preprint arXiv:2506.01380, 2025

work page arXiv 2025
[17]

Luciddreamer: Domain-free generation of 3d gaussian splatting scenes

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023

work page arXiv 2023
[18]

Oasis: A universe in a transformer

Decart, Quevedo Julian, McIntyre Quinn, Campbell Spruce, Chen Xinlei, and Wachen Robert. Oasis: A universe in a transformer. 2024. URL https://oasis-model.github.io/

work page 2024
[19]

Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025

work page arXiv 2025
[20]

Institutionum calculi integralis, volume 4

Leonhard Euler. Institutionum calculi integralis, volume 4. impensis Academiae imperialis scientiarum, 1845

work page
[21]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. arXiv preprint arXiv:2505.20279, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Driv3R: Learning dense 4d reconstruction for autonomous driving. arXiv preprint arXiv:2412.06777, 2024

Xin Fei, Wenzhao Zheng, Yueqi Duan, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Jiwen Lu. Driv3r: Learning dense 4d reconstruction for autonomous driving. ArXiv, abs/2412.06777, 2024. URL https://api.semanticscholar.org/CorpusID:274610426

work page arXiv 2024
[23]

The matrix: Infinite-horizon world generation with real-time moving control

Ruili Feng, Han Zhang, Zhantao Yang, Jie Xiao, Zhilei Shu, Zhiheng Liu, Andy Zheng, Yukun Huang, Yu Liu, and Hongyang Zhang. The matrix: Infinite-horizon world generation with real-time moving control. arXiv preprint arXiv:2412.03568, 2024

work page arXiv 2024
[24]

Maskflow: Discrete flows for flexible and efficient long video generation

Michael Fuest, Vincent Tao Hu, and Björn Ommer. Maskflow: Discrete flows for flexible and efficient long video generation. arXiv preprint arXiv:2502.11234, 2025

work page arXiv 2025
[25]

An introduction to ray tracing

Andrew S Glassner. An introduction to ray tracing. Morgan Kaufmann, 1989

work page 1989
[26]

Google. Veo 3. https://deepmind.google/models/veo/, 2025

work page 2025
[27]

Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

work page arXiv 2025
[28]

World Models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729-9738, 2020

work page 2020
[31]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

work page 2017
[32]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840-6851, 2020

work page 2020
[33]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

[34]

3drs: Mllms need 3d-aware representation supervision for scene understanding

Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv preprint arXiv:2506.01946, 2025a

[35]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025b

[36]

Geo4d: Leveraging video generators for geometric 4d scene reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction. arXiv preprint arXiv:2504.07961, 2025

[37]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023

[38]

Videopoet: A large language model for zero-shot video generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In International Conference on Machine Learning, pp. 25105–25124. PMLR, 2024

[39]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

[40]

Vividdream: Generating 3d scene with ambient dynamics

Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y Feng, and Jia-Bin Huang. Vividdream: Generating 3d scene with ambient dynamics. arXiv preprint arXiv:2405.20334, 2024

[41]

T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. arXiv preprint arXiv:2405.18750, 2024

[42]

Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[43]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

[44]

Videodpo: Omni-preference alignment for video diffusion generation

Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8009–8019, 2025a

[45]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023

[46]

Slam3r: Real-time dense scene reconstruction from monocular rgb videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 16651–16662, 2025b

[47]

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. arXiv preprint arXiv:2505.12549, 2025. URL https://api.semanticscholar.org/CorpusID:278739766

[48]

Can video diffusion model reconstruct 4d geometry?

Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, and Bernard Ghanem. Can video diffusion model reconstruct 4d geometry? arXiv preprint arXiv:2503.21082, 2025

[49]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

[50]

Giraffe: Representing scenes as compositional generative neural feature fields

Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11453–11464, 2021

[51]

OpenAI. Sora. https://openai.com/index/sora/, 2024

[52]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

[53]

Genie 2: A large-scale foundation world model

J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model. URL: https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024

[54]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023

[55]

Unidepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10106–10116, 2024

[56]

Long-context state-space video world models

Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. arXiv preprint arXiv:2505.20171, 2025

[57]

Movie Gen: A Cast of Media Foundation Models

A Polyak, A Zohar, A Brown, A Tjandra, A Sinha, A Lee, A Vyas, B Shi, CY Ma, CY Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

[58]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

[59]

Dreambooth3d: Subject-driven text-to-3d generation

Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2349–2359, 2023

[60]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12179–12188, 2021

[61]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022

[62]

Wham: Reconstructing world-grounded humans with accurate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2070–2080, 2024

[63]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024

[64]

History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025

[65]

Ar-diffusion: Asynchronous video generation with auto-regressive diffusion

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7364–7373, 2025

[66]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021

[67]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

[68]

Diffusion Models Are Real-Time Game Engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

[69]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

[70]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[71]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[72]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14549–14560, 2023

[73]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[74]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20697–20709, 2024

[75]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004

[76]

A learning algorithm for continually running fully recurrent neural networks

Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989

[77]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025a

[78]

4d-fly: Fast 4d reconstruction from a single monocular video

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yue Qian, Xiaohang Zhan, and Yueqi Duan. 4d-fly: Fast 4d reconstruction from a single monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 16663–16673, June 2025b

[79]

Video world models with long-term spatial memory

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025c

[80]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 21469–21480, 2025



[27] [27]

Mineworld: a real-time and open-source interactive world model on minecraft.arXiv preprint arXiv:2504.08388, 2025

Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, and Jiang Bian. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388, 2025

work page arXiv 2025

[28] [28]

World Models

David Ha and J \"u rgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 9729--9738, 2020

work page 2020

[31] [31]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

work page 2017

[32] [32]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33: 0 6840--6851, 2020

work page 2020

[33] [33]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

3drs: Mllms need 3d-aware representation supervision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation supervision for scene understanding. arXiv preprint arXiv:2506.01946, 2025 a

work page arXiv 2025

[35] [35]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Geo4d: Leveraging video generators for geometric 4d scene reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction. arXiv preprint arXiv:2504.07961, 2025

work page arXiv 2025

[37] [37]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk \"u hler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42 0 (4): 0 1--14, 2023

work page 2023

[38] [38]

Videopoet: A large language model for zero-shot video generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In International Conference on Machine Learning, pp.\ 25105--25124. PMLR, 2024

work page 2024

[39] [39]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Vividdream: Generating 3d scene with ambient dynamics

Yao-Chih Lee, Yi-Ting Chen, Andrew Wang, Ting-Hsuan Liao, Brandon Y Feng, and Jia-Bin Huang. Vividdream: Generating 3d scene with ambient dynamics. arXiv preprint arXiv:2405.20334, 2024

work page arXiv 2024

[41] [41]

T2v- turbo: Breaking the quality bottleneck of video consis- tency model with mixed reward feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. arXiv preprint arXiv:2405.18750, 2024

work page arXiv 2024

[42] [42]

Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[43] [43]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[44] [44]

Videodpo: Omni-preference alignment for video diffusion generation

Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. Videodpo: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 8009--8019, 2025 a

work page 2025

[45] [45]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[46] [46]

Slam3r: Real-time dense scene reconstruction from monocular rgb videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 16651--16662, 2025 b

work page 2025

[47] [47]

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. ArXiv, abs/2505.12549, 2025. URL https://api.semanticscholar.org/CorpusID:278739766

work page internal anchor Pith review arXiv 2025

[48] [48]

Can video diffusion model reconstruct 4d geometry? arXiv preprint arXiv:2503.21082, 2025

Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, J \"u rgen Schmidhuber, and Bernard Ghanem. Can video diffusion model reconstruct 4d geometry? arXiv preprint arXiv:2503.21082, 2025

work page arXiv 2025

[49] [49]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65 0 (1): 0 99--106, 2021

work page 2021

[50] [50]

Giraffe: Representing scenes as compositional generative neural feature fields

Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 11453--11464, 2021

work page 2021

[51] [51]

OpenAI. Sora. https://openai.com/index/sora/, 2024

work page 2024

[52] [52]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Genie 2: A large-scale foundation world model

J Parker-Holder, P Ball, J Bruce, V Dasagi, K Holsheimer, C Kaplanis, A Moufarek, G Scully, J Shar, J Shi, et al. Genie 2: A large-scale foundation world model. URL: https://deepmind. google/discover/blog/genie-2-a-large-scale-foundation-world-model, 2024

work page 2024

[54] [54]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 4195--4205, 2023

work page 2023

[55] [55]

Unidepth: Universal monocular metric depth estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10106--10116, 2024

work page 2024

[56] [56]

Long-context state-space video world models

Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, and Xun Huang. Long-context state-space video world models. arXiv preprint arXiv:2505.20171, 2025

work page arXiv 2025

[57] [57]

Movie Gen: A Cast of Media Foundation Models

A Polyak, A Zohar, A Brown, A Tjandra, A Sinha, A Lee, A Vyas, B Shi, CY Ma, CY Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

work page 2019

[59] [59]

Dreambooth3d: Subject-driven text-to-3d generation

Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2349--2359, 2023

work page 2023

[60] [60]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 12179--12188, 2021

work page 2021

[61] [61]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684--10695, 2022

work page 2022

[62] [62]

Wham: Reconstructing world-grounded humans with accurate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2070--2080, 2024

work page 2024

[63] [63]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[64] [64]

History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion. arXiv preprint arXiv:2502.06764, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[65] [65]

Ar-diffusion: Asynchronous video generation with auto-regressive diffusion

Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, and Jing Liu. Ar-diffusion: Asynchronous video generation with auto-regressive diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7364--7373, 2025

work page 2025

[66] [66]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558--16569, 2021

work page 2021

[67] [67]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[68] [68]

Diffusion Models Are Real-Time Game Engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[69] [69]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

work page 2017

[70] [70]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[71] [71]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[72] [72]

Videomae v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14549--14560, 2023

work page 2023

[73] [73]

Continuous 3d perception model with persistent state

Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[74] [74]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20697--20709, 2024

work page 2024

[75] [75]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600--612, 2004

work page 2004

[76] [76]

A learning algorithm for continually running fully recurrent neural networks

Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270--280, 1989

work page 1989

[77] [77]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[78] [78]

4d-fly: Fast 4d reconstruction from a single monocular video

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yue Qian, Xiaohang Zhan, and Yueqi Duan. 4d-fly: Fast 4d reconstruction from a single monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 16663--16673, June 2025b

work page 2025

[79] [79]

Video world models with long-term spatial memory

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025c

work page arXiv 2025

[80] [80]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 21469--21480, 2025

work page 2025