pith. sign in

arxiv: 2606.02491 · v1 · pith:PMRVPZJVnew · submitted 2026-06-01 · 💻 cs.CV

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

Pith reviewed 2026-06-28 15:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive 4D generationtemporal structured latentsdynamic 3D assetstemporal consistency3D Gaussiansradiance fieldsmeshesvideo to 4D
0
0 comments X

The pith

MORPHOS autoregressively generates dynamic 3D assets from videos using temporal structured latents that support multiple representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MORPHOS, an autoregressive framework for creating dynamic 3D assets from videos that works with meshes, 3D Gaussians, and radiance fields. It proposes Temporal Structured Latents (T-SLAT) as a unified representation encoding geometry and appearance over time. This setup uses causal attention so each new frame builds on the previous history, which helps keep things consistent even when topologies change. A temporal-structural augmentation is added to prevent errors from building up during long generations. A reader might care because this offers a single approach to long video-based 4D generation without the usual limits on representation or duration.

Core claim

MORPHOS is a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields, by introducing the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension, and leveraging it with causal attention to condition each frame on its preceding history for temporal consistency while handling evolving topologies, along with a temporal-structural augmentation to mitigate error accumulation.

What carries the argument

Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension to enable autoregressive generation with causal attention.

If this is right

  • State-of-the-art performance in appearance across multiple benchmarks.
  • Competitive results in geometry.
  • Superior generalization across various 3D representations.
  • Robustness in long-horizon generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The causal conditioning could support interactive applications where users modify scenes mid-generation.
  • Similar latent structures might apply to other sequential data like molecular dynamics or weather modeling.
  • Reducing error accumulation opens the door to even longer sequence generation without retraining.
  • Unified representation may simplify pipelines that currently switch between different 3D formats.

Load-bearing premise

The Temporal Structured Latents can jointly encode geometry and appearance along the temporal dimension to support autoregressive generation across meshes, Gaussians, and radiance fields.

What would settle it

If generated assets show accumulating inconsistencies or fail to handle topology changes in long video sequences on the benchmarks, the effectiveness of the T-SLAT and autoregressive approach would be questioned.

Figures

Figures reproduced from arXiv: 2606.02491 by Jaeyeong Kim, Jinhyeok Choi, Jongmin Lee, Minkyung Kwon, Seungryong Kim, Youngjin Shin.

Figure 1
Figure 1. Figure 1: Teaser. Given video inputs, MORPHOS autoregressively generates unified dynamic 3D representations (meshes, 3D Gaussians, and radiance fields). Abstract We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaus￾sians, and radiance fields. Existing methods are typically limited to a single representation, strugg… view at source ↗
Figure 2
Figure 2. Figure 2: Autoregressive streaming [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: T-SLAT generation. T-SLAT provides a unified representation for dynamic 3D generation. To enable AR generation, we employ a causal attention architecture over frames. The temporal augmentation applies different noise timesteps across frames for robust long-horizon modeling, while spatial augmentation uses voxel dropout to improve robustness against structural errors. fine-grained geometry and appearance fr… view at source ↗
Figure 3
Figure 3. Figure 3: Sparse structure generation. We introduce T-SLAT (Temporal Structured Latents) by extending SLAT into the temporal domain, which can be decoded into versatile dynamic 3D representations. We define an animated 3D asset, O1:T = {Ot} T t=1, where Ot = ⟨Vt , F t , T t ⟩ is a mesh at time t with vertices V t , faces F t , and texture T t . We then encode this mesh sequence into the T-SLAT z 1:T = {z t} T t=1, w… view at source ↗
Figure 5
Figure 5. Figure 5: Causal attention. We implement an autoregressive generation process using a causal attention architecture with the local window of size w. This architec￾ture ensures that the receptive field for a query at frame t is strictly limited to the preceding history frames within the local window. To capture the temporal ordering of token sequences, we apply 1D Rotary Position Embeddings (RoPE) [39] along the temp… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results on appearance. MORPHOS produces visually consistent and high￾fidelity appearance, maintaining stable textures throughout the entire sequence. 5.3 Results Quantitative results. We report quantitative results in Tables 2 to 4. For per-frame appearance, MORPHOS achieves state-of-the-art performance across almost all benchmarks [5, 19, 37]. In particular, MORPHOS obtains the highest CLIP sc… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results on geometry. MORPHOS produces stable and coherent geometry, maintaining accurate shape and temporal consistency throughout the sequence [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Error accum. analysis. TRELLIS [43] fails to preserve temporal consistency due to its independent frame-wise generation process. In contrast, MORPHOS produces high-fidelity, temporally coherent appearances while seamlessly handling evolving topologies [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation on ‘w/o ΦL training’. Without ΦL training, MORPHOS leads to noticeable color degradation and texture inconsistency. Frame 1 Frame 2 Input Video Scene 1 Scene 2 Frame 3 T-SLAT Attention Visualization Frame 3 (Query) Frame 1 Frame 2 Frame 3 : Query voxel Weight : Query voxel Weight [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Attention analysis in T-SLAT space. For a query voxel (green star) in the target frame, we visualize its attention weights over voxel tokens of the conditioning frames. MORPHOS captures geometric correspondences in the attention map and even exploits symmetry across frames, supporting temporally consistent voxel generation. 6 Analysis Attention visualization. To analyze how MORPHOS encodes motion, we visu… view at source ↗
Figure 11
Figure 11. Figure 11: Long-horizon error accumulation analysis. We compare the temporal evolution of geometric and appearance metrics across frames. Our method maintains consistently stable performance over time with minimal degradation, demonstrating strong robustness against long-term error accumulation compared to prior methods. We study the effect of video length on generation quality to analyze error accumulation in long … view at source ↗
Figure 12
Figure 12. Figure 12: SS stage with Shortcut fine-tuning. Without shortcut fine-tuning, the model produces noticeable artifacts. In contrast, applying shortcut loss fine-tuning significantly reduces these artifacts and improves output quality. We evaluate the effect of reducing the number of denoising steps for the ΦS stage using the proposed shortcut model [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Additional qualitative results on appearance. We compare rendered videos of MOR￾PHOS against baselines [5, 20, 49, 36, 43] on additional input sequences. MORPHOS preserves fine-grained texture details and maintains temporally stable appearance across frames. improvements hold across diverse cases rather than being specific to the sequences shown in the main paper. Novel view video generation. Given an inp… view at source ↗
Figure 14
Figure 14. Figure 14: Additional qualitative results on geometry. Rendered normal maps of generated mesh sequences from MORPHOS and baselines [5, 20, 37, 43]. MORPHOS generates meshes with sharp surface details and consistent geometry throughout the sequence. Social impact. The proposed framework contributes to the advancement of 4D content generation, enabling more flexible and efficient creation of dynamic 3D assets from vid… view at source ↗
Figure 15
Figure 15. Figure 15: Novel view video generation. Given an input video, MORPHOS reconstructs a dynamic 4D representation, enabling high-quality rendering from novel viewpoints with consistent geometry and appearance over time. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Generalization to real-world domain [33]. Given real-world videos from DAVIS, MORPHOS generates temporally consistent 4D results with stable geometry and coherent appearance. Our method further enables novel-view video generation. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗
read the original abstract

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces MORPHOS, an autoregressive framework for generating dynamic 3D assets from videos across representations including meshes, 3D Gaussians, and radiance fields. It proposes Temporal Structured Latents (T-SLAT) as a unified 4D representation jointly encoding geometry and appearance along the temporal dimension. The method generates assets via causal attention, conditioning each frame on preceding history for temporal consistency while handling evolving topologies, and introduces a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. It claims state-of-the-art appearance performance and competitive geometry results across multiple benchmarks, with superior generalization and robustness for long-horizon generation.

Significance. If the empirical claims are substantiated with detailed benchmarks and ablations, the work would be significant for unifying disparate 3D representations under a single autoregressive 4D model and for enabling consistent long-sequence generation with topological changes, which are open challenges in dynamic 3D content creation.

major comments (1)
  1. [Abstract] Abstract: The central claims of SOTA appearance performance and competitive geometry results are stated without any reference to specific quantitative metrics, baselines, benchmark datasets, or result tables, preventing verification of whether the data actually support the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for identifying this issue with the abstract. We respond to the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of SOTA appearance performance and competitive geometry results are stated without any reference to specific quantitative metrics, baselines, benchmark datasets, or result tables, preventing verification of whether the data actually support the claims.

    Authors: We agree that the abstract presents the performance claims at a high level without explicit pointers to the supporting quantitative evidence. The body of the manuscript (Section 4) contains the detailed comparisons, including specific metrics, baselines, and benchmark datasets, along with the corresponding result tables. To improve verifiability directly from the abstract, we will revise it to include concise references to the relevant benchmarks and tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description introduce T-SLAT as a new unified 4D representation and MORPHOS as an autoregressive framework using causal attention and temporal-structural augmentation. No equations, derivations, or load-bearing steps are shown that reduce predictions to fitted inputs, self-definitions, or self-citation chains. Performance claims are presented as empirical results on external benchmarks, with no indication that any central result is equivalent to its inputs by construction. The derivation chain appears self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Based solely on the abstract, the central addition is the invented representation T-SLAT; no free parameters, standard axioms, or other invented entities are described.

invented entities (1)
  • Temporal Structured Latents (T-SLAT) no independent evidence
    purpose: Unified 4D representation that jointly encodes geometry and appearance along the temporal dimension for autoregressive generation across multiple 3D formats.
    Core new component introduced to address limitations of existing methods; no independent evidence provided in the abstract.

pith-pipeline@v0.9.1-grok · 5709 in / 1227 out tokens · 29661 ms · 2026-06-28T15:01:24.552086+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 22 linked inside Pith

  1. [1]

    Method for registration of 3-d shapes

    Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. InSensor fusion IV: control paradigms and data structures, volume 1611, pages 586–606. Spie, 1992

  2. [2]

    Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  3. [3]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  4. [4]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

  5. [5]

    Motion 3-to-4: 3d motion reconstruction for 4d synthesis.arXiv preprint arXiv:2601.14253, 2026

    Hongyuan Chen, Xingyu Chen, Youjia Zhang, Zexiang Xu, and Anpei Chen. Motion 3-to-4: 3d motion reconstruction for 4d synthesis.arXiv preprint arXiv:2601.14253, 2026

  6. [6]

    Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

    Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025

  7. [7]

    Ultra3d: Efficient and high-fidelity 3d generation with part attention.arXiv preprint arXiv:2507.17745, 2025

    Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, and Guosheng Lin. Ultra3d: Efficient and high-fidelity 3d generation with part attention.arXiv preprint arXiv:2507.17745, 2025

  8. [8]

    Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl V ondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023

  9. [9]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022

  10. [10]

    One step diffusion via shortcut models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024

  11. [11]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.arXiv preprint arXiv:2306.09344, 2023

  12. [12]

    Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing.arXiv preprint arXiv:2411.16375, 2024

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing.arXiv preprint arXiv:2411.16375, 2024

  13. [13]

    Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025

    Yuchao Gu, Weijia Mao, and Mike Zheng Shou. Long-context autoregressive video modeling with next-frame prediction.arXiv preprint arXiv:2503.19325, 2025

  14. [14]

    Sparseflex: High-resolution and arbitrary-topology 3d shape modeling

    Xianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, and Yangguang Li. Sparseflex: High-resolution and arbitrary-topology 3d shape modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14822–14833, 2025

  15. [15]

    Video diffusion models.arXiv:2204.03458, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.arXiv:2204.03458, 2022

  16. [16]

    Cupid: Generative 3d reconstruction via joint object and pose modeling.arXiv preprint arXiv:2510.20776, 2025

    Binbin Huang, Haobin Duan, Yiqun Zhao, Zibo Zhao, Yi Ma, and Shenghua Gao. Cupid: Generative 3d reconstruction via joint object and pose modeling.arXiv preprint arXiv:2510.20776, 2025

  17. [17]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

  18. [18]

    AniGen: Unified S3 fields for animatable 3d asset generation.arXiv preprint arXiv:2604.08746, 2026

    Yi-Hua Huang, Zi-Xin Zou, Yuting He, Chirui Chang, Cheng-Feng Pu, Ziyi Yang, Yuan-Chen Guo, Yan-Pei Cao, and Xiaojuan Qi. AniGen: Unified S3 fields for animatable 3d asset generation.arXiv preprint arXiv:2604.08746, 2026

  19. [19]

    Consistent4d: Consistent 360{\deg} dynamic object generation from monocular video.arXiv preprint arXiv:2311.02848, 2023

    Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4d: Consistent 360{\deg} dynamic object generation from monocular video.arXiv preprint arXiv:2311.02848, 2023. 22

  20. [20]

    Mesh4d: 4d mesh reconstruc- tion and tracking from monocular video.arXiv preprint arXiv:2601.05251, 2026

    Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Mesh4d: 4d mesh reconstruc- tion and tracking from monocular video.arXiv preprint arXiv:2601.05251, 2026

  21. [21]

    Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

  22. [22]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  23. [23]

    Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner

    Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman3d: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979, 2024

  24. [24]

    Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets

    Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025

  25. [25]

    Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  26. [26]

    Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  27. [27]

    Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

  28. [28]

    Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  29. [29]

    Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  30. [30]

    Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021

  31. [31]

    Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  32. [32]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021

  33. [33]

    The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017

  34. [34]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  35. [35]

    Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  36. [36]

    L4gm: Large 4d gaussian reconstruction model.Advances in Neural Information Processing Systems, 37:56828–56858, 2024

    Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung W Kim, et al. L4gm: Large 4d gaussian reconstruction model.Advances in Neural Information Processing Systems, 37:56828–56858, 2024

  37. [37]

    Actionmesh: Animated 3d mesh generation with temporal 3d diffusion.arXiv preprint arXiv:2601.16148, 2026

    Remy Sabathier, David Novotny, Niloy J Mitra, and Tom Monnier. Actionmesh: Animated 3d mesh generation with temporal 3d diffusion.arXiv preprint arXiv:2601.16148, 2026

  38. [38]

    History- guided video diffusion.arXiv preprint arXiv:2502.06764, 2025

    Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History- guided video diffusion.arXiv preprint arXiv:2502.06764, 2025. 23

  39. [39]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  40. [40]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019

  41. [41]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  42. [42]

    Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

    Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, et al. Native and compact structured latents for 3d generation.arXiv preprint arXiv:2512.14692, 2025

  43. [43]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21469–21480, 2025

  44. [44]

    Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  45. [45]

    Shapegen4d: Towards high quality 4d shape generation from videos.arXiv preprint arXiv:2510.06208, 2025

    Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A Yeh, Peter Wonka, and Chaoyang Wang. Shapegen4d: Towards high quality 4d shape generation from videos.arXiv preprint arXiv:2510.06208, 2025

  46. [46]

    Sculpt4d: Generating 4d shapes via sparse-attention diffusion transformers.arXiv preprint arXiv:2604.21592, 2026

    Minghao Yin, Wenbo Hu, Jiale Xu, Ying Shan, and Kai Han. Sculpt4d: Generating 4d shapes via sparse-attention diffusion transformers.arXiv preprint arXiv:2604.21592, 2026

  47. [47]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. 2025

  48. [48]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Transactions On Graphics (TOG), 42(4):1–16, 2023

    Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Transactions On Graphics (TOG), 42(4):1–16, 2023

  49. [49]

    Gaussian variation field diffusion for high-fidelity video-to-4d synthesis

    Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, and Baining Guo. Gaussian variation field diffusion for high-fidelity video-to-4d synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12502–12513, 2025

  50. [50]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets.ACM Transactions on Graphics (TOG), 43(4):1–20, 2024

    Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets.ACM Transactions on Graphics (TOG), 43(4):1–20, 2024

  51. [51]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  52. [52]

    Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation.arXiv preprint arXiv:2501.12202, 2025

    Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation.arXiv preprint arXiv:2501.12202, 2025. 24