Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text

Ines Kim; Jaeyeong Kim; Jahyeok Koo; Seungryong Kim

arxiv: 2606.05162 · v1 · pith:QGVOSZZWnew · submitted 2026-06-03 · 💻 cs.CV

Pith reviewed 2026-06-28 06:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords controllable 3D generation · dynamic shape generation · 3D trajectories · text conditioning · motion control · trajectory embedding · 3D animation

The pith

T2Mo generates dynamic 3D shapes whose motions follow input 3D trajectories while matching text semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces T2Mo, a feed-forward model that takes both 3D trajectories and text to produce controllable motions of 3D objects. Text alone leaves motion paths ambiguous, so trajectories give exact spatial instructions for selected points. The model produces motions that stay on those paths yet still carry the overall meaning described in the text. A dedicated embedding step turns any collection of trajectories into tokens tied to the object's shape so the whole surface moves consistently even when input points are sparse or uneven. This setup matters for tasks that need both precise path control and natural language intent, such as animation or simulation design.

Core claim

T2Mo is a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. It generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To handle arbitrary trajectory configurations, a shape-grounded trajectory embedding maps the input set into a shape-aware token set that covers the entire object. Quantitative and qualitative comparisons against text-based baselines and cascaded video-based baselines show higher faithfulness to the combined prompts along with greater expressiveness while maintaining motion quality.

What carries the argument

The shape-grounded trajectory embedding, which converts any set of 3D trajectories into a shape-aware token set covering the full object.

If this is right

Motions follow the supplied 3D trajectories at the selected points.
Motions also carry the global semantic content of the text prompt.
The same model works across dense, sparse, and uneven trajectory inputs.
Results exceed text-only generation and cascaded video pipelines in both faithfulness and expressiveness.
Motion quality remains comparable to prior methods while adding explicit spatial control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning approach could support interactive 3D editing tools where users drag paths and refine text descriptions in real time.
Robotics planning systems might use the model to translate natural-language task descriptions plus required end-effector paths into object motion sequences.
The embedding technique could be tested on other sparse 3D signals such as keypoint constraints or partial surface deformations.
Integration with physics engines would allow checking whether the generated motions remain physically plausible under the combined trajectory and text constraints.

Load-bearing premise

The shape-grounded trajectory embedding can turn arbitrary, possibly sparse or uneven trajectory sets into tokens that cover the whole object without artifacts or loss of motion fidelity.
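To make the premise concrete, here is a minimal sketch of one way sparse trajectories could be spread over a whole shape. It is not the paper's embedding, whose mechanism is not described in the material reviewed here; it swaps in plain inverse-distance weighting, and the names `spread_trajectories`, `anchors`, `starts`, and `displacements` are hypothetical.

```python
import math


def spread_trajectories(anchors, starts, displacements, eps=1e-8):
    """Blend sparse trajectory displacements onto every anchor point.

    anchors:       list of (x, y, z) points sampled over the shape (hypothetical).
    starts:        list of (x, y, z) start positions, one per input trajectory.
    displacements: displacements[n][t] is the (dx, dy, dz) offset of
                   trajectory n at frame t, relative to its start.
    Returns tokens[k][t], the inverse-distance-weighted displacement at anchor k.
    """
    num_frames = len(displacements[0])
    tokens = []
    for anchor in anchors:
        # Inverse-squared-distance weights: nearby trajectories dominate.
        weights = [1.0 / (math.dist(anchor, s) ** 2 + eps) for s in starts]
        total = sum(weights)
        per_frame = []
        for t in range(num_frames):
            blended = tuple(
                sum(w * displacements[n][t][axis] for n, w in enumerate(weights)) / total
                for axis in range(3)
            )
            per_frame.append(blended)
        tokens.append(per_frame)
    return tokens
```

With a single input trajectory, every anchor inherits that trajectory's motion. This shows how naive spreading can cover the object while producing implausible motion far from the control points, which is the failure mode the premise has to rule out.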

What would settle it

Run the model on sparse, uneven trajectories and measure whether generated point paths stay within a small distance of the input trajectories; large systematic deviations or user preference for a baseline method would falsify the claim.
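The proposed check can be sketched as a simple deviation metric. The snippet below is an illustration under stated assumptions, not the paper's evaluation protocol: it assumes each controlled point has a generated path and a target path with the same number of frames, and the function name `trajectory_adherence` is hypothetical.

```python
import math


def trajectory_adherence(generated_paths, target_paths):
    """Return (mean, max) Euclidean deviation between matched paths.

    generated_paths[n][t] and target_paths[n][t] are (x, y, z) positions of
    controlled point n at frame t.
    """
    deviations = [
        math.dist(g, q)
        for gen, tgt in zip(generated_paths, target_paths)
        for g, q in zip(gen, tgt)
    ]
    if not deviations:
        raise ValueError("no trajectory points to compare")
    return sum(deviations) / len(deviations), max(deviations)
```

Reporting both values is a deliberate choice: the mean captures systematic drift, while the maximum exposes a single controlled point that leaves its path. Any pass threshold, for example a fraction of the object's bounding-box diagonal, would be a further assumption.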

Figures

Figures reproduced from arXiv: 2606.05162 by Ines Kim, Jaeyeong Kim, Jahyeok Koo, Seungryong Kim.

**Figure 1.** We present T2Mo, a feed-forward framework for controllable dynamic 3D shape generation guided by 3D trajectories and text. By combining spatial trajectory conditioning with semantic text conditioning, T2Mo generates spatially controlled motions. The versatility of trajectories also enables diverse applications, including interactive fine-grained control, motion editing, and motion transfer.

**Figure 2.** Overview of T2Mo. Given a source mesh $M_0$, condition 3D trajectories $\mathcal{T}$, and a text prompt $y$, our model generates per-vertex displacements $\Delta V_{1:T}$ via a DiT [31] backbone conditioned on shape $C_{\text{shape}}$, text $C_{\text{text}}$, and trajectory tokens $C_{\text{traj}}$ through cross-attention, yielding the output mesh sequence $\{M_i\}_{i=1}^{T}$ via the decoder $D$. The trajectory condition is constructed by our shape-grounded trajectory embedding …

**Figure 3.** Qualitative comparison with text-based baselines.

**Figure 4.** Qualitative comparison with video-based baselines.

**Figure 5.** Qualitative comparison of trajectory embedding methods.

**Figure 6.** User study interface. (A) Initial instruction page describing the rating task and evaluation criteria. (B) Rating page for text-conditioned baselines, where participants rate three videos generated by different methods given a text prompt. (C) Rating page for cascaded video-based baselines, where participants additionally see the input trajectory overlay alongside the text prompt. Videos from compared meth…

**Figure 7.** Application showcase. Our framework supports a diverse range of motion controls through joint trajectory and text conditioning. Top: fine-grained motion control with user-specified trajectories. Middle: motion editing by modifying a few waypoints of representative trajectories. Bottom: motion transfer from a source motion to a target mesh via sampled representative trajectories. Our model uses only the vis…
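The Figure 2 caption describes the model output as per-vertex displacements that yield a mesh sequence from a source mesh. A minimal sketch of that final step follows, assuming displacements are measured from the source vertex positions and that mesh connectivity is unchanged across frames; neither detail is confirmed by the text here, and `apply_displacements` is a hypothetical name.

```python
def apply_displacements(source_vertices, displacements):
    """Build a vertex sequence from a source mesh and per-frame offsets.

    source_vertices: list of (x, y, z) vertices of the source mesh.
    displacements:   displacements[i][v] is the (dx, dy, dz) offset of vertex v
                     at frame i, measured from the source position (assumption).
    Faces are assumed unchanged, so only vertex positions are returned.
    """
    sequence = []
    for frame in displacements:
        if len(frame) != len(source_vertices):
            raise ValueError("each frame needs one displacement per vertex")
        sequence.append([
            tuple(p + d for p, d in zip(vertex, delta))
            for vertex, delta in zip(source_vertices, frame)
        ])
    return sequence
```

Under this reading, a trajectory on a selected vertex is simply that vertex's row of the output sequence, which is what makes path adherence directly measurable.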

Original abstract

We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

T2Mo adds explicit 3D trajectory control to text-conditioned dynamic shapes via a shape-grounded embedding, but the abstract supplies no mechanism or metrics to back the coverage claim.

The main takeaway is that this work proposes T2Mo as a feed-forward model that takes both text and 3D trajectories to produce object motions, with a shape-grounded embedding meant to turn arbitrary trajectory sets into tokens that cover the whole shape.

What stands out as new is the dual conditioning plus that embedding step, which the authors position against pure text baselines and cascaded video pipelines. The abstract frames the embedding as the fix for dense-to-sparse or uneven inputs, which is a concrete engineering move worth noting.

The paper does a straightforward job of naming the ambiguity problem in text-only generation and offering trajectory guidance as the remedy. Comparisons and user studies are mentioned as evidence of better faithfulness and expressiveness.

The soft spot is exactly the one the stress-test note flags: no description of how the shape-grounded embedding actually works, no equations, and no detail on tokenization or handling of uncovered regions. Without that, the claim that motions adhere spatially while reflecting text rests on an unexamined assumption. The quantitative results and user-study construction are also not shown, so the superiority statements cannot be checked from what is here.

This is for people already working on controllable 3D generation or animation pipelines who want to see a trajectory-plus-text route. A reader could extract the high-level idea and the baseline comparison strategy even though the implementation details need fuller description.

It deserves a serious referee to examine the full method section and results.

Referee Report

1 major / 0 minor

Summary. The paper introduces T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. It proposes a shape-grounded trajectory embedding to map arbitrary (dense to sparse, uneven) trajectory sets into shape-aware token sets covering the object. The central claim is that the resulting motions spatially adhere to the trajectories while reflecting text semantics, outperforming text-based baselines and cascaded video-based baselines in faithfulness, expressiveness, and motion quality, as demonstrated by quantitative metrics, qualitative results, and user studies.

Significance. If the central claims hold after addressing the embedding details, the work would advance controllable 3D generation by combining precise spatial guidance from trajectories with semantic control from text. The feed-forward design and explicit handling of arbitrary trajectory configurations represent practical strengths for applications in animation and design. The inclusion of both quantitative comparisons and user studies strengthens the evaluation if the metrics are clearly defined and the baselines are fairly constructed.

major comments (1)

[Method (shape-grounded trajectory embedding)] The central claim that motions 'spatially adhere to the given trajectories' while outperforming baselines depends on this embedding reliably converting arbitrary trajectory sets into a shape-aware token set that covers the entire object without artifacts or fidelity loss. The abstract states that the mapping occurs but gives no details on the shape-grounding mechanism, the tokenization strategy, or the handling of uncovered regions for sparse or uneven inputs; without ablations or analysis on such cases, the faithfulness advantage cannot be substantiated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and agree that additional details and analysis are warranted to fully substantiate the claims.

Referee: [Method (shape-grounded trajectory embedding)] The central claim that motions 'spatially adhere to the given trajectories' while outperforming baselines depends on this embedding reliably converting arbitrary trajectory sets into a shape-aware token set that covers the entire object without artifacts or fidelity loss. The abstract states that the mapping occurs but gives no details on the shape-grounding mechanism, the tokenization strategy, or the handling of uncovered regions for sparse or uneven inputs; without ablations or analysis on such cases, the faithfulness advantage cannot be substantiated.

Authors: We agree that the current description in Section 3.2 provides only a high-level overview of the shape-grounded trajectory embedding and does not include sufficient implementation specifics on the grounding mechanism, tokenization strategy, or explicit handling of sparse/uneven inputs and uncovered regions. To address this, the revised manuscript will expand the method section with these details, including how the embedding produces a shape-aware token set that covers the object. We will also add targeted ablations and analysis on sparse and uneven trajectory configurations to demonstrate robustness and support the faithfulness claims.

Revision: yes

Circularity Check

0 steps flagged

No circularity: learned feed-forward model with no derivations or self-defined reductions

The provided abstract and context present T2Mo as a learned feed-forward neural framework. The shape-grounded trajectory embedding is introduced as a proposed architectural component that maps trajectories to tokens, but no equations, fitted parameters, predictions, or derivation chains are shown that reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are referenced. The central claims rest on empirical comparisons and user studies rather than any mathematical reduction. This is the normal case of a self-contained learned model without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The shape-grounded embedding is presented as a proposed component without further decomposition.

Reference graph

Works this paper leans on

60 extracted references · 23 canonical work pages · 8 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022
[2] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024
[3] Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, Andrea Tagliasacchi, and David B. Lindell. Tc4d: Trajectory-conditioned text-to-4d generation. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Visi...
[4] Springer Nature Switzerland
[5] Hongyuan Chen, Xingyu Chen, Youjia Zhang, Zexiang Xu, and Anpei Chen. Motion 3-to-4: 3d motion reconstruction for 4d synthesis. arXiv preprint arXiv:2601.14253, 2026
[6] Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance. arXiv preprint arXiv:2512.08765, 2025
[7] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023
[8] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. URL https...
[9] Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. arXiv preprint arXiv:2412.07759, 2024
[10] Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, and Cairong Zhao. Charactershot: Controllable and consistent 4d character animation. arXiv preprint arXiv:2508.07409, 2025
[11] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CV...
[12] Ziyan Guo, Zeyu Hu, De Wen Soh, and Na Zhao. Motionlab: Unified human motion generation and editing via the motion-condition-motion paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13869–13879, 2025
[13] Junlin Han, Jianyuan Wang, Andrea Vedaldi, Philip Torr, and Filippos Kokkinos. Flex3D: Feed-forward 3D generation with flexible reconstruction model and input view curation. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference...
[14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
[15] Susung Hong, Donghoon Ahn, and Seungryong Kim. Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation. Advances in Neural Information Processing Systems, 36:11970–11987, 2023
[16] Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, and Sibei Yang. Mvtokenflow: High-quality 4d content generation using multiview token flow. arXiv preprint arXiv:2502.11697, 2025
[17] Zehuan Huang, Haoran Feng, Yang-Tian Sun, Yuan-Chen Guo, Yan-Pei Cao, and Lu Sheng. Animax: Animating the inanimate in 3d with joint video-pose diffusion models. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–13, 2025
[18] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
[19] Yanqin Jiang, Chaohui Yu, Chenjie Cao, Fan Wang, Weiming Hu, and Jin Gao. Animate3d: Animating any 3d model with multi-view video diffusion. Advances in Neural Information Processing Systems, 37:125879–125906, 2024
[20] Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Mesh4d: 4d mesh reconstruction and tracking from monocular video. arXiv preprint arXiv:2601.05251, 2026
[21] Kacper Kania, Marek Kowalski, et al. Trajevae: Controllable human motion generation from trajectories. arXiv preprint arXiv:2104.00351, 2021
[22] Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2151–2162, 2023
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
[24] Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, et al. Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details. arXiv preprint arXiv:2506.16504, 2025
[25] Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman3d: High-fidelity mesh generation with 3d native diffusion and interactive geometry refiner. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5307–5317, 2025
[26] Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025
[27] Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chenfanfu Jiang, Ming-Yu Liu, and Donglai Xiang. Articulated kinematics distillation from video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17571–17581, 2025
[28] Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608, 2025
[29] Zhiqi Li, Yiming Chen, and Peidong Liu. Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian-mesh hybrid representation. Advances in Neural Information Processing Systems, 37:21377–21400, 2024
[30] Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8576–8588, 2024
[31] Wan-Duo Kurt Ma, John P Lewis, and W Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024
[32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
[33] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022
[34] Hao Qin, Luyuan Chen, Ming Kong, Mengxu Lu, and Qiang Zhu. Distilling multi-view diffusion models into 3d generators. arXiv preprint arXiv:2504.00457, 2025
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...
[36] Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, et al. Kimodo: Scaling controllable human motion generation. arXiv preprint arXiv:2603.15546, 2026
[37] Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung W Kim, et al. L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems, 37:56828–56858, 2024
[38] Remy Sabathier, David Novotny, Niloy J Mitra, and Tom Monnier. Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. arXiv preprint arXiv:2601.16148, 2026
[39] Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, and Houqiang Li. Eg4d: Explicit generation of 4d object without score distillation. arXiv preprint arXiv:2405.18132, 2024
[40] Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. Tlcontrol: Trajectory and language control for human motion synthesis. In European Conference on Computer Vision, pages 37–54. Springer, 2024
[41] Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang. Levitor: 3d trajectory oriented image-to-video synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12490–12500, 2025
[42] Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, and Hang Li. Boximator: Generating rich and controllable motions for video synthesis. arXiv preprint arXiv:2402.01566, 2024
[43] Miaowei Wang, Qingxuan Yan, Zhi Cao, Yayuan Li, Oisin Mac Aodha, Jason J Corso, and Amir Vaxman. Bimotion: B-spline motion for text-guided dynamic 3d character generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10152–10164, 2026
[44] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023
[45] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024
[46] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025
[47] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024
[48] Zijie Wu, Chaohui Yu, Fan Wang, and Xiang Bai. Animateanymesh: A feed-forward 4d foundation model for text-driven universal mesh animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13557–13568, October 2025
[49] Zijie Wu, Chaohui Yu, Fan Wang, and Xiang Bai. Animateanymesh++: A flexible 4d foundation model for high-fidelity text-driven mesh animation, 2026. URL https://arxiv.org/abs/2604.26917
[50] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21469–21480, 2025
[51] Kevin Xie, Amirmojtaba Sabour, Jiahui Huang, Despoina Paschalidou, Greg Klar, Umar Iqbal, Sanja Fidler, and Xiaohui Zeng. Videopanda: Video panoramic diffusion with multi-view attention. arXiv preprint arXiv:2504.11389, 2025
[52] Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. In International Conference on Learning Representations, volume 2024, pages 28176–28194, 2024
[53]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Shapegen4d: Towards high quality 4d shape generation from videos.arXiv preprint arXiv:2510.06208, 2025

Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A Yeh, Peter Wonka, and Chaoyang Wang. Shapegen4d: Towards high quality 4d shape generation from videos.arXiv preprint arXiv:2510.06208, 2025

work page arXiv 2025
[55]

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Drag- nuwa: Fine-grained control in video generation by integrating text, image, and trajectory.arXiv preprint arXiv:2308.08089, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

4real: Towards photorealistic 4d scene generation via video diffusion models.Advances in Neural Information Processing Systems, 37:45256–45280, 2024

Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models.Advances in Neural Information Processing Systems, 37:45256–45280, 2024

2024
[57]

3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Transactions On Graphics (TOG), 42(4):1–16, 2023

Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models.ACM Transactions On Graphics (TOG), 42(4):1–16, 2023

2023
[58]

4diffusion: Multi- view video diffusion model for 4d generation.Advances in Neural Information Processing Systems, 37: 15272–15295, 2024

Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi- view video diffusion model for 4d generation.Advances in Neural Information Processing Systems, 37: 15272–15295, 2024

2024
[59]

Clay: A controllable large-scale generative model for creating high-quality 3d assets.ACM Transactions on Graphics (TOG), 43(4):1–20, 2024

Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets.ACM Transactions on Graphics (TOG), 43(4):1–20, 2024

2024
[60]

The bulldozer lowers its blade and then return to initial pose

Zhenghao Zhang, Junchao Liao, Menghao Li, ZuoZhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2063–2073, June 2025. 13 Appendix A Implementation details Training details.Inspired...

2063

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.

[2] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024.

[3] Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, Andrea Tagliasacchi, and David B. Lindell. Tc4d: Trajectory-conditioned text-to-4d generation. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Visi...

[4] Springer Nature Switzerland

[5] Hongyuan Chen, Xingyu Chen, Youjia Zhang, Zexiang Xu, and Anpei Chen. Motion 3-to-4: 3d motion reconstruction for 4d synthesis. arXiv preprint arXiv:2601.14253, 2026.

[6] Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance. arXiv preprint arXiv:2512.08765, 2025.

[7] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023.

[8] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. URL https...

[9] Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, and Dahua Lin. 3dtrajmaster: Mastering 3d trajectory for multi-entity motion in video generation. arXiv preprint arXiv:2412.07759, 2024.

[10] Junyao Gao, Jiaxing Li, Wenran Liu, Yanhong Zeng, Fei Shen, Kai Chen, Yanan Sun, and Cairong Zhao. Charactershot: Controllable and consistent 4d character animation. arXiv preprint arXiv:2508.07409, 2025.

[11] Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun. Motion prompting: Controlling video generation with motion trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CV...

[12] Ziyan Guo, Zeyu Hu, De Wen Soh, and Na Zhao. Motionlab: Unified human motion generation and editing via the motion-condition-motion paradigm. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13869–13879, 2025.

[13] Junlin Han, Jianyuan Wang, Andrea Vedaldi, Philip Torr, and Filippos Kokkinos. Flex3D: Feed-forward 3D generation with flexible reconstruction model and input view curation. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors, Proceedings of the 42nd International Conference...

[14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[15] Susung Hong, Donghoon Ahn, and Seungryong Kim. Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation. Advances in Neural Information Processing Systems, 36:11970–11987, 2023.

[16] Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, and Sibei Yang. Mvtokenflow: High-quality 4d content generation using multiview token flow. arXiv preprint arXiv:2502.11697, 2025.

[17] Zehuan Huang, Haoran Feng, Yang-Tian Sun, Yuan-Chen Guo, Yan-Pei Cao, and Lu Sheng. Animax: Animating the inanimate in 3d with joint video-pose diffusion models. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–13, 2025.

[18] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.

[19] Yanqin Jiang, Chaohui Yu, Chenjie Cao, Fan Wang, Weiming Hu, and Jin Gao. Animate3d: Animating any 3d model with multi-view video diffusion. Advances in Neural Information Processing Systems, 37:125879–125906, 2024.

[20] Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Mesh4d: 4d mesh reconstruction and tracking from monocular video. arXiv preprint arXiv:2601.05251, 2026.

[21] Kacper Kania, Marek Kowalski, et al. Trajevae: Controllable human motion generation from trajectories. arXiv preprint arXiv:2104.00351, 2021.

[22] Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2151–2162, 2023.

[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, et al. Hunyuan3d 2.5: Towards high-fidelity 3d assets generation with ultimate details. arXiv preprint arXiv:2506.16504, 2025.

[25] Weiyu Li, Jiarui Liu, Hongyu Yan, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman3d: High-fidelity mesh generation with 3d native diffusion and interactive geometry refiner. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5307–5317, 2025.

[26] Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025.

[27] Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chenfanfu Jiang, Ming-Yu Liu, and Donglai Xiang. Articulated kinematics distillation from video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17571–17581, 2025.

[28] Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608, 2025.

[29] Zhiqi Li, Yiming Chen, and Peidong Liu. Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian-mesh hybrid representation. Advances in Neural Information Processing Systems, 37:21377–21400, 2024.

[30] Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8576–8588, 2024.

[31] Wan-Duo Kurt Ma, John P Lewis, and W Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

[32] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[33] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

[34] Hao Qin, Luyuan Chen, Ming Kong, Mengxu Lu, and Qiang Zhu. Distilling multi-view diffusion models into 3d generators. arXiv preprint arXiv:2504.00457, 2025.

[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

[36] Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, et al. Kimodo: Scaling controllable human motion generation. arXiv preprint arXiv:2603.15546, 2026.

[37] Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung W Kim, et al. L4gm: Large 4d gaussian reconstruction model. Advances in Neural Information Processing Systems, 37:56828–56858, 2024.

[38] Remy Sabathier, David Novotny, Niloy J Mitra, and Tom Monnier. Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. arXiv preprint arXiv:2601.16148, 2026.

[39] Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, and Houqiang Li. Eg4d: Explicit generation of 4d object without score distillation. arXiv preprint arXiv:2405.18132, 2024.

[40] Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. Tlcontrol: Trajectory and language control for human motion synthesis. In European Conference on Computer Vision, pages 37–54. Springer, 2024.

[41] Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, and Limin Wang. Levitor: 3d trajectory oriented image-to-video synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12490–12500, 2025.

[42] Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping Yuan, and Hang Li. Boximator: Generating rich and controllable motions for video synthesis. arXiv preprint arXiv:2402.01566, 2024.

[43] Miaowei Wang, Qingxuan Yan, Zhi Cao, Yayuan Li, Oisin Mac Aodha, Jason J Corso, and Amir Vaxman. Bimotion: B-spline motion for text-guided dynamic 3d character generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10152–10164, 2026.

[44] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023.

[45] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

[46] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26057–26068, 2025.

[47] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In European Conference on Computer Vision, pages 331–348. Springer, 2024.

[48] Zijie Wu, Chaohui Yu, Fan Wang, and Xiang Bai. Animateanymesh: A feed-forward 4d foundation model for text-driven universal mesh animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13557–13568, October 2025.

[49] Zijie Wu, Chaohui Yu, Fan Wang, and Xiang Bai. Animateanymesh++: A flexible 4d foundation model for high-fidelity text-driven mesh animation, 2026. URL https://arxiv.org/abs/2604.26917

[50] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21469–21480, 2025.

[51] Kevin Xie, Amirmojtaba Sabour, Jiahui Huang, Despoina Paschalidou, Greg Klar, Umar Iqbal, Sanja Fidler, and Xiaohui Zeng. Videopanda: Video panoramic diffusion with multi-view attention. arXiv preprint arXiv:2504.11389, 2025.

[52] Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. In International Conference on Learning Representations, volume 2024, pages 28176–28194, 2024.

[53] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[54] Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A Yeh, Peter Wonka, and Chaoyang Wang. Shapegen4d: Towards high quality 4d shape generation from videos. arXiv preprint arXiv:2510.06208, 2025.

[55] Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.

[56] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models. Advances in Neural Information Processing Systems, 37:45256–45280, 2024.

[57] Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions On Graphics (TOG), 42(4):1–16, 2023.

[58] Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi-view video diffusion model for 4d generation. Advances in Neural Information Processing Systems, 37:15272–15295, 2024.

[59] Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024.

[60] Zhenghao Zhang, Junchao Liao, Menghao Li, ZuoZhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2063–2073, June 2025.

Appendix A Implementation details

Training details. Inspired...