ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Baptiste Bellot-Gurlet; Fabio Pizzati; Ivan Laptev; Omar El Khalifi; Oscar Fossey; Philip Torr; Thibault Fouque; Thomas Rossi; Ulysse Mizrahi

arXiv: 2605.06667 · v1 · submitted 2026-05-07 · cs.CV · cs.AI · cs.LG

Omar El Khalifi, Thomas Rossi, Oscar Fossey, Thibault Fouque, Ulysse Mizrahi, Philip Torr, Ivan Laptev, Fabio Pizzati, Baptiste Bellot-Gurlet

Pith reviewed 2026-05-08 12:11 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG

keywords: video generation, camera control, motion transfer, diffusion models, zero-shot learning, pose conditioning, depth maps

The pith

ActCam enables zero-shot joint control of character motion and per-frame camera parameters in video generation by generating consistent pose and depth conditions for pretrained diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ActCam as a method to control both the actor's performance and the camera's trajectory in generated videos without training any new models. It starts from a driving video that supplies the desired character motion and a separate specification of the target camera path. From these, the method creates pose and depth maps that stay geometrically consistent from frame to frame. These maps are then supplied to any existing image-to-video diffusion model through a two-phase schedule: the first part of denoising uses both pose and sparse depth to lock in overall scene structure, after which depth is removed and pose-only guidance refines motion details. The result is videos that more closely follow both the intended actions and the specified camera moves than prior pose-only or combined methods.

Core claim

ActCam generates pose and depth conditions that remain geometrically consistent across frames from a source video with a moving character and a target camera motion. It then runs a single sampling process with a two-phase conditioning schedule on any pretrained image-to-video diffusion model that accepts scene depth and character pose: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details. This produces per-frame control of intrinsic and extrinsic camera parameters together with motion transfer in a zero-shot setting.
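The two-phase schedule described above can be sketched as a per-step lookup of which conditions are active. This is a minimal illustration, not the authors' implementation: `two_phase_schedule`, `sample`, `denoise_step`, and `depth_fraction` are hypothetical names, and treating the switch point as a fraction of the sampling steps (with 0.2 borrowed from the Figure 4 caption) is an assumption about how that value is defined.

```python
from typing import Callable, List, Tuple


def two_phase_schedule(num_steps: int, depth_fraction: float) -> List[Tuple[str, ...]]:
    """Return the set of active conditions for each denoising step.

    depth_fraction is the assumed share of steps, counted from the start of
    sampling (the noisiest steps), that receive pose + sparse depth.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie in [0, 1]")
    switch_step = round(num_steps * depth_fraction)
    schedule = []
    for step in range(num_steps):
        if step < switch_step:
            schedule.append(("pose", "depth"))  # phase 1: lock in scene structure
        else:
            schedule.append(("pose",))  # phase 2: pose-only refinement
    return schedule


def sample(denoise_step: Callable, latent, pose, depth,
           num_steps: int = 50, depth_fraction: float = 0.2):
    """Run one sampling pass; denoise_step stands in for a pretrained backbone."""
    for step, active in enumerate(two_phase_schedule(num_steps, depth_fraction)):
        conditions = {"pose": pose}
        if "depth" in active:
            conditions["depth"] = depth
        latent = denoise_step(latent, step, conditions)
    return latent
```

Setting `depth_fraction=1.0` in this sketch keeps depth active for every step, and `depth_fraction=0.0` gives a pose-only run, which is how the two extremes of the schedule could be compared.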

What carries the argument

The two-phase conditioning schedule on pose and depth maps that are kept geometrically consistent across frames.

If this is right

Videos better match both the supplied character actions and the chosen camera paths than pose-only baselines.
Human viewers prefer the outputs especially when large viewpoint changes are required.
The same pretrained diffusion backbone can be reused for different motion sources and camera specifications without retraining.
Control extends to both intrinsic parameters such as focal length and extrinsic parameters such as camera position and orientation on a per-frame basis.
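Per-frame intrinsic and extrinsic control, as in the last item, can be pictured with the standard pinhole camera model: each frame carries its own focal length, principal point, rotation, and translation, and a 3D joint is projected through them to obtain its pixel position and depth. The sketch below is only an illustration under that model; `project_point`, `render_pose_track`, and the camera dictionary layout are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def project_point(point_world: Vec3, rotation: Mat3, translation: Vec3,
                  focal: float, principal: Tuple[float, float]) -> Tuple[float, float, float]:
    """Project a world-space point to pixel coordinates; returns (u, v, depth)."""
    # Extrinsics: X_cam = R @ X_world + t
    cam = [sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
           for r in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, then scale by focal length and shift
    u = focal * cam[0] / cam[2] + principal[0]
    v = focal * cam[1] / cam[2] + principal[1]
    return u, v, cam[2]


def render_pose_track(joint_world: Vec3, cameras: List[Dict]) -> List[Tuple[float, float, float]]:
    """Project one joint under a list of per-frame cameras."""
    return [project_point(joint_world, c["R"], c["t"], c["focal"], c["principal"])
            for c in cameras]


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cameras = [
        {"R": identity, "t": (0.0, 0.0, 0.0), "focal": 500.0, "principal": (320.0, 240.0)},
        # Same pose, doubled focal length: a zoom expressed purely through intrinsics
        {"R": identity, "t": (0.0, 0.0, 0.0), "focal": 1000.0, "principal": (320.0, 240.0)},
    ]
    print(render_pose_track((0.5, 0.0, 2.0), cameras))
```

Changing only `focal` between frames moves the projected joint without moving the camera, which is the distinction between intrinsic and extrinsic control that the list item draws.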

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The staged removal of depth conditioning may generalize to other auxiliary signals that become over-constraining in later denoising steps.
Success under large viewpoint changes suggests that geometric consistency in the input conditions matters more than the absolute amount of conditioning information.
The approach could be tested on driving videos captured with consumer cameras to check robustness when the source motion and target camera are less perfectly aligned.

Load-bearing premise

That pose and depth conditions can be generated to remain geometrically consistent across frames and that a two-phase conditioning schedule on a pretrained diffusion model is sufficient to achieve per-frame intrinsic and extrinsic camera control without any training or fine-tuning.

What would settle it

Apply ActCam to a driving video and a target camera trajectory that includes a 180-degree rotation around the character, then measure whether the generated video's viewpoints match the specified trajectory while preserving the character's motion sequence.
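One way to set up such a test is to specify the target trajectory as a yaw sweep and score adherence with the geodesic angle between target and recovered rotations. The sketch below assumes the recovered per-frame rotations come from some external camera pose estimator, which the source does not name; `yaw_rotation`, `orbit_yaws`, `rotation_error_deg`, and `mean_rotation_error` are hypothetical helpers.

```python
import math
from typing import List

Mat3 = List[List[float]]


def yaw_rotation(angle_deg: float) -> Mat3:
    """Rotation about the vertical (y) axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def orbit_yaws(num_frames: int, total_deg: float = 180.0) -> List[float]:
    """Evenly spaced yaw angles from 0 to total_deg, one per frame."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [total_deg * i / (num_frames - 1) for i in range(num_frames)]


def rotation_error_deg(r_a: Mat3, r_b: Mat3) -> float:
    """Geodesic angle between two rotations, in degrees."""
    # trace(R_a^T R_b) equals the elementwise product summed over all entries
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for float safety
    return math.degrees(math.acos(cos_angle))


def mean_rotation_error(target: List[Mat3], estimated: List[Mat3]) -> float:
    """Average per-frame rotation error over a trajectory."""
    errors = [rotation_error_deg(a, b) for a, b in zip(target, estimated)]
    return sum(errors) / len(errors)
```

A full check would also compare camera positions and confirm that the character's joint sequence is preserved; this sketch covers only the rotational part of trajectory adherence.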

Figures

Figures reproduced from arXiv: 2605.06667 by Baptiste Bellot-Gurlet, Fabio Pizzati, Ivan Laptev, Omar El Khalifi, Oscar Fossey, Philip Torr, Thibault Fouque, Thomas Rossi, Ulysse Mizrahi.

**Figure 1.** Overview. ActCam enables zero-shot joint control of acting motion and camera motion for single-image video generation from a reference image, assuming only widespread conditioning capability of the backbone model on depth and keypoints. Given a reference image, an acting video representing the desired motion, and a target per-frame camera trajectory, ActCam generates a video that preserves identity while f…

**Figure 2.** ActCam pipeline. Given a reference image, an acting video, and a target camera trajectory, we (1) estimate background depth from an inpainted reference, (2) recover motion and align it to the background scene via fitting, and (3) rasterize pose and depth+pose control signals under the target viewpoint. A two-phase denoising schedule conditions early steps on depth+pose for stronger camera control, then ref…

**Figure 3.** User study. We compare with Uni3C on camera adherence (Camera) and motion faithfulness (Motion) with respect to the conditioning input, alongside overall visual quality (Visual). We considerably outperform Uni3C, the closest method to ours.

**Figure 4.** Effect of 𝑁𝐷 on VBench score. The figure shows the average VBench scores as a function of 𝑁𝐷, where the conditioning switches from pose+depth to pose-only. Early switching under-constrains the generation, while late switching (low 𝑡) can propagate depth artifacts into high-frequency details, harming results. We set an optimal 𝑁𝐷 = 0.2.

**Figure 5.** Importance of conditioning schedule. Panels: Depth Map, Without Condition Schedule, With Condition Schedule. Excessive depth guidance (setting 𝑁𝐷 = 1) can overly constrain the scene, producing static backgrounds under camera motion (center, red circle). Instead, 𝑁𝐷 < 1 allows to flexibly move the barbell to follow the human motion (right).

4.4 Ablation studies. Balance of depth conditioning. We vary the number of initial diffusion steps conditioned on both pose and depth (𝑁𝐷 …

**Figure 8.** Importance of scene transfer. Without scene transfer (No alignment), the condition does not respect 3D coherence. Uniform weighting improves placement but importance weighting (ours) is required to achieve best results. The red arrows (right column) show depth/positions offsets.

Scene transfer. In Section 3.2, we describe how we align the composed character depth with the rendered environment depth to st…

**Figure 9.** Comparison with Uni3C. Uni3C yields suboptimal camera control (top, middle) and unrealistic character motion (bottom). In the insets, a visualization of the control signal for both Uni3C and ActCam.

**Figure 10.** Different cameras. We first show the conditioning signal and ActCam results (top two rows). In the next three rows, we vary camera movements. As visible, the character appearance and motion remain consistent.

ACM Trans. Graph., Vol. 1, No. 1, Article . Publication date: May 2026

**Figure 11.** Different scenes. We display two outputs of ActCam showing the same motion rendered on two characters in different scenes, using the same camera controls.

**Figure 12.** Different scenes and different cameras. To show the flexibility of our approach, we apply the same motion to two characters in different scenes, by also varying the camera control. ActCam still renders the correct motion. Panel labels: Conditioning, Output, Conditioning, Output.

**Figure 13.** Multi-character results. ActCam handles multiple characters by applying the scene transfer and motion fitting independently per character.
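The contrast between uniform and importance weighting in the Figure 8 caption can be illustrated with a weighted least-squares fit of a scale and shift that maps character depth onto environment depth. The paper's actual fitting objective and its choice of weights are not given in the excerpts above, so this is a generic sketch: `fit_scale_shift` and the example weights are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_scale_shift(source: Sequence[float], target: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> Tuple[float, float]:
    """Weighted least squares for target ~ scale * source + shift.

    weights=None gives uniform weighting; non-uniform weights let some
    correspondences (for example, more trusted ones) dominate the fit.
    """
    if weights is None:
        weights = [1.0] * len(source)
    w_sum = sum(weights)
    if w_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    mean_s = sum(w * s for w, s in zip(weights, source)) / w_sum
    mean_t = sum(w * t for w, t in zip(weights, target)) / w_sum
    var = sum(w * (s - mean_s) ** 2 for w, s in zip(weights, source))
    cov = sum(w * (s - mean_s) * (t - mean_t)
              for w, s, t in zip(weights, source, target))
    if var == 0:
        raise ValueError("source values carry no spread under these weights")
    scale = cov / var
    shift = mean_t - scale * mean_s
    return scale, shift


if __name__ == "__main__":
    src = [1.0, 2.0, 3.0, 4.0]
    tgt = [3.0, 5.0, 7.0, 20.0]  # last pair is an outlier
    print("uniform:   ", fit_scale_shift(src, tgt))
    print("importance:", fit_scale_shift(src, tgt, weights=[1.0, 1.0, 1.0, 0.0]))
```

With uniform weights the outlier pulls the fit away from the three consistent correspondences; down-weighting it recovers the scale and shift they agree on.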

Original abstract

For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ActCam's two-phase pose-plus-depth schedule offers a straightforward zero-shot route to joint motion and camera control on top of existing I2V models, but the abstract leaves the quantitative support and geometric consistency claims hard to verify.

ActCam shows a workable zero-shot method for transferring character motion from a driving video while also steering the camera trajectory frame by frame. It generates pose and depth conditions that are meant to stay geometrically consistent, then feeds them into any pretrained image-to-video diffusion model with a two-phase schedule: early steps use both pose and sparse depth to lock in scene structure, after which depth is dropped and pose-only guidance finishes the details. That staged drop is the piece that looks new compared with prior pose-only or separate-camera approaches.

The paper reports better camera adherence and motion fidelity than baselines, plus stronger human preference especially on large viewpoint shifts, all without any training or fine-tuning. That practical angle is useful for people who want to experiment with control on top of off-the-shelf models.

The main soft spots sit in the evidence. The abstract mentions benchmark improvements and human studies but supplies no numbers, error bars, or evaluation protocol details, so the size of the gains is difficult to judge. The central assumption, that the generated conditions remain consistent enough and that removing depth after the first phase still anchors intrinsics and the full extrinsic path, needs checking. For big camera moves the later pose-only steps could let 3D geometry drift while still satisfying the remaining condition, which would undercut both claims. Without the full experimental section or code it is hard to tell how robust the condition generation actually is.

This work is aimed at researchers and practitioners who build or use controllable video generation tools, especially those extending diffusion models for animation or simulation tasks. A reader looking for a simple way to add camera control to motion transfer would find the method worth trying. I would send it to peer review. The idea is coherent and addresses a concrete need, even if the current presentation leaves the strength of the results open to question.

Referee Report

3 major / 2 minor

Summary. The paper introduces ActCam, a zero-shot method for joint control of character motion (from a driving video) and per-frame camera intrinsics/extrinsics (from a target trajectory) in video generation. It builds on pretrained image-to-video diffusion models by generating geometrically consistent pose and depth conditions, then applies a two-phase sampling schedule: early denoising steps condition on both pose and sparse depth to enforce structure, after which depth is dropped and pose-only guidance is used to refine details. Evaluations on benchmarks with diverse motions and large viewpoint changes claim improved camera adherence and motion fidelity over pose-only and competing pose+camera methods, along with higher human preference.

Significance. If the central claims hold, ActCam offers a practical, training-free advance in controllable video synthesis by enabling simultaneous 3D motion transfer and cinematographic control. The zero-shot reliance on existing models combined with geometrically consistent conditioning and staged guidance is a notable strength that could generalize to other diffusion-based tasks. This addresses an important gap for artistic video generation applications where both actor performance and camera work must be specified precisely.

major comments (3)

[Method (two-phase schedule)] Method section (two-phase conditioning schedule): The claim that dropping depth after early steps still achieves per-frame intrinsic and extrinsic camera control rests on an untested assumption that the pretrained model will maintain 3D geometry from pose alone. For large viewpoint changes this risks drift in focal length, principal point, or trajectory adherence, directly undermining the joint control contribution; an ablation isolating the late-stage pose-only phase would be required to support the schedule's sufficiency.
[Experiments] Experiments and evaluation: The reported improvements in camera adherence and motion fidelity, plus human preference results, lack accompanying quantitative metrics, error bars, statistical significance tests, or detailed protocols for how pose/depth conditions were generated and how adherence was measured. This is load-bearing for the central empirical claim and leaves the comparisons vulnerable to unstated choices in condition generation or evaluation.
[Method (condition generation)] Condition generation (geometric consistency): The assertion that pose and depth maps remain geometrically consistent across frames when derived from the driving video plus target trajectory is foundational, yet the manuscript provides insufficient validation (e.g., no reprojection error statistics or cross-frame consistency metrics) to confirm this holds under the challenging viewpoint changes highlighted in the evaluation.
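The cross-frame consistency metric requested in the last comment could take the following form: a static scene point observed in several rendered depth maps is unprojected back to world space with each frame's camera, and the spread of the recovered points is reported. This is a sketch under a pinhole model with the convention X_cam = R X_world + t; `unproject`, `consistency_stats`, and the observation layout are hypothetical and are not taken from the manuscript.

```python
import math
from typing import Dict, List, Sequence


def unproject(u: float, v: float, depth: float, focal: float,
              principal: Sequence[float], rotation: Sequence[Sequence[float]],
              translation: Sequence[float]) -> List[float]:
    """Lift a pixel with known depth to world space."""
    # Invert the intrinsics to get the camera-space point
    x_cam = [(u - principal[0]) * depth / focal,
             (v - principal[1]) * depth / focal,
             depth]
    # Invert the extrinsics: X_world = R^T (X_cam - t), since R is orthonormal
    shifted = [x_cam[i] - translation[i] for i in range(3)]
    return [sum(rotation[r][c] * shifted[r] for r in range(3)) for c in range(3)]


def consistency_stats(observations: Sequence[Dict]) -> Dict[str, float]:
    """Mean and max distance of unprojected points from their centroid.

    Each observation describes the same static scene point in one frame.
    Zero spread means the depth maps and cameras agree exactly.
    """
    points = [unproject(o["u"], o["v"], o["depth"], o["focal"],
                        o["principal"], o["R"], o["t"]) for o in observations]
    centroid = [sum(p[i] for p in points) / len(points) for i in range(3)]
    spread = [math.dist(p, centroid) for p in points]
    return {"mean": sum(spread) / len(spread), "max": max(spread)}
```

Reporting these statistics separately for small and large viewpoint changes would show whether consistency degrades along the trajectories the evaluation emphasizes.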

minor comments (2)

[Abstract] The abstract would benefit from naming the specific benchmarks and briefly indicating the scale of the human study.
[Figures] Figure captions and the project page reference could more explicitly describe how the visualized pose and depth conditions relate to the target camera parameters.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below, agreeing where revisions are needed to strengthen the claims, and describe the changes we will incorporate.

Referee: Method section (two-phase conditioning schedule): The claim that dropping depth after early steps still achieves per-frame intrinsic and extrinsic camera control rests on an untested assumption that the pretrained model will maintain 3D geometry from pose alone. For large viewpoint changes this risks drift in focal length, principal point, or trajectory adherence, directly undermining the joint control contribution; an ablation isolating the late-stage pose-only phase would be required to support the schedule's sufficiency.

Authors: We agree that an explicit ablation isolating the late-stage pose-only phase is necessary to rigorously support the two-phase schedule, particularly for large viewpoint changes. While the full method demonstrates improved results, we will add this ablation (comparing two-phase conditioning against pose-only throughout sampling) to the revised manuscript to directly address potential drift concerns. revision: yes

Referee: Experiments and evaluation: The reported improvements in camera adherence and motion fidelity, plus human preference results, lack accompanying quantitative metrics, error bars, statistical significance tests, or detailed protocols for how pose/depth conditions were generated and how adherence was measured. This is load-bearing for the central empirical claim and leaves the comparisons vulnerable to unstated choices in condition generation or evaluation.

Authors: We acknowledge that the current presentation of results would benefit from greater quantitative detail and transparency. We will revise the Experiments section to report specific metrics for camera adherence and motion fidelity (with error bars and statistical tests where appropriate), along with full protocols for condition generation and adherence measurement, to make the empirical claims more robust and reproducible. revision: yes

Referee: Condition generation (geometric consistency): The assertion that pose and depth maps remain geometrically consistent across frames when derived from the driving video plus target trajectory is foundational, yet the manuscript provides insufficient validation (e.g., no reprojection error statistics or cross-frame consistency metrics) to confirm this holds under the challenging viewpoint changes highlighted in the evaluation.

Authors: We recognize that explicit quantitative validation of geometric consistency would strengthen the foundational claim. Our pipeline derives consistent conditions via 3D-aware processing of the driving video and target trajectory, but we will add reprojection error statistics and cross-frame consistency metrics (especially for large viewpoint changes) to the revised manuscript. revision: yes
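For the statistical tests mentioned in the second response, a pairwise human preference study is often summarized with an exact two-sided sign test against the null hypothesis that neither method is preferred. The sketch below is one such option, not the authors' stated protocol; `sign_test_p_value` is a hypothetical helper, ties are assumed to be discarded beforehand, and the counts in the usage line are placeholders rather than results from the paper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value under a fair-coin null (p = 0.5)."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Placeholder counts for illustration only
    print(sign_test_p_value(wins=10, losses=0))
```

The same counts also give the preference rate `wins / (wins + losses)`, which could be reported with a confidence interval alongside the p-value.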

Circularity Check

0 steps flagged

No significant circularity; method relies on external pretrained models and empirical validation

The paper introduces a zero-shot conditioning strategy on top of any pretrained image-to-video diffusion model, generating geometrically consistent pose and depth maps from driving video and target trajectory, then applying a two-phase schedule (pose+sparse-depth early, pose-only later). This construction is not self-definitional, does not rename fitted inputs as predictions, and contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The central claims of improved camera adherence and motion fidelity are supported by external benchmark comparisons and human evaluations rather than reducing to the method's own inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing pretrained diffusion models can effectively utilize pose and depth conditioning when the inputs are made geometrically consistent; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Pretrained image-to-video diffusion models accept and respond to conditioning in terms of scene depth and character pose.
The method is explicitly built on any such model, as stated in the abstract.

pith-pipeline@v0.9.0 · 5569 in / 1228 out tokens · 50884 ms · 2026-05-08T12:11:38.591800+00:00 · methodology

