ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation
Pith reviewed 2026-05-08 12:11 UTC · model grok-4.3
The pith
ActCam enables zero-shot joint control of character motion and per-frame camera parameters in video generation by generating consistent pose and depth conditions for pretrained diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ActCam generates pose and depth conditions that remain geometrically consistent across frames from a source video with a moving character and a target camera motion. It then runs a single sampling process with a two-phase conditioning schedule on any pretrained image-to-video diffusion model that accepts scene depth and character pose: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details. This produces per-frame control of intrinsic and extrinsic camera parameters together with motion transfer in a zero-shot setting.
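The two-phase schedule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `two_phase_schedule`, the `depth_fraction` parameter, and the 30% switch point used in the example are all hypothetical, since the summary does not say at which step depth is dropped.

```python
from typing import List, Set


def two_phase_schedule(num_steps: int, depth_fraction: float) -> List[Set[str]]:
    """Return the set of active conditions for each denoising step.

    depth_fraction is a hypothetical knob: the share of early steps that
    keep sparse depth alongside pose. The source does not state a value.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie in [0, 1]")
    switch_step = round(num_steps * depth_fraction)
    schedule: List[Set[str]] = []
    for step in range(num_steps):
        if step < switch_step:
            # Phase 1: pose and sparse depth together enforce scene structure.
            schedule.append({"pose", "depth"})
        else:
            # Phase 2: depth is dropped; pose alone guides detail refinement.
            schedule.append({"pose"})
    return schedule


# Illustrative values only: 10 sampling steps, depth kept for the first 30%.
schedule = two_phase_schedule(num_steps=10, depth_fraction=0.3)
```

In a real sampler, each entry would decide which condition tensors are passed to the denoiser at that step. The switch point is the one free choice the schedule exposes.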
What carries the argument
The two-phase conditioning schedule on pose and depth maps that are kept geometrically consistent across frames.
If this is right
- Videos better match both the supplied character actions and the chosen camera paths than pose-only baselines.
- Human viewers prefer the outputs, especially when large viewpoint changes are required.
- The same pretrained diffusion backbone can be reused for different motion sources and camera specifications without retraining.
- Control extends to both intrinsic parameters such as focal length and extrinsic parameters such as camera position and orientation on a per-frame basis.
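To make per-frame intrinsic and extrinsic control concrete, the sketch below projects one 3D joint through a standard pinhole camera whose parameters change from frame to frame. The pinhole model is textbook material. The helper `project_point`, the numeric values, and the suggestion that pose conditions could be rendered this way are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def project_point(point_world: Vec3, rotation: Mat3, translation: Vec3,
                  focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a world-space point to pixel coordinates with a pinhole camera."""
    # Extrinsics: world -> camera coordinates, X_c = R @ X_w + t.
    x_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, scale by focal length, shift by principal point.
    u = focal * x_cam[0] / x_cam[2] + cx
    v = focal * x_cam[1] / x_cam[2] + cy
    return (u, v)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
joint = (0.5, 0.0, 2.0)  # hypothetical 3D joint position

# Frame 0: focal length of 500 px.
frame0 = project_point(joint, IDENTITY, (0.0, 0.0, 0.0), 500.0, 320.0, 240.0)
# Frame 1: same camera pose, focal length doubled (an intrinsic change).
frame1 = project_point(joint, IDENTITY, (0.0, 0.0, 0.0), 1000.0, 320.0, 240.0)
# Frame 2: original focal length, camera translated (an extrinsic change).
frame2 = project_point(joint, IDENTITY, (-0.5, 0.0, 0.0), 500.0, 320.0, 240.0)
```

The same joint lands at a different pixel in each frame, which is why a pose condition has to be regenerated per frame whenever either set of camera parameters changes.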
Where Pith is reading between the lines
- The staged removal of depth conditioning may generalize to other auxiliary signals that become over-constraining in later denoising steps.
- Success under large viewpoint changes would suggest that geometric consistency in the input conditions matters more than the absolute amount of conditioning information.
- The approach could be tested on driving videos captured with consumer cameras to check robustness when the source motion and target camera are less perfectly aligned.
Load-bearing premise
That pose and depth conditions can be generated to remain geometrically consistent across frames and that a two-phase conditioning schedule on a pretrained diffusion model is sufficient to achieve per-frame intrinsic and extrinsic camera control without any training or fine-tuning.
What would settle it
Apply ActCam to a driving video and a target camera trajectory that includes a 180-degree rotation around the character, then measure whether the generated video's viewpoints match the specified trajectory while preserving the character's motion sequence.
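One way to set up such a test is sketched below: build a 180-degree yaw orbit as the target trajectory, then score an estimated trajectory by the geodesic angle between target and estimated rotations. Everything here is an assumption made for illustration: the function names, the choice of a yaw-only orbit, and the stand-in "estimated" trajectory with a constant 5-degree offset, which substitutes for camera poses that would in practice be recovered from the generated video by some pose estimator.

```python
import math
from typing import List

Mat3 = List[List[float]]


def yaw_rotation(angle_rad: float) -> Mat3:
    """Rotation about the vertical (y) axis by angle_rad."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def orbit_yaws(num_frames: int, total_degrees: float = 180.0) -> List[float]:
    """Evenly spaced yaw angles (radians) from 0 to total_degrees."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    total = math.radians(total_degrees)
    return [total * i / (num_frames - 1) for i in range(num_frames)]


def rotation_error_deg(r_a: Mat3, r_b: Mat3) -> float:
    """Geodesic angle between two rotations, in degrees."""
    # trace(R_a^T R_b) equals the elementwise product summed over all entries.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


target_yaws = orbit_yaws(num_frames=5)
# Stand-in for poses estimated from a generated video: constant 5-degree offset.
estimated_yaws = [y + math.radians(5.0) for y in target_yaws]

per_frame_error = [
    rotation_error_deg(yaw_rotation(t), yaw_rotation(e))
    for t, e in zip(target_yaws, estimated_yaws)
]
mean_error = sum(per_frame_error) / len(per_frame_error)
```

A full check would also compare translations and the character's motion sequence. This sketch covers only the rotational part of camera adherence.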
Original abstract
For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ActCam, a zero-shot method for joint control of character motion (from a driving video) and per-frame camera intrinsics/extrinsics (from a target trajectory) in video generation. It builds on pretrained image-to-video diffusion models by generating geometrically consistent pose and depth conditions, then applies a two-phase sampling schedule: early denoising steps condition on both pose and sparse depth to enforce structure, after which depth is dropped and pose-only guidance is used to refine details. Evaluations on benchmarks with diverse motions and large viewpoint changes claim improved camera adherence and motion fidelity over pose-only and competing pose+camera methods, along with higher human preference.
Significance. If the central claims hold, ActCam offers a practical, training-free advance in controllable video synthesis by enabling simultaneous 3D motion transfer and cinematographic control. The zero-shot reliance on existing models combined with geometrically consistent conditioning and staged guidance is a notable strength that could generalize to other diffusion-based tasks. This addresses an important gap for artistic video generation applications where both actor performance and camera work must be specified precisely.
major comments (3)
- [Method (two-phase schedule)] Method section (two-phase conditioning schedule): The claim that dropping depth after early steps still achieves per-frame intrinsic and extrinsic camera control rests on an untested assumption that the pretrained model will maintain 3D geometry from pose alone. For large viewpoint changes this risks drift in focal length, principal point, or trajectory adherence, directly undermining the joint control contribution; an ablation isolating the late-stage pose-only phase would be required to support the schedule's sufficiency.
- [Experiments] Experiments and evaluation: The reported improvements in camera adherence and motion fidelity, plus human preference results, lack accompanying quantitative metrics, error bars, statistical significance tests, or detailed protocols for how pose/depth conditions were generated and how adherence was measured. This is load-bearing for the central empirical claim and leaves the comparisons vulnerable to unstated choices in condition generation or evaluation.
- [Method (condition generation)] Condition generation (geometric consistency): The assertion that pose and depth maps remain geometrically consistent across frames when derived from the driving video plus target trajectory is foundational, yet the manuscript provides insufficient validation (e.g., no reprojection error statistics or cross-frame consistency metrics) to confirm this holds under the challenging viewpoint changes highlighted in the evaluation.
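The reprojection error statistic requested in the third comment could be computed along the following lines. This is a hedged sketch: the `Camera` tuple, the helper names, and the example values are hypothetical, and the paper's actual condition-generation pipeline is not described in enough detail here to reproduce. The idea is to lift a pixel with known depth from frame A into world space, project it into frame B, and measure the pixel distance to where the condition map for frame B places the same point.

```python
import math
from typing import List, NamedTuple, Tuple

Vec3 = Tuple[float, float, float]


class Camera(NamedTuple):
    rotation: List[List[float]]  # world-to-camera rotation R
    translation: Vec3            # world-to-camera translation t
    focal: float                 # focal length in pixels
    cx: float                    # principal point x
    cy: float                    # principal point y


def unproject(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Lift a pixel with known depth to world space: X_w = R^T (X_c - t)."""
    x_cam = ((u - cam.cx) * depth / cam.focal,
             (v - cam.cy) * depth / cam.focal,
             depth)
    d = [x_cam[i] - cam.translation[i] for i in range(3)]
    x, y, z = (sum(cam.rotation[i][j] * d[i] for i in range(3)) for j in range(3))
    return (x, y, z)


def project(cam: Camera, point_world: Vec3) -> Tuple[float, float]:
    """Project a world point to pixels: X_c = R X_w + t, then perspective divide."""
    x_cam = [
        sum(cam.rotation[i][j] * point_world[j] for j in range(3)) + cam.translation[i]
        for i in range(3)
    ]
    return (cam.focal * x_cam[0] / x_cam[2] + cam.cx,
            cam.focal * x_cam[1] / x_cam[2] + cam.cy)


def reprojection_error(cam_a: Camera, cam_b: Camera, pixel_a: Tuple[float, float],
                       depth_a: float, pixel_b: Tuple[float, float]) -> float:
    """Pixel distance between frame A's point reprojected into B and pixel_b."""
    predicted = project(cam_b, unproject(cam_a, pixel_a[0], pixel_a[1], depth_a))
    return math.hypot(predicted[0] - pixel_b[0], predicted[1] - pixel_b[1])


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_a = Camera(IDENTITY, (0.0, 0.0, 0.0), 500.0, 320.0, 240.0)
cam_b = Camera(IDENTITY, (-0.5, 0.0, 0.0), 500.0, 320.0, 240.0)

# Hypothetical correspondences: one consistent, one off by (3, 4) pixels.
consistent = reprojection_error(cam_a, cam_b, (320.0, 240.0), 2.0, (195.0, 240.0))
inconsistent = reprojection_error(cam_a, cam_b, (320.0, 240.0), 2.0, (198.0, 244.0))
```

Averaging this error over many correspondences and frame pairs, and reporting it separately for small and large viewpoint changes, would give the kind of cross-frame consistency statistic the comment asks for.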
minor comments (2)
- [Abstract] The abstract would benefit from naming the specific benchmarks and briefly indicating the scale of the human study.
- [Figures] Figure captions and the project page reference could more explicitly describe how the visualized pose and depth conditions relate to the target camera parameters.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment below, agreeing where revisions are needed to strengthen the claims, and describe the changes we will incorporate.
- Referee: Method section (two-phase conditioning schedule): The claim that dropping depth after early steps still achieves per-frame intrinsic and extrinsic camera control rests on an untested assumption that the pretrained model will maintain 3D geometry from pose alone. For large viewpoint changes this risks drift in focal length, principal point, or trajectory adherence, directly undermining the joint control contribution; an ablation isolating the late-stage pose-only phase would be required to support the schedule's sufficiency.
  Authors: We agree that an explicit ablation isolating the late-stage pose-only phase is necessary to rigorously support the two-phase schedule, particularly for large viewpoint changes. While the full method demonstrates improved results, we will add this ablation (comparing two-phase conditioning against pose-only throughout sampling) to the revised manuscript to directly address potential drift concerns.
  Revision: yes
- Referee: Experiments and evaluation: The reported improvements in camera adherence and motion fidelity, plus human preference results, lack accompanying quantitative metrics, error bars, statistical significance tests, or detailed protocols for how pose/depth conditions were generated and how adherence was measured. This is load-bearing for the central empirical claim and leaves the comparisons vulnerable to unstated choices in condition generation or evaluation.
  Authors: We acknowledge that the current presentation of results would benefit from greater quantitative detail and transparency. We will revise the Experiments section to report specific metrics for camera adherence and motion fidelity (with error bars and statistical tests where appropriate), along with full protocols for condition generation and adherence measurement, to make the empirical claims more robust and reproducible.
  Revision: yes
- Referee: Condition generation (geometric consistency): The assertion that pose and depth maps remain geometrically consistent across frames when derived from the driving video plus target trajectory is foundational, yet the manuscript provides insufficient validation (e.g., no reprojection error statistics or cross-frame consistency metrics) to confirm this holds under the challenging viewpoint changes highlighted in the evaluation.
  Authors: We recognize that explicit quantitative validation of geometric consistency would strengthen the foundational claim. Our pipeline derives consistent conditions via 3D-aware processing of the driving video and target trajectory, but we will add reprojection error statistics and cross-frame consistency metrics (especially for large viewpoint changes) to the revised manuscript.
  Revision: yes
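For the statistical tests promised in the second response, one simple option for pairwise human preference data is an exact two-sided sign test against chance. The sketch below is an assumption about what such a test could look like, not something the authors describe. The function name is hypothetical and the counts are invented purely to exercise the code; they are not results from the paper.

```python
import math


def sign_test_two_sided(wins: int, trials: int) -> float:
    """Exact two-sided binomial test of `wins` out of `trials` against p = 0.5."""
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need 0 <= wins <= trials and trials > 0")
    # Under p = 0.5 the distribution is symmetric, so double the upper tail
    # starting at the more extreme of (wins, losses) and cap the result at 1.
    k = max(wins, trials - wins)
    upper_tail = sum(math.comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2.0 * upper_tail)


# Invented counts for illustration only: 9 of 10 raters prefer one method.
p_strong = sign_test_two_sided(wins=9, trials=10)
# An even split carries no evidence of a preference.
p_even = sign_test_two_sided(wins=5, trials=10)
```

Ties would need to be dropped or handled separately before applying the test, and multiple comparisons across baselines would call for a correction.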
Circularity Check
No significant circularity; method relies on external pretrained models and empirical validation
The paper introduces a zero-shot conditioning strategy on top of any pretrained image-to-video diffusion model, generating geometrically consistent pose and depth maps from driving video and target trajectory, then applying a two-phase schedule (pose+sparse-depth early, pose-only later). This construction is not self-definitional, does not rename fitted inputs as predictions, and contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The central claims of improved camera adherence and motion fidelity are supported by external benchmark comparisons and human evaluations rather than reducing to the method's own inputs by construction. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: pretrained image-to-video diffusion models accept and respond to conditioning in terms of scene depth and character pose.