arXiv: 2605.06667 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI · cs.LG

ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Pith reviewed 2026-05-08 12:11 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords video generation · camera control · motion transfer · diffusion models · zero-shot learning · pose conditioning · depth maps

The pith

ActCam enables zero-shot joint control of character motion and per-frame camera parameters in video generation by generating consistent pose and depth conditions for pretrained diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ActCam as a method to control both the actor's performance and the camera's trajectory in generated videos without training any new models. It starts from a driving video that supplies the desired character motion and a separate specification of the target camera path. From these, the method creates pose and depth maps that stay geometrically consistent from frame to frame. These maps are then supplied to any existing image-to-video diffusion model through a two-phase schedule: the first part of denoising uses both pose and sparse depth to lock in overall scene structure, after which depth is removed and pose-only guidance refines motion details. The result is videos that more closely follow both the intended actions and the specified camera moves than prior pose-only or combined methods.

Core claim

ActCam generates pose and depth conditions that remain geometrically consistent across frames from a source video with a moving character and a target camera motion. It then runs a single sampling process with a two-phase conditioning schedule on any pretrained image-to-video diffusion model that accepts scene depth and character pose: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details. This produces per-frame control of intrinsic and extrinsic camera parameters together with motion transfer in a zero-shot setting.
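
The two-phase schedule described above can be sketched as a small helper that decides, for each denoising step, which conditioning signals are active. This is a minimal illustration, not the authors' implementation: the function name `two_phase_schedule`, the parameter `depth_fraction`, and the reading of the switch point as a fraction of the total step count are all assumptions made here.

```python
from typing import List, Tuple


def two_phase_schedule(num_steps: int, depth_fraction: float) -> List[Tuple[str, ...]]:
    """Return the conditioning signals active at each denoising step.

    Hypothetical helper: the first round(depth_fraction * num_steps) steps
    use pose and depth, every later step uses pose only.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie in [0, 1]")
    # Index of the first step that no longer receives depth conditioning.
    switch_step = round(depth_fraction * num_steps)
    return [
        ("pose", "depth") if step < switch_step else ("pose",)
        for step in range(num_steps)
    ]


if __name__ == "__main__":
    # 10 steps with a 0.2 fraction: depth is active for the first 2 steps only.
    for step, signals in enumerate(two_phase_schedule(num_steps=10, depth_fraction=0.2)):
        print(step, "+".join(signals))
```

The 0.2 in the usage lines mirrors the N_D = 0.2 quoted in the Figure 4 caption, read here as a fraction of steps; a real pipeline would pass the resulting signals to whatever conditioning interface the chosen backbone exposes.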

What carries the argument

The two-phase conditioning schedule on pose and depth maps that are kept geometrically consistent across frames.

If this is right

  • Videos better match both the supplied character actions and the chosen camera paths than pose-only baselines.
  • Human viewers prefer the outputs especially when large viewpoint changes are required.
  • The same pretrained diffusion backbone can be reused for different motion sources and camera specifications without retraining.
  • Control extends to both intrinsic parameters such as focal length and extrinsic parameters such as camera position and orientation on a per-frame basis.
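
To make the distinction between intrinsic and extrinsic parameters concrete, the sketch below projects a 3D point with a standard pinhole camera model whose parameters can differ from frame to frame. It is only an illustration under stated assumptions: the `Camera` dataclass, its field names, and the `project` function are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Camera:
    """Hypothetical per-frame pinhole camera."""
    focal: float                          # intrinsic: focal length in pixels
    cx: float                             # intrinsic: principal point, x
    cy: float                             # intrinsic: principal point, y
    rotation: Sequence[Sequence[float]]   # extrinsic: 3x3 world-to-camera rotation
    translation: Vec3                     # extrinsic: world-to-camera translation


def project(point: Vec3, cam: Camera) -> Tuple[float, float, float]:
    """Project a world-space point to pixel coordinates; returns (u, v, depth)."""
    # Extrinsics: move the point from world space into camera space.
    xc, yc, zc = (
        sum(cam.rotation[i][j] * point[j] for j in range(3)) + cam.translation[i]
        for i in range(3)
    )
    if zc <= 0.0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective division followed by focal scaling and offset.
    u = cam.focal * xc / zc + cam.cx
    v = cam.focal * yc / zc + cam.cy
    return u, v, zc


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    wide = Camera(500.0, 320.0, 240.0, identity, (0.0, 0.0, 0.0))
    tele = Camera(1000.0, 320.0, 240.0, identity, (0.0, 0.0, 0.0))
    keypoint = (1.0, 0.0, 2.0)
    print(project(keypoint, wide))  # same point, shorter focal length
    print(project(keypoint, tele))  # same point, longer focal length
```

Changing only `focal` between frames moves the projected keypoint without moving the camera, which is the kind of per-frame intrinsic change the last bullet refers to.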

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged removal of depth conditioning may generalize to other auxiliary signals that become over-constraining in later denoising steps.
  • Success under large viewpoint changes suggests that geometric consistency in the input conditions matters more than the absolute amount of conditioning information.
  • The approach could be tested on driving videos captured with consumer cameras to check robustness when the source motion and target camera are less perfectly aligned.

Load-bearing premise

That pose and depth conditions can be generated to remain geometrically consistent across frames and that a two-phase conditioning schedule on a pretrained diffusion model is sufficient to achieve per-frame intrinsic and extrinsic camera control without any training or fine-tuning.

What would settle it

Apply ActCam to a driving video and a target camera trajectory that includes a 180-degree rotation around the character, then measure whether the generated video's viewpoints match the specified trajectory while preserving the character's motion sequence.
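
One way to set up such a test is sketched below: build a target yaw trajectory that sweeps 180 degrees, then score estimated camera rotations against it with the geodesic angle between rotation matrices. All names (`orbit_yaw_angles`, `yaw_rotation`, `rotation_error_deg`, `mean_rotation_error_deg`) are hypothetical, and the source of the estimated rotations (for example a camera pose estimator run on the generated video) is assumed rather than specified by the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def yaw_rotation(angle_deg: float) -> Matrix:
    """Rotation about the vertical (y) axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def orbit_yaw_angles(num_frames: int, total_deg: float = 180.0) -> List[float]:
    """Evenly spaced yaw angles from 0 to total_deg, one per frame."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [total_deg * i / (num_frames - 1) for i in range(num_frames)]


def rotation_error_deg(r_a: Matrix, r_b: Matrix) -> float:
    """Geodesic angle in degrees between two rotation matrices."""
    # trace(R_a^T R_b) equals the element-wise product summed over all entries.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def mean_rotation_error_deg(target: List[Matrix], estimated: List[Matrix]) -> float:
    """Average per-frame rotation error between two equally long trajectories."""
    if len(target) != len(estimated):
        raise ValueError("trajectories must have the same number of frames")
    return sum(rotation_error_deg(t, e) for t, e in zip(target, estimated)) / len(target)


if __name__ == "__main__":
    target = [yaw_rotation(a) for a in orbit_yaw_angles(num_frames=5)]
    # Stand-in for estimated cameras: the target with a constant 2-degree yaw offset.
    estimated = [yaw_rotation(a + 2.0) for a in orbit_yaw_angles(num_frames=5)]
    print(mean_rotation_error_deg(target, estimated))
```

A translation error term and a motion-fidelity score would be needed alongside this rotation metric to cover the full test described above.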

Figures

Figures reproduced from arXiv: 2605.06667 by Baptiste Bellot-Gurlet, Fabio Pizzati, Ivan Laptev, Omar El Khalifi, Oscar Fossey, Philip Torr, Thibault Fouque, Thomas Rossi, Ulysse Mizrahi.

Figure 1: Overview. ActCam enables zero-shot joint control of acting motion and camera motion for single-image video generation from a reference image, assuming only widespread conditioning capability of the backbone model on depth and keypoints. Given a reference image, an acting video representing the desired motion, and a target per-frame camera trajectory, ActCam generates a video that preserves identity while f…
Figure 2: ActCam pipeline. Given a reference image, an acting video, and a target camera trajectory, we (1) estimate background depth from an inpainted reference, (2) recover motion and align it to the background scene via fitting, and (3) rasterize pose and depth+pose control signals under the target viewpoint. A two-phase denoising schedule conditions early steps on depth+pose for stronger camera control, then ref…
Figure 3: User study. We compare with Uni3C on camera adherence (Camera) and motion faithfulness (Motion) with respect to the conditioning input, alongside overall visual quality (Visual). We considerably outperform Uni3C, the closest method to ours.
Figure 4: Effect of N_D on VBench score. The figure shows the average VBench scores as a function of N_D, where the conditioning switches from pose+depth to pose-only. Early switching under-constrains the generation, while late switching (low t) can propagate depth artifacts into high-frequency details, harming results. We set an optimal N_D = 0.2.
Figure 5: Importance of conditioning schedule. Panels: Depth Map, Without Condition Schedule, With Condition Schedule. Excessive depth guidance (setting N_D = 1) can overly constrain the scene, producing static backgrounds under camera motion (center, red circle). Instead, N_D < 1 allows the barbell to move flexibly and follow the human motion (right). 4.4 Ablation studies. Balance of depth conditioning. We vary the number of initial diffusion steps conditioned on both pose and depth (N_D …
Figure 8: Importance of scene transfer. Without scene transfer (No alignment), the condition does not respect 3D coherence. Uniform weighting improves placement, but importance weighting (ours) is required to achieve the best results. The red arrows (right column) show depth/position offsets. Scene transfer. In Section 3.2, we describe how we align the composed character depth with the rendered environment depth to st…
Figure 9: Comparison with Uni3C. Uni3C yields suboptimal camera control (top, middle) and unrealistic character motion (bottom). The insets visualize the control signal for both Uni3C and ActCam.
Figure 10: Different cameras. We first show the conditioning signal and ActCam results (top two rows). In the next three rows, we vary the camera movements. As visible, the character appearance and motion remain consistent. ACM Trans. Graph., Vol. 1, No. 1, Article . Publication date: May 2026
Figure 11: Different scenes. We display two outputs of ActCam showing the same motion rendered on two characters in different scenes, using the same camera controls.
Figure 12: Different scenes and different cameras. To show the flexibility of our approach, we apply the same motion to two characters in different scenes while also varying the camera control. ActCam still renders the correct motion.
Figure 13: Multi-character results. ActCam handles multiple characters by applying the scene transfer and motion fitting independently per character.
original abstract

For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ActCam, a zero-shot method for joint control of character motion (from a driving video) and per-frame camera intrinsics/extrinsics (from a target trajectory) in video generation. It builds on pretrained image-to-video diffusion models by generating geometrically consistent pose and depth conditions, then applies a two-phase sampling schedule: early denoising steps condition on both pose and sparse depth to enforce structure, after which depth is dropped and pose-only guidance is used to refine details. Evaluations on benchmarks with diverse motions and large viewpoint changes claim improved camera adherence and motion fidelity over pose-only and competing pose+camera methods, along with higher human preference.

Significance. If the central claims hold, ActCam offers a practical, training-free advance in controllable video synthesis by enabling simultaneous 3D motion transfer and cinematographic control. The zero-shot reliance on existing models combined with geometrically consistent conditioning and staged guidance is a notable strength that could generalize to other diffusion-based tasks. This addresses an important gap for artistic video generation applications where both actor performance and camera work must be specified precisely.

major comments (3)
  1. [Method (two-phase schedule)] Method section (two-phase conditioning schedule): The claim that dropping depth after early steps still achieves per-frame intrinsic and extrinsic camera control rests on an untested assumption that the pretrained model will maintain 3D geometry from pose alone. For large viewpoint changes this risks drift in focal length, principal point, or trajectory adherence, directly undermining the joint control contribution; an ablation isolating the late-stage pose-only phase would be required to support the schedule's sufficiency.
  2. [Experiments] Experiments and evaluation: The reported improvements in camera adherence and motion fidelity, plus human preference results, lack accompanying quantitative metrics, error bars, statistical significance tests, or detailed protocols for how pose/depth conditions were generated and how adherence was measured. This is load-bearing for the central empirical claim and leaves the comparisons vulnerable to unstated choices in condition generation or evaluation.
  3. [Method (condition generation)] Condition generation (geometric consistency): The assertion that pose and depth maps remain geometrically consistent across frames when derived from the driving video plus target trajectory is foundational, yet the manuscript provides insufficient validation (e.g., no reprojection error statistics or cross-frame consistency metrics) to confirm this holds under the challenging viewpoint changes highlighted in the evaluation.
minor comments (2)
  1. [Abstract] The abstract would benefit from naming the specific benchmarks and briefly indicating the scale of the human study.
  2. [Figures] Figure captions and the project page reference could more explicitly describe how the visualized pose and depth conditions relate to the target camera parameters.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below, agreeing where revisions are needed to strengthen the claims, and describe the changes we will incorporate.

point-by-point responses
  1. Referee: Method section (two-phase conditioning schedule): The claim that dropping depth after early steps still achieves per-frame intrinsic and extrinsic camera control rests on an untested assumption that the pretrained model will maintain 3D geometry from pose alone. For large viewpoint changes this risks drift in focal length, principal point, or trajectory adherence, directly undermining the joint control contribution; an ablation isolating the late-stage pose-only phase would be required to support the schedule's sufficiency.

    Authors: We agree that an explicit ablation isolating the late-stage pose-only phase is necessary to rigorously support the two-phase schedule, particularly for large viewpoint changes. While the full method demonstrates improved results, we will add this ablation (comparing two-phase conditioning against pose-only throughout sampling) to the revised manuscript to directly address potential drift concerns. revision: yes

  2. Referee: Experiments and evaluation: The reported improvements in camera adherence and motion fidelity, plus human preference results, lack accompanying quantitative metrics, error bars, statistical significance tests, or detailed protocols for how pose/depth conditions were generated and how adherence was measured. This is load-bearing for the central empirical claim and leaves the comparisons vulnerable to unstated choices in condition generation or evaluation.

    Authors: We acknowledge that the current presentation of results would benefit from greater quantitative detail and transparency. We will revise the Experiments section to report specific metrics for camera adherence and motion fidelity (with error bars and statistical tests where appropriate), along with full protocols for condition generation and adherence measurement, to make the empirical claims more robust and reproducible. revision: yes

  3. Referee: Condition generation (geometric consistency): The assertion that pose and depth maps remain geometrically consistent across frames when derived from the driving video plus target trajectory is foundational, yet the manuscript provides insufficient validation (e.g., no reprojection error statistics or cross-frame consistency metrics) to confirm this holds under the challenging viewpoint changes highlighted in the evaluation.

    Authors: We recognize that explicit quantitative validation of geometric consistency would strengthen the foundational claim. Our pipeline derives consistent conditions via 3D-aware processing of the driving video and target trajectory, but we will add reprojection error statistics and cross-frame consistency metrics (especially for large viewpoint changes) to the revised manuscript. revision: yes
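
The reprojection error statistics requested by the referee and promised in the last response could be computed along the following lines. This is a minimal sketch under assumptions: the function names `reprojection_errors` and `summarize_errors` are hypothetical, and the inputs are taken to be matched lists of 2D keypoints in pixels, one list rendered from the 3D pose under the target camera and one detected in the corresponding frame.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point2D = Tuple[float, float]


def reprojection_errors(projected: Sequence[Point2D],
                        observed: Sequence[Point2D]) -> List[float]:
    """Per-keypoint Euclidean distance in pixels between matched 2D points."""
    if len(projected) != len(observed):
        raise ValueError("keypoint lists must have the same length")
    return [math.hypot(px - ox, py - oy)
            for (px, py), (ox, oy) in zip(projected, observed)]


def summarize_errors(errors: Sequence[float]) -> Dict[str, float]:
    """Mean, root-mean-square, and maximum of a non-empty list of errors."""
    if not errors:
        raise ValueError("errors must be non-empty")
    n = len(errors)
    return {
        "mean": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "max": max(errors),
    }


if __name__ == "__main__":
    # Toy data: two keypoints, one matching exactly and one off by (3, 4) pixels.
    projected = [(100.0, 50.0), (203.0, 84.0)]
    observed = [(100.0, 50.0), (200.0, 80.0)]
    print(summarize_errors(reprojection_errors(projected, observed)))
```

Reporting these statistics separately for small and large viewpoint changes would address the referee's specific concern about the challenging cases.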

Circularity Check

0 steps flagged

No significant circularity; method relies on external pretrained models and empirical validation

full rationale

The paper introduces a zero-shot conditioning strategy on top of any pretrained image-to-video diffusion model, generating geometrically consistent pose and depth maps from driving video and target trajectory, then applying a two-phase schedule (pose+sparse-depth early, pose-only later). This construction is not self-definitional, does not rename fitted inputs as predictions, and contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work. The central claims of improved camera adherence and motion fidelity are supported by external benchmark comparisons and human evaluations rather than reducing to the method's own inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that existing pretrained diffusion models can effectively utilize pose and depth conditioning when the inputs are made geometrically consistent; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Pretrained image-to-video diffusion models accept and respond to conditioning in terms of scene depth and character pose
    The method is explicitly built on any such model as stated in the abstract.

pith-pipeline@v0.9.0 · 5569 in / 1228 out tokens · 50884 ms · 2026-05-08T12:11:38.591800+00:00 · methodology

