AnyAct: Towards Human Reenactment of Character Motion From Video
Pith reviewed 2026-05-19 16:05 UTC · model grok-4.3
The pith
Sparse local 2D articulated motion cues enable direct human reenactment from non-human character videos without matching 3D structures or topologies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnyAct formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. Three designs support this formulation: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to reduce conditioning ambiguity, and global-local motion decoupling for reliable local motion control. On a new benchmark primarily covering non-human character videos, the authors report high-fidelity initial human reenactments that preserve essential dynamics.
What carries the argument
Sparse local 2D articulated motion as the transferable conditioning signal for human motion generation, enabled by the three designs of human-motion-only supervision, progressive training, and global-local decoupling.
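To make "human-motion-only supervision via augmented 3D-to-2D projection" concrete, here is a minimal sketch of how 3D human joints might be turned into sparse 2D cues under a randomized camera. It is not the authors' implementation: the pinhole camera model, the yaw-only rotation, the sampling ranges, and all names (`rotate_y`, `project`, `augmented_projection`, `keep`) are assumptions made for illustration.

```python
import math
import random


def rotate_y(point, angle):
    """Rotate a 3D point (x, y, z) about the vertical y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def project(point, focal=1.0, depth_offset=5.0):
    """Pinhole projection; the point is pushed `depth_offset` units in front of the camera.

    Assumes |z| < depth_offset so the camera-space depth stays positive.
    """
    x, y, z = point
    zc = z + depth_offset
    return (focal * x / zc, focal * y / zc)


def augmented_projection(joints_3d, keep, rng):
    """Project a sparse subset of joints to 2D under a random yaw and focal length.

    joints_3d: list of (x, y, z) tuples for one frame.
    keep: indices of the joints retained as sparse cues (hypothetical choice).
    rng: a random.Random instance, so the augmentation is reproducible.
    """
    yaw = rng.uniform(-math.pi, math.pi)   # assumed augmentation range
    focal = rng.uniform(0.8, 1.5)          # assumed augmentation range
    return [project(rotate_y(joints_3d[i], yaw), focal) for i in keep]


# Toy four-joint pose; only joints 0 and 3 are kept as sparse cues.
pose = [(0.0, 1.0, 0.0), (0.3, 0.5, 0.1), (-0.3, 0.5, -0.1), (0.0, 0.0, 0.0)]
cues = augmented_projection(pose, keep=[0, 3], rng=random.Random(0))
```

Under these assumptions, each training sample pairs the original 3D human motion (the target) with one randomly projected sparse 2D view of it (the condition), so no character data is needed for supervision.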
If this is right
- High-fidelity initial human reenactments can be generated directly from monocular videos of diverse non-human characters.
- The reenactments preserve the essential dynamics present in the reference videos.
- The three core designs each contribute measurably to output quality as shown by ablation studies.
- The resulting human motions are editable and suitable as starting points for downstream animation authoring.
Where Pith is reading between the lines
- The same sparse-cue bridge might apply to other cross-domain motion transfers such as animal to robot or stylized to realistic.
- If the local cues prove robust, the method could support semi-automatic pipelines that chain video input to full 3D animation with minimal manual cleanup.
- Extending the benchmark to include more extreme topology mismatches would test the limits of the stable-bridge assumption.
- Combining the approach with existing diffusion-based motion generators could produce longer or more varied sequences without additional 3D supervision.
Load-bearing premise
Sparse local articulated motion cues preserve essential dynamics across large structural differences between non-human characters and humans.
What would settle it
A test video where the output human motion misses a clear dynamic element such as a character's distinctive stride timing or balance shift even though the sparse 2D cues are extracted and fed into the model.
Original abstract
We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AnyAct, a method for generating initial human reenactments directly from monocular videos of non-human characters. It formulates the task as conditional human motion generation driven by transferable sparse local 2D articulated motion cues, with three key designs: human-motion-only supervision through augmented 3D-to-2D projection, progressive 3D-to-2D training to reduce conditioning ambiguity, and global-local motion decoupling. A new benchmark focused on diverse non-human character videos is constructed, and experiments claim that AnyAct achieves high-fidelity reenactments preserving essential dynamics, with ablations validating the designs.
Significance. If the experimental claims hold, the work could offer a practical bridge for animation workflows by enabling motion transfer across large structural mismatches without requiring full 3D source reconstruction or known topologies. The emphasis on sparse local cues and the newly constructed benchmark represent a concrete contribution to character-driven human motion synthesis, though the significance depends on the strength of the quantitative evidence and comparisons in the full results section.
major comments (2)
- [§4] §4, Experiments: the central claim of 'high-fidelity initial human reenactments' is supported only by qualitative descriptions and ablations in the provided abstract and high-level summary; without reported quantitative metrics (e.g., MPJPE, FID scores, or user study percentages) or direct comparisons to baselines like motion retargeting methods, it is difficult to assess whether the results substantiate the performance claims over existing approaches.
- [§3.3] §3.3, Global-local motion decoupling: the design is presented as enabling reliable local motion control, but the manuscript does not specify how the decoupling is enforced during inference or whether it introduces artifacts when character topology differs substantially from human skeletons; this assumption is load-bearing for the key insight on sparse cues bridging structural differences.
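For reference, MPJPE (mean per-joint position error), one of the metrics named in the first major comment, can be sketched as below. This assumes paired ground-truth human motion exists for each test sequence, which the abstract does not confirm for the benchmark; the function name and data layout are illustrative.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error over all frames and joints.

    pred, target: lists of frames; each frame is a list of (x, y, z) joints.
    Returns the average Euclidean distance between corresponding joints.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of joints")
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance (Python 3.8+)
            count += 1
    if count == 0:
        raise ValueError("no joints to compare")
    return total / count


# One frame, two joints: the first matches exactly, the second is off by 5 units.
example_error = mpjpe(
    [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]],
    [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]],
)
```

Because the task reinterprets non-human motion rather than reconstructing it, a joint-error metric like this only applies where a reference human performance is available; distribution-level metrics or user studies do not have that requirement.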
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief statement of the benchmark size (number of videos, character types) and evaluation protocol to allow readers to gauge the scope of the experiments.
- [§3.1] Notation for the sparse local 2D articulated motion cues should be defined consistently (e.g., as a set of 2D keypoints or heatmaps) when first introduced in §3.1.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to improve clarity and strengthen the experimental section.
- Referee: [§4] §4, Experiments: the central claim of 'high-fidelity initial human reenactments' is supported only by qualitative descriptions and ablations in the provided abstract and high-level summary; without reported quantitative metrics (e.g., MPJPE, FID scores, or user study percentages) or direct comparisons to baselines like motion retargeting methods, it is difficult to assess whether the results substantiate the performance claims over existing approaches.
  Authors: We agree that quantitative evidence is important for validating the performance claims. The current manuscript focuses on qualitative results and ablation studies on the newly constructed benchmark to demonstrate preservation of essential dynamics. To address this, we will add direct quantitative comparisons using metrics such as MPJPE for joint accuracy and FID for motion distribution quality, along with comparisons to motion retargeting baselines. A user study with percentage preferences will also be included in the revised experiments section.
  revision: yes
- Referee: [§3.3] §3.3, Global-local motion decoupling: the design is presented as enabling reliable local motion control, but the manuscript does not specify how the decoupling is enforced during inference or whether it introduces artifacts when character topology differs substantially from human skeletons; this assumption is load-bearing for the key insight on sparse cues bridging structural differences.
  Authors: The decoupling is implemented via separate network branches: a global branch processes overall pose from video context while a local branch operates on sparse 2D keypoints, with a dedicated consistency loss applied only during training. At inference, the branches remain separate: the generator is conditioned on global features extracted once, while local cues are applied independently per frame. Our experiments across diverse character videos indicate minimal artifacts even under large topology mismatches, thanks to the topology-agnostic nature of the sparse cues. We will expand §3.3 with an inference-procedure subsection and additional qualitative analysis of topology variation cases.
  revision: yes
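As a data-level illustration of what "global-local decoupling" can mean for 2D keypoints, the sketch below separates a root trajectory (global) from root-relative offsets (local). This is not the two-branch architecture described in the simulated rebuttal and is not taken from the paper; the choice of a root joint and the names `decouple` and `recombine` are assumptions.

```python
def decouple(frames, root=0):
    """Split 2D keypoints into a global root trajectory and root-relative local offsets.

    frames: list of frames; each frame is a list of (x, y) keypoints.
    root: index of the keypoint treated as the root (an assumed convention).
    """
    global_traj = [frame[root] for frame in frames]
    local = [
        [(x - rx, y - ry) for (x, y) in frame]
        for frame, (rx, ry) in zip(frames, global_traj)
    ]
    return global_traj, local


def recombine(global_traj, local):
    """Inverse of `decouple`: add the root position back onto each local offset."""
    return [
        [(x + rx, y + ry) for (x, y) in frame]
        for frame, (rx, ry) in zip(local, global_traj)
    ]


# Two frames of a three-keypoint character; the second frame is the first shifted by (5, 2).
frames = [[(0, 0), (1, 2), (-1, 2)], [(5, 2), (6, 4), (4, 4)]]
trajectory, local_motion = decouple(frames)
```

In this toy example the local offsets are identical across the two frames because the character only translates, which is the property that makes local cues independent of where the character sits in the image.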
Circularity Check
No significant circularity
The paper proposes AnyAct for character-video-driven human reenactment using sparse local 2D articulated motion cues as a bridge across structural differences. It describes three engineering designs (human-motion-only 3D-to-2D supervision, progressive training, global-local decoupling) and evaluates them on a newly constructed benchmark. No equations, fitted parameters, self-citations, or derivation steps are present in the text that reduce any claimed prediction or result to an input by construction. The central claims rest on experimental validation rather than self-referential definitions or renamed fits.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparse local articulated motion cues preserve essential dynamics across large structural differences