AnyAct: Towards Human Reenactment of Character Motion From Video

arxiv: 2605.15497 · v2 · pith:25765X4Cnew · submitted 2026-05-15 · 💻 cs.CV · cs.GR

Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu

Pith reviewed 2026-05-19 16:05 UTC · model grok-4.3

keywords: human reenactment · character motion transfer · video-based animation · sparse 2D motion cues · motion retargeting · non-human characters · conditional motion generation

The pith

Sparse local 2D articulated motion cues enable direct human reenactment from non-human character videos without matching 3D structures or topologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that monocular videos of non-human characters can be reinterpreted as plausible initial human performances, using sparse local articulated motion as the transferable signal. Existing methods fall short here because they either assume human-centric structures or require full 3D source motions with known topologies. The proposed AnyAct method generates human motion conditioned on these 2D cues, relying on targeted supervision and training strategies. A sympathetic reader would care because this removes the need for manual retargeting or reconstruction steps, letting animators bootstrap editable human motion directly from diverse video sources. If the central claim holds, it creates a practical starting point for animation authoring pipelines that handle arbitrary characters.

Core claim

AnyAct formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion, supported by human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to reduce conditioning ambiguity, and global-local motion decoupling for reliable control, yielding high-fidelity initial human reenactments that preserve essential dynamics on a new benchmark of non-human character videos.
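
The first of these designs, human-motion-only supervision via augmented 3D-to-2D projection, can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`project_frame`, `make_pair`), the random-yaw augmentation, and the pinhole camera with its `focal` and `depth` parameters are all hypothetical choices. The point is only that a 3D human clip viewed through a randomly sampled camera yields a 2D condition paired with its own 3D target, so no non-human data is needed for supervision.

```python
import math
import random

def project_frame(joints_3d, yaw, focal=1.0, depth=5.0):
    """Rotate joints about the vertical (y) axis by `yaw` radians, then apply
    a pinhole projection. `focal` and `depth` are hypothetical camera settings;
    `depth` must exceed the skeleton's extent so every joint stays in front."""
    c, s = math.cos(yaw), math.sin(yaw)
    joints_2d = []
    for x, y, z in joints_3d:
        xr = c * x + s * z
        zr = -s * x + c * z + depth  # camera-space depth of the rotated joint
        joints_2d.append((focal * xr / zr, focal * y / zr))
    return joints_2d

def make_pair(motion_3d, rng, max_yaw_deg=180.0):
    """Sample one camera per clip (assumed augmentation) and return a
    (2D condition, 3D target) training pair."""
    yaw = math.radians(rng.uniform(-max_yaw_deg, max_yaw_deg))
    cond_2d = [project_frame(frame, yaw) for frame in motion_3d]
    return cond_2d, motion_3d

# Toy clip: 2 frames, 2 joints each, coordinates in metres (made-up values).
clip = [
    [(0.0, 1.0, 0.0), (0.3, 0.5, 0.1)],
    [(0.0, 1.1, 0.0), (0.35, 0.55, 0.1)],
]
cond, target = make_pair(clip, random.Random(0))
print(len(cond), "frames of 2D cues paired with", len(target), "frames of 3D motion")
```

Sampling the camera once per clip, rather than once per frame, keeps the projected trajectories temporally coherent; whether the paper does the same is not stated in the text above.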

What carries the argument

Sparse local 2D articulated motion as the transferable conditioning signal for human motion generation, enabled by the three designs of human-motion-only supervision, progressive training, and global-local decoupling.

If this is right

High-fidelity initial human reenactments can be generated directly from monocular videos of diverse non-human characters.
The reenactments preserve the essential dynamics present in the reference videos.
The three core designs each contribute measurably to output quality as shown by ablation studies.
The resulting human motions are editable and suitable as starting points for downstream animation authoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparse-cue bridge might apply to other cross-domain motion transfers such as animal to robot or stylized to realistic.
If the local cues prove robust, the method could support semi-automatic pipelines that chain video input to full 3D animation with minimal manual cleanup.
Extending the benchmark to include more extreme topology mismatches would test the limits of the stable-bridge assumption.
Combining the approach with existing diffusion-based motion generators could produce longer or more varied sequences without additional 3D supervision.

Load-bearing premise

Sparse local articulated motion cues preserve essential dynamics across large structural differences between non-human characters and humans.

What would settle it

A test video where the output human motion misses a clear dynamic element such as a character's distinctive stride timing or balance shift even though the sparse 2D cues are extracted and fed into the model.
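
One way to make a check like "stride timing" concrete is to compare the dominant period of a cue trajectory with that of the generated motion. The sketch below is only an illustration: the function name `dominant_period`, the choice of autocorrelation, and the idea of measuring a single 1D signal (for example the vertical coordinate of one tracked point) are assumptions, not something the paper specifies.

```python
import math

def dominant_period(signal):
    """Return the lag, in frames, of the strongest autocorrelation peak that
    occurs after the autocorrelation has first gone negative, or None if the
    signal shows no such peak."""
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]

    def autocorr(lag):
        return sum(x[t] * x[t + lag] for t in range(n - lag))

    best_lag, best_val, seen_negative = None, 0.0, False
    for lag in range(1, n // 2):
        r = autocorr(lag)
        if r < 0:
            seen_negative = True  # past the initial decay from lag 0
        elif seen_negative and r > best_val:
            best_lag, best_val = lag, r
    return best_lag

# Synthetic stand-ins: a reference cue with a 20-frame cycle and a
# generated motion whose cycle has drifted to 25 frames.
reference = [math.sin(2 * math.pi * t / 20) for t in range(100)]
generated = [math.sin(2 * math.pi * t / 25) for t in range(100)]
ref_period = dominant_period(reference)
gen_period = dominant_period(generated)
print("timing mismatch (frames):", abs(ref_period - gen_period))
```

A persistent mismatch between the two periods on real data, with the cues themselves extracted cleanly, would be the kind of evidence this section describes.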

Figures

Figures reproduced from arXiv: 2605.15497 by Jiewei Wang, Kanglin Liu, Leidong Fan, Lei Zhong, Liuhan Chen, Li Yuan, Qing Li, Qin Shuai.

**Figure 1.** Human reenactment from reference videos. Given monocular videos of non-human characters with diverse topologies, AnyAct reinterprets their characteristic motion patterns as plausible human performances rather than reproducing their source structures literally. Shown here are reenactments of (a) kangaroo-like jumping, (b) butterfly-like wing flapping, and (c) the periodic paw motion of a beckoning cat. We s…

**Figure 2.** … although the motions of a non-human character (e.g., a jumping kangaroo) and its human reenactment may differ substantially in morphology and topology, their sparse local articulated movements still carry essential and similar dynamic tendencies. This suggests that local sparse motion patterns provide a more stable bridge between monocular character video observations and human reenactment than source-…

**Figure 3.** Given reference videos of characters, AnyAct first extracts local sparse 2D joint trajectories as transferable motion cues from the input video using our model-ensemble-based Versatile Feature Extractor (VFE). These cues are then injected into a human motion generator (MoMask++) through the ControlNet-like 2D Local Adapter (2D-LA) to produce the initial human reenactments that follow the observed character…

**Figure 4.** Motion Condition Learning. We learn reliable motion control for our AnyAct using only human motion data. This is achieved by our proposed augmented 3D-to-2D projection for providing paired supervision, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for suppressing unreliable global root motion.

**Figure 5.** Result Gallery. (1) dancing with side-to-side swaying, following the rhythm of the ghost, (2) deer-like walking, (3) penguin-like walking, (4) monkey-like walking, (5) dinosaur-like walking, (6) seal-like walking (with side-to-side swaying), (7) mechanical-spider-like in-place jumping, (8) toy-robot-like walking (with side-to-side swaying).

**Figure 6.** Qualitative comparison of AnyAct against VLM+HY-Motion and EchoMotion. Based on the reference videos, the human should perform: (1) monster-like flying, (2) cartoon-bear-like jumping, (3) penguin-like walking, and (4) rabbit-like bounding. The results demonstrate that our method achieves superior reenactment quality compared to the other two baselines, while preserving the plausibility of the motion.

**Figure 7.** Result of the user study. We report the preference rates of our AnyAct in pairwise comparisons against (a) VLM+HY-Motion and (b) EchoMotion. Participants evaluated the generated motions based on Reenactment Similarity, Motion Quality, and Overall Preference, respectively. Our method consistently outperforms both baselines across all criteria.

**Figure 8.** Trajectory control and intuitive editing. (1) cat-like walking with forward and half-circle trajectory control, (2) editing the human's height when performing kangaroo-like jumping, (3) editing the human's arm spread when performing penguin-like walking.

**Figure 9.** Generalization to the reenactment of human characters. Although our AnyAct does not aim to achieve the same level of absolute reconstruction as human-centric mocap-based methods, it still demonstrates the ability to perform reliable reenactment of human characters from monocular videos.

**Figure 10.** Results of DancingBox. Reference videos are obtained from the DancingBox project page, where the source captures and results are concatenated together. Our results (on the right) confirm that AnyAct can effectively recover the essential dynamics of physical proxies, as DancingBox does.

**Figure 11.** Examples of adapting Seedance 2.0 for human reenactment. Although the leading closed-source video generation model, Seedance 2.0, possesses generalized world knowledge, it still struggles to produce consistent human reenactment videos driven by non-human motion references. Moreover, 3D motions directly reconstructed from monocular generated videos suffer from inferior quality and artifacts. Thus, such a n…

**Figure 12.** Limitation. Left: AnyAct struggles to generate plausible reenactments for motions far outside the training distribution, such as frog-like swimming involving rapid leg kicks and a rare prone posture. Right: Actions like crab-like sideways walking with rapid multi-leg movements and pixel-level similarities with neighboring points can cause CoTracker3 tracking failures, resulting in noisy 2D features that i…

Original abstract

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
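
The abstract does not say how global-local motion decoupling is realised. As rough intuition only, one common way to separate "where the character goes" from "how its parts move" is to subtract a per-frame global reference from every tracked point. In the sketch below the centroid of the tracked points stands in for the global component; that choice, and the name `decouple`, are assumptions made for illustration and may differ from what AnyAct does.

```python
def decouple(track_2d):
    """Split a 2D point track into a global trajectory (per-frame centroid,
    an assumed stand-in for root motion) and centroid-relative local offsets."""
    global_traj, local = [], []
    for frame in track_2d:
        n = len(frame)
        cx = sum(x for x, _ in frame) / n
        cy = sum(y for _, y in frame) / n
        global_traj.append((cx, cy))
        local.append([(x - cx, y - cy) for x, y in frame])
    return global_traj, local

# Two frames, two tracked points each (made-up pixel coordinates).
track = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 1.0), (7.0, 2.0)]]
# The same motion observed after a constant translation of the whole character.
shifted = [[(x + 10.0, y - 3.0) for x, y in frame] for frame in track]

g1, l1 = decouple(track)
g2, l2 = decouple(shifted)
print("local cues unchanged by global shift:", l1 == l2)
```

The local part is what would serve as the conditioning signal in such a scheme; the global part can be discarded or controlled separately, which matches the trajectory-control examples in Figure 8 only in spirit.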

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AnyAct, a method for generating initial human reenactments directly from monocular videos of non-human characters. It formulates the task as conditional human motion generation driven by transferable sparse local 2D articulated motion cues, with three key designs: human-motion-only supervision through augmented 3D-to-2D projection, progressive 3D-to-2D training to reduce conditioning ambiguity, and global-local motion decoupling. A new benchmark focused on diverse non-human character videos is constructed, and experiments claim that AnyAct achieves high-fidelity reenactments preserving essential dynamics, with ablations validating the designs.

Significance. If the experimental claims hold, the work could offer a practical bridge for animation workflows by enabling motion transfer across large structural mismatches without requiring full 3D source reconstruction or known topologies. The emphasis on sparse local cues and the newly constructed benchmark represent a concrete contribution to character-driven human motion synthesis, though the significance depends on the strength of the quantitative evidence and comparisons in the full results section.

major comments (2)

[§4] Experiments: the central claim of 'high-fidelity initial human reenactments' is supported only by qualitative descriptions and ablations in the provided abstract and high-level summary; without reported quantitative metrics (e.g., MPJPE, FID scores, or user study percentages) or direct comparisons to baselines like motion retargeting methods, it is difficult to assess whether the results substantiate the performance claims over existing approaches.
[§3.3] Global-local motion decoupling: the design is presented as enabling reliable local motion control, but the manuscript does not specify how the decoupling is enforced during inference or whether it introduces artifacts when character topology differs substantially from human skeletons; this assumption is load-bearing for the key insight on sparse cues bridging structural differences.
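
For readers unfamiliar with MPJPE, named in the first comment: it is the mean per-joint position error, the Euclidean distance between predicted and reference joint positions averaged over all joints and frames. The sketch below is a minimal version for illustration; the function name and the toy data are hypothetical, and it presupposes that a reference human motion exists to compare against, which is itself an assumption for non-human source videos.

```python
import math

def mpjpe(pred, ref):
    """Mean per-joint position error: average Euclidean distance between
    corresponding joints, taken over every joint in every frame."""
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred, ref):
        for p, r in zip(pred_frame, ref_frame):
            total += math.dist(p, r)
            count += 1
    return total / count

# One frame, two joints: the first matches exactly, the second is off by 5 units.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
ref = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
error = mpjpe(pred, ref)
print(error)  # 2.5
```

FID, the other metric mentioned, compares feature distributions rather than individual poses and needs a learned feature extractor, so it is not sketched here.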

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief statement of the benchmark size (number of videos, character types) and evaluation protocol to allow readers to gauge the scope of the experiments.
[§3.1] Notation for the sparse local 2D articulated motion cues should be defined consistently (e.g., as a set of 2D keypoints or heatmaps) when first introduced in §3.1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to improve clarity and strengthen the experimental section.

Referee: [§4] Experiments: the central claim of 'high-fidelity initial human reenactments' is supported only by qualitative descriptions and ablations in the provided abstract and high-level summary; without reported quantitative metrics (e.g., MPJPE, FID scores, or user study percentages) or direct comparisons to baselines like motion retargeting methods, it is difficult to assess whether the results substantiate the performance claims over existing approaches.

Authors: We agree that quantitative evidence is important for validating the performance claims. The current manuscript focuses on qualitative results and ablation studies on the newly constructed benchmark to demonstrate preservation of essential dynamics. To address this, we will add direct quantitative comparisons using metrics such as MPJPE for joint accuracy and FID for motion distribution quality, along with comparisons to motion retargeting baselines. A user study with percentage preferences will also be included in the revised experiments section.

revision: yes

Referee: [§3.3] Global-local motion decoupling: the design is presented as enabling reliable local motion control, but the manuscript does not specify how the decoupling is enforced during inference or whether it introduces artifacts when character topology differs substantially from human skeletons; this assumption is load-bearing for the key insight on sparse cues bridging structural differences.

Authors: The decoupling is implemented via separate network branches: a global branch processes overall pose from video context while a local branch operates on sparse 2D keypoints, with a dedicated consistency loss applied only during training. At inference, the branches remain separate: the generator is conditioned on global features extracted once, while local cues are applied independently per frame. Our experiments across diverse character videos indicate minimal artifacts even under large topology mismatches, thanks to the topology-agnostic nature of the sparse cues. We will expand §3.3 with a subsection describing the inference procedure and with additional qualitative analysis of topology variation cases.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper proposes AnyAct for character-video-driven human reenactment using sparse local 2D articulated motion cues as a bridge across structural differences. It describes three engineering designs (human-motion-only 3D-to-2D supervision, progressive training, global-local decoupling) and evaluates them on a newly constructed benchmark. No equations, fitted parameters, self-citations, or derivation steps are present in the text that reduce any claimed prediction or result to an input by construction. The central claims rest on experimental validation rather than self-referential definitions or renamed fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that local motion cues transfer across body structures; no free parameters or invented entities are named in the abstract, and the training designs are presented as engineering choices rather than new physical postulates.

axioms (1)

Domain assumption: sparse local articulated motion cues preserve essential dynamics across large structural differences.
The abstract states this explicitly as the key insight that enables the bridge from character video to human reenactment.

pith-pipeline@v0.9.0 · 5786 in / 1333 out tokens · 70641 ms · 2026-05-19T16:05:29.268465+00:00 · methodology
