pith. machine review for the scientific record.

arxiv: 2604.20157 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

HumanScore: Benchmarking Human Motions in Generated Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 01:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords: human motion evaluation · video generation · biomechanical metrics · kinematic analysis · temporal stability · motion fidelity · AI video quality · failure mode analysis

The pith

HumanScore metrics show AI video models produce motions that look real but violate biomechanical rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HumanScore, a framework of six metrics drawn from kinematics and biomechanics to assess human body movements in AI-generated videos. It applies these metrics to outputs from thirteen state-of-the-art models across a range of movement prompts at different intensities. The evaluation uncovers systematic differences between what viewers perceive as plausible and what meets physical motion standards, along with common problems such as frame-to-frame jitter, impossible body configurations, and gradual loss of consistency. This distinction matters for any use of generated video where accurate human dynamics are required, such as training simulations or animation pipelines. The framework supplies concrete numerical scores that support direct model comparisons based on physically grounded criteria rather than visual impression alone.

Core claim

HumanScore defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency. Applied to videos generated by thirteen state-of-the-art models, the metrics reveal consistent gaps between perceptual plausibility and motion biomechanical fidelity, identify recurrent failure modes including temporal jitter, anatomically implausible poses, and motion drift, and yield robust model rankings derived from quantitative and physically meaningful criteria.
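The abstract does not spell out the metric formulas. As a purely illustrative sketch, a temporal-stability metric in this spirit could penalize jerk (the third time derivative of joint position) in keypoint trajectories; the function name, array convention, and score squashing below are assumptions, not the paper's definitions.

```python
import numpy as np

def temporal_stability(keypoints: np.ndarray, fps: float = 24.0) -> float:
    """Score in (0, 1]: near 1.0 for smooth motion, lower for jitter.

    keypoints: (T, J, 3) array -- T frames, J joints, 3D positions.
    Jerk is the third finite difference of position over time; jittery
    generated videos produce large jerk magnitudes frame to frame.
    """
    dt = 1.0 / fps
    jerk = np.diff(keypoints, n=3, axis=0) / dt**3    # (T-3, J, 3)
    mean_jerk = np.linalg.norm(jerk, axis=-1).mean()  # average magnitude
    return 1.0 / (1.0 + mean_jerk)                    # squash to (0, 1]

# A skeleton drifting linearly has ~zero jerk; added noise tanks the score.
t = np.linspace(0.0, 1.0, 48)
smooth = np.stack([np.stack([t, t, t], axis=-1)] * 17, axis=1)  # (48, 17, 3)
jittery = smooth + np.random.default_rng(0).normal(0.0, 0.05, smooth.shape)
assert temporal_stability(smooth) > temporal_stability(jittery)
```

Dividing by dt cubed makes the score frame-rate aware, so a 24 fps and a 48 fps clip of the same motion are penalized comparably.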

What carries the argument

HumanScore, the set of six metrics based on kinematic and biomechanical principles that quantify motion quality separately from overall visual appearance.

If this is right

  • Model rankings shift when evaluation uses physical motion criteria instead of viewer preference alone.
  • Developers gain specific targets for fixing temporal jitter and anatomical violations in future architectures.
  • Applications that depend on realistic human dynamics, such as virtual training or robotics simulation, can use the metrics for validation beyond visual checks.
  • The identified failure modes point to the need for generation methods that enforce longer-term anatomical and physical constraints.
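One anatomical failure mode named above, frame-to-frame stretching of limbs, lends itself to a simple check. The bone list and scoring below are hypothetical illustrations, not the paper's metric.

```python
import numpy as np

# Hypothetical bone list over a small skeleton; the index pairs are
# illustrative, not the paper's actual topology.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

def limb_length_consistency(keypoints: np.ndarray) -> float:
    """Mean coefficient of variation of bone lengths across frames.

    keypoints: (T, J, 3). Real limbs are rigid, so each bone's length
    should stay constant over time; generated videos often stretch or
    shrink them. Lower is better (0.0 = perfectly rigid).
    """
    cvs = []
    for a, b in BONES:
        lengths = np.linalg.norm(keypoints[:, a] - keypoints[:, b], axis=-1)
        cvs.append(lengths.std() / (lengths.mean() + 1e-8))
    return float(np.mean(cvs))

# A rigid pose translated through space scores ~0; a pose whose limbs
# grow 50% over the clip scores well above it.
pose = np.random.default_rng(1).normal(size=(7, 3))
rigid = np.stack([pose + np.array([0.1 * i, 0.0, 0.0]) for i in range(30)])
stretchy = rigid * np.linspace(1.0, 1.5, 30)[:, None, None]
assert limb_length_consistency(rigid) < limb_length_consistency(stretchy)
```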

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training video generators with direct biomechanical penalties derived from these metrics could reduce reliance on purely perceptual objectives.
  • The same metrics could be tested on interactive or multi-person generated scenes to check whether current gaps widen under more complex conditions.
  • Correlations between HumanScore values and downstream task performance, such as accurate pose tracking, would indicate practical utility beyond diagnostic ranking.

Load-bearing premise

The six metrics derived from kinematic and biomechanical principles accurately and comprehensively measure human motion quality in generated videos without interference from unrelated visual factors, and the selected prompts represent typical human movements.

What would settle it

A blind rating study in which biomechanics experts rate motion naturalness for videos drawn from opposite ends of the HumanScore distribution. If the expert rankings fail to match the metric order, the claim that the metrics capture biomechanical fidelity is falsified.
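That falsification test reduces to rank agreement between expert ratings and metric scores. A minimal sketch, with invented numbers standing in for real study data:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors
    (assumes no ties, which holds for distinct scores)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Invented numbers for illustration: HumanScore values vs. blinded expert
# naturalness ratings for the same eight videos.
humanscore = np.array([91.0, 62.0, 78.0, 55.0, 88.0, 70.0, 49.0, 83.0])
experts    = np.array([ 8.9,  6.0,  6.9,  5.8,  8.4,  7.5,  4.1,  8.1])
rho = spearman_rho(humanscore, experts)
# A rho near +1 supports the metrics; near zero or negative would falsify them.
```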

Figures

Figures reproduced from arXiv: 2604.20157 by Ehsan Adeli, Li Fei-Fei, Narayan Schuetz, Scott Delp, Tiange Xiang, Tian Tan, Yusu Fang.

Figure 1
Figure 1: Real or AI? This figure compares frames from real videos alongside AI-generated ones from modern video generators. Can you tell which is which? Processed full videos are in the supplementary material and answers are in the footnote. HumanScore is designed to automatically detect subtle biomechanical violations produced by AI video generators, even when they are imperceptible to the human eye.
Figure 2
Figure 2: The performance of human mesh recovery methods is converging at low errors and video generators are becoming more realistic. Evaluation of human motion quality requires moving beyond pixel appearance to 3D structure and dynamics. It demands representations that capture biomechanics (e.g., limb length consistency), kinematics (e.g., range of motion), and dynamics (e.g., acceleration feasibility).
Figure 3
Figure 3: Overview of HumanScore. The pipeline begins with curating a representative set of human motions from a large pool of common actions. For each motion, we carefully design prompts to mitigate model-specific biases and ensure consistent conditioning across generators. The refined prompts are then passed to both proprietary and open-source state-of-the-art video generation models.
Figure 4
Figure 4: Our curated motion set includes 17 distinct motions for each difficulty level, each with two levels of intensity, resulting in 102 unique prompts in total. High-frequency words across prompts are illustrated.
Figure 5
Figure 5: Prompt design. Different models tend to have different biases when generating videos, which may lead to unnatural scenes, truncated bodies, moving cameras, or multiple people. We experimented extensively with prompt engineering to mitigate these model biases and obtain the most stable motions in the generated videos, which facilitates metric calculation. Best viewed when zoomed in.
Figure 6
Figure 6: Biomechanical hierarchy of our evaluation framework. Each tier, from fundamental (bottom) to advanced (top), is evaluated using two independent metrics. Anatomical Correctness is the foundation of all biomechanical evaluations, ensuring a valid and stable topological structure across time. It guards against implausible phenomena such as duplicated arms/legs or frame-to-frame stretching and shrinking.
Figure 7
Figure 7: HumanScore metric values show strong alignment with human preference. The plot compares the averaged HumanScore win rate (Y-axis) against the overall human preference win ratio (X-axis). A linear fit is included to visualize the correlation and the overall Spearman's correlation coefficient (ρ) is reported.
Figure 8
Figure 8: Model rankings (y-axis) across different tolerance scales (x-axis). The rankings remain consistent across scales, demonstrating the robustness of our metrics.
Figure 9
Figure 9: Ternary plots of model rankings under varying hyperparameters. Each point in the ternary diagram corresponds to one combination of (α, β, γ). For each model, the ranking remains consistent across different ternary coordinates.
Figure 10
Figure 10: Detailed breakdown of benchmark results across each evaluation dimension (left) and motion difficulty level (right). The value range shown in the plots is (50, 100).
Figure 11
Figure 11: For metrics that rely on skeletal fitting, we use a two-stage pipeline. First, we apply a pretrained pose/keypoint detector to infer 87 3D keypoints from monocular observations. Second, we fit a biomechanics-informed human skeleton model to these keypoints via iterative optimization.
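The captions describe a kinematic range-of-motion tier alongside anatomical checks. A hedged illustration of such a check follows; the joint geometry and the 25-degree floor are invented, not the paper's tolerances.

```python
import numpy as np

def joint_angle(a, b, c):
    """Interior angle at joint b (degrees) between segments b->a and b->c."""
    u, v = a - b, c - b
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

def knee_violations(hips, knees, ankles, min_deg=25.0):
    """Count frames where the knee folds tighter than is anatomically
    plausible. The 25-degree floor is illustrative, not the paper's bound."""
    return sum(
        joint_angle(h, k, a) < min_deg
        for h, k, a in zip(hips, knees, ankles)
    )

# A straight leg (180 degrees) passes; an ankle teleported next to the hip
# produces an impossible fold and is flagged.
hip, knee = np.array([0.0, 1.0, 0.0]), np.zeros(3)
straight, folded = np.array([0.0, -1.0, 0.0]), np.array([0.0, 0.9, 0.1])
assert knee_violations([hip, hip], [knee, knee], [straight, folded]) == 1
```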
Original abstract

Recent advances in model architectures, compute, and data scale have driven rapid progress in video generation, producing increasingly realistic content. Yet, no prior method systematically measures how faithfully these systems render human bodies and motion dynamics. In this paper, we present HumanScore, a systematic framework to evaluate the quality of human motions in AI-generated videos. HumanScore defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency, enabling fine-grained diagnosis beyond visual realism alone. Through carefully designed prompts, we elicit a diverse set of movements at varying intensities and evaluate videos generated by thirteen state-of-the-art models. Our analysis reveals consistent gaps between perceptual plausibility and motion biomechanical fidelity, identifies recurrent failure modes (e.g., temporal jitter, anatomically implausible poses, and motion drift), and produces robust model rankings from quantitative and physically meaningful criteria.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents HumanScore, a systematic framework for evaluating the quality of human motions in AI-generated videos. It defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency. The authors apply these metrics to videos generated by thirteen state-of-the-art models, using carefully designed prompts that elicit diverse movements. The evaluation reveals consistent gaps between perceptual plausibility and biomechanical motion fidelity, identifies recurrent failure modes such as temporal jitter, anatomically implausible poses, and motion drift, and produces robust model rankings from quantitative and physically meaningful criteria.

Significance. If the metrics prove to be well-validated and independent of other visual factors, this work could provide a valuable, physically grounded benchmark for video generation research, helping diagnose limitations in current models and guiding improvements in human motion synthesis. The separation of motion fidelity from general visual realism addresses a clear gap in the field.

major comments (1)
  1. [Abstract] The central claims depend on the six metrics accurately capturing human motion quality from kinematic and biomechanical principles, yet the abstract provides no exact formulas, no computation details from video frames, no validation against real motion-capture ground truth, and no statistical significance tests for the rankings. This undermines verification of the reported gaps, failure modes, and model rankings.
minor comments (1)
  1. The abstract refers to 'carefully designed prompts' and 'diverse set of movements' without describing the prompt set, intensity variations, or how representativeness was ensured; this should be expanded for reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights important aspects for improving the transparency and rigor of HumanScore. We address the major comment point-by-point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claims depend on the six metrics accurately capturing human motion quality from kinematic and biomechanical principles, yet the abstract provides no exact formulas, no computation details from video frames, no validation against real motion-capture ground truth, and no statistical significance tests for the rankings. This undermines verification of the reported gaps, failure modes, and model rankings.

    Authors: We thank the referee for this observation. The six metrics are defined with exact mathematical formulations in Section 3 of the manuscript, derived from standard kinematic and biomechanical principles (e.g., joint-angle constraints based on anatomical ranges, temporal derivatives for stability, and center-of-mass trajectory analysis for consistency). Computation proceeds by first applying a pose estimator (OpenPose) to extract 2D keypoints per frame, then computing the metrics from the resulting time-series trajectories. However, we acknowledge that the current version lacks (i) explicit pseudocode or expanded derivation steps for each metric, (ii) direct validation against MoCap ground-truth datasets, and (iii) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values and effect sizes) on the model rankings. In the revised manuscript we will add a dedicated subsection in Section 3 with full equations, pseudocode, and implementation details; include a validation experiment on real videos from Human3.6M/AMASS to confirm metric alignment with expected biomechanical properties; and report statistical tests supporting the observed gaps and rankings. These changes will make the claims fully verifiable while preserving the original experimental results and conclusions. revision: yes
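The significance testing the rebuttal promises can be sketched. The per-prompt scores below are invented, and a paired t-statistic stands in for the fuller battery the simulated authors describe.

```python
import numpy as np

def paired_t(scores_a, scores_b):
    """Paired t-statistic over per-prompt scores for two models.
    Large |t| indicates the ranking gap is unlikely to be prompt noise."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))

# Invented per-prompt HumanScore values for two hypothetical models, where
# model A beats model B on every prompt by roughly two points.
model_a = np.array([78.2, 81.0, 74.5, 79.9, 76.3, 80.1])
model_b = model_a - np.array([1.8, 2.4, 1.1, 2.9, 1.6, 2.2])
t_stat = paired_t(model_a, model_b)  # well above common significance cutoffs
```

Pairing by prompt matters: the same prompt goes to both models, so differencing removes prompt-difficulty variance before testing.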

Circularity Check

0 steps flagged

No significant circularity detected in the derivation or evaluation chain

full rationale

The paper defines HumanScore as a new benchmark consisting of six metrics drawn from standard kinematic and biomechanical principles, then applies them directly to a set of generated videos elicited via prompts. Model rankings and gap analysis follow from straightforward computation of these metrics on the outputs; no equations, predictions, or first-principles claims are shown to reduce by construction to fitted parameters, self-citations, or the target results themselves. The framework is self-contained as an empirical measurement tool without load-bearing internal loops or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted; the framework appears to rest on standard kinematic and biomechanical concepts from prior literature.

pith-pipeline@v0.9.0 · 5456 in / 1033 out tokens · 22519 ms · 2026-05-10T01:14:26.142611+00:00 · methodology


Reference graph

Works this paper leans on

77 extracted references · 27 canonical work pages · 6 internal anchors

  1. [1]

    https://research.google.com/youtube8m/(2016) 4 HumanScore 15

    Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. https://research.google.com/youtube8m/(2016) 4 HumanScore 15

  2. [2]

    AI, K.K.: Kling 2.5 turbo pro video generation model.https://app.klingai.com/ (2025), accessed: 2025-11-14 9, 10, 13, 16

  3. [3]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Akhter, I., Black, M.J.: Pose-conditioned joint angle limits for 3d human pose reconstruction. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1446–1455 (2015) 12

  4. [4]

    Accessed: 2026-03-03 10, 16

    Alibaba Cloud Community: Model studio: Wan 2.6 & wan 2.5 video generation prompt guide.https://www.alibabacloud.com/blog/602776(Jan 2026), us- age/prompting guide for the WAN 2.6/2.5 series. Accessed: 2026-03-03 10, 16

  5. [5]

    Arkhipkin, V., Korviakov, V., Gerasimenko, N., Parkhomenko, D., Vasilev, V., Letunovskiy, A., Vaulin, N., Kovaleva, M., Kirillov, I., Novitskiy, L., Koposov, D., Kiselev, N., Varlamov, A., Mikhailov, D., Polovnikov, V., Shutkin, A., Agafonova, J., Vasiliev, I., Kargapoltseva, A., Dmitrienko, A., Maltseva, A., Averchenkova, A., Kim, O., Nikulina, T., Dimit...

  6. [6]

    Journal of neuroengineering and rehabilitation 12(1), 112 (2015) 13

    Balasubramanian, S., Melendez-Calderon, A., Roby-Brami, A., Burdet, E.: On the analysis of movement smoothness. Journal of neuroengineering and rehabilitation 12(1), 112 (2015) 13

  7. [7]

    In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition

    Baumgartner, T., Klatt, S.: Monocular 3d human pose estimation for sports broad- casts using partial sports field registration. In: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition. pp. 5109–5118 (2023) 5

  8. [8]

    Bilibili: Bilibili.https://www.bilibili.com/(2025) 4

  9. [9]

    IEEE transactions on pattern analysis and machine intelligence43(1), 172–186 (2019) 11

    Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: Realtime multi- person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence43(1), 172–186 (2019) 11

  10. [10]

    A short note on the kinetics-700 human action dataset

    Carreira, J., Noland, E., Hillier, C., Zisserman, A.: A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987 (2019) 5

  11. [11]

    Computer methods in biomechanics and biomedical engineering22(1), 21–24 (2019) 8

    Catelli, D.S., Wesseling, M., Jonkers, I., Lamontagne, M.: A musculoskeletal model customized for squatting task. Computer methods in biomechanics and biomedical engineering22(1), 21–24 (2019) 8

  12. [12]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Davydov, A., Engilberge, M., Salzmann, M., Fua, P.: Cloaf: Collision-aware hu- man flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1176–1185 (2024) 13

  13. [13]

    IEEE transactions on biomedical engineering 54(11), 1940–1950 (2007) 8

    Delp, S.L., Anderson, F.C., Arnold, A.S., Loan, P., Habib, A., John, C.T., Guen- delman, E., Thelen, D.G.: Opensim: open-source software to create and analyze dynamic simulations of movement. IEEE transactions on biomedical engineering 54(11), 1940–1950 (2007) 8

  14. [14]

    Duan, H.-X

    Duan, H., Yu, H.X., Chen, S., Fei-Fei, L., Wu, J.: Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983 (2025) 3, 14

  15. [15]

    Accessed: 2026-03-03 10, 16

    fal.ai: Pixverse v5.5 developer guide.https://fal.ai/learn/devs/pixverse- v5-5-developer-guide(Dec 2025), aPI usage guide for PixVerse v5.5. Accessed: 2026-03-03 10, 16

  16. [16]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025) 9, 10, 13, 16

  17. [17]

    2023 , url =

    Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4d: Reconstructing and tracking humans with transformers. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14737–14748 (2023). https://doi.org/10.1109/ICCV51070.2023.013584, 13 16 Y. Fang, T. Xiang et al

  18. [18]

    Frontiers in Robotics and AI7, 13 (2020).https://doi.org/10.3389/frobt.2020.00013, https://www.frontiersin.org/articles/10.3389/frobt.2020.00013/full9, 10

    Grimmer, M.: Human lower limb joint biomechanics in daily life activities: A lit- erature based requirement analysis for anthropomorphic robot design. Frontiers in Robotics and AI7, 13 (2020).https://doi.org/10.3389/frobt.2020.00013, https://www.frontiersin.org/articles/10.3389/frobt.2020.00013/full9, 10

  19. [19]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating di- verse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5152–5161 (June 2022) 4

  20. [20]

    Lippincott Williams & Wilkins (2006) 7

    Hamill, J., Knutzen, K.M.: Biomechanical basis of human movement. Lippincott Williams & Wilkins (2006) 7

  21. [21]

    What’s in the image? a deep-dive into the vision of vision language models

    Han, H., Li, S., Chen, J., Yuan, Y., Wu, Y., Deng, Y., Leong, C.T., Du, H., Fu, J., Li, Y., Zhang, J., Zhang, C., Li, L.j., Ni, Y.: Video-bench: Human-aligned video generation benchmark. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18858–18868 (2025).https://doi.org/10. 1109/CVPR52734.2025.017573

  22. [22]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Hong, W., Cheng, Y., Yang, Z., Wang, W., Wang, L., Gu, X., Huang, S., Dong, Y., Tang, J.: Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8450–8460 (2025) 3

  23. [23]

    In: CVPR, pp

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21807– 21818 (2024).https://doi.org/10.1109/CVP...

  24. [24]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., et al.: Vbench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503 (2024) 3

  25. [25]

    In: Computer Vision and Pattern Recognition (CVPR) (2018) 11

    Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Computer Vision and Pattern Recognition (CVPR) (2018) 11

  26. [26]

    arXiv preprint arXiv:2403.09669 (2024) 3

    Kim,P.J.,Kim,S.,Yoo,J.:Stream:Spatio-temporalevaluationandanalysismetric for video generative models. arXiv preprint arXiv:2403.09669 (2024) 3

  27. [27]

    In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020) 11

    Kocabas, M., Athanasiou, N., Black, M.J.: Vibe: Video inference for human body pose and shape estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020) 11

  28. [28]

    GenAI-Bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

    Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Ling, T., Xia, X., Zhang, P., Neubig, G., Ramanan, D.: Genai-bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743 (2024) 3

  29. [29]

    The quest for generalizable motion generation: Data, model, and evaluation.arXiv preprint arXiv:2510.26794, 2025

    Lin, J., Wang, R., Lu, J., Huang, Z., Song, G., Zeng, A., Liu, X., Wei, C., Yin, W., Sun, Q., et al.: The quest for generalizable motion generation: Data, model, and evaluation. arXiv preprint arXiv:2510.26794 (2025) 4

  30. [30]

    Advances in Neural Information Processing Systems (2023) 4

    Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems (2023) 4

  31. [31]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014) 5

  32. [32]

    Towards un- derstanding camera motions in any video.arXiv preprint arXiv:2504.15376,

    Lin, Z., Cen, S., Jiang, D., Karhade, J., Wang, H., Mitra, C., Ling, Y.T.T., Huang, Y., Liu, S., Chen, M., Zawar, R., Bai, X., Du, Y., Gan, C., Ramanan, D.: Towards HumanScore 17 understanding camera motions in any video. arXiv preprint arXiv:2504.15376 (2025) 3

  33. [33]

    Evaluating text-to-visual generation with image-to-text generation, 2024

    Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D.: Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291 (2024) 3

  34. [34]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: Vmbench: A benchmark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13087–13098 (2025) 3

  35. [35]

    com/content/ICCV2025/supplemental/Ling_VMBench_A_Benchmark_ICCV_2025_ supplemental.pdf, supplemental material available athttps://openaccess

    Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: Vmbench: A benchmark for perception-aligned video mo- tion generation (supplemental material) (2025),https://openaccess.thecvf. com/content/ICCV2025/supplemental/Ling_VMBench_A_Benchmark_ICCV_2025_ supplemental.pdf, supplemental material available athttps://openacces...

  36. [36]

    Fr\’echet video motion distance: A metric for evaluating motion consistency in videos,

    Liu, J., Qu, Y., Yan, Q., Zeng, X., Wang, L., Liao, R.: Fréchet video motion distance: A metric for evaluating motion consistency in videos. arXiv preprint arXiv:2407.16124 (2024) 4

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: Evalcrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22139–22149 (2024) 3

  38. [38]

    LLC, G.: Veo 3.1 video generation model.https://ai.google.dev/gemini-api/ video-generation(2025), accessed: 2025-11-14 9, 10, 13, 16

  39. [39]

    Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinnedmulti-personlinearmodel.ACMTrans.Graphics(Proc.SIGGRAPHAsia) 34(6), 248:1–248:16 (Oct 2015) 4, 5

  40. [40]

    ACM Transactions on Graphics (TOG)34(6), 1–16 (2015) 5

    Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: a skinned multi-person linear model. ACM Transactions on Graphics (TOG)34(6), 1–16 (2015) 5

  41. [41]

    ai/ray(2026), product/model page for Ray3

    Luma AI: Ai video generation with ray3 & dream machine.https://lumalabs. ai/ray(2026), product/model page for Ray3. Accessed: 2026-03-03 10, 15

  42. [42]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5442–5451 (2019) 4

  43. [43]

    In: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games

    Maiorca, A., Bohy, H., Yoon, Y., Dutoit, T.: Objective evaluation metric for motion generative models: Validating fréchet motion distance on foot skating and over- smoothing artifacts. In: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games. pp. 1–11 (2023) 4

  44. [44]

    Maiorca, A., Yoon, Y., Dutoit, T.: Evaluating the quality of a synthesized motion with the fréchet motion distance (2022),https://arxiv.org/abs/2204.12318, sIGGRAPH Poster 2022 4

  45. [45]

    MiniMax Hailuo 02 Fan-site: Minimax hailuo 02 | world’s #2 ai video generation model.https://www.hailuo02.net/(2025), accessed: 2026-03-04 9, 10, 16

  46. [46]

    OpenAI: Sora 2 system card.https://openai.com/index/sora-2-system-card/ (Sep 2025), accessed: 2025-11-14 9, 10, 13, 16

  47. [47]

    In: Proceedings IEEE Conf

    Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). pp. 10975–10985 (2019) 4, 5 18 Y. Fang, T. Xiang et al

  48. [48]

    arXiv preprint arXiv:2507.08268 (2025) 7

    Peiffer, J., Shah, K., Djuraskovic, I., Anarwala, S., Abdou, K., Patel, R., Jaya- balan, P., Pennicooke, B., Cotton, R.J.: Portable biomechanics laboratory: Clin- ically accessible movement analysis from a handheld smartphone. arXiv preprint arXiv:2507.08268 (2025) 7

  49. [49]

    Pika Community: Pika 2.2.https://pikalabs.org/pika-2-2/(Mar 2025), ac- cessed: 2026-03-04 9, 10, 16

  50. [50]

    Big Data 4(4), 236–252 (Dec 2016).https://doi.org/10.1089/big.2016.0028,http:// dx.doi.org/10.1089/big.2016.00284

    Plappert, M., Mandery, C., Asfour, T.: The kit motion-language dataset. Big Data 4(4), 236–252 (Dec 2016).https://doi.org/10.1089/big.2016.0028,http:// dx.doi.org/10.1089/big.2016.00284

  51. [51]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021) 5

  52. [52]

    IEEE transactions on biomedical engineering63(10), 2068–2079 (2016) 8, 5

    Rajagopal, A., Dembia, C.L., DeMers, M.S., Delp, D.D., Hicks, J.L., Delp, S.L.: Full-body musculoskeletal model for muscle-driven simulation of human gait. IEEE transactions on biomedical engineering63(10), 2068–2079 (2016) 8, 5

  53. [53]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. arXiv preprint arXiv:1908.10084 (2019) 5

  54. [54]

    In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M

    Rempe, D., Guibas, L.J., Hertzmann, A., Russell, B., Villegas, R., Yang, J.: Con- tact and human dynamics from monocular video. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 71–87. Springer International Publishing, Cham (2020) 4

  55. [55]

    a pilot study

    Roren, A., Mazarguil, A., Vaquero-Ramos, D., Deloose, J.B., Vidal, P.P., Nguyen, C., Rannou, F., Wang, D., Oudre, L., Lefèvre-Colau, M.M.: Assessing smoothness of arm movements with jerk: a comparison of laterality, contraction mode and plane of elevation. a pilot study. Frontiers in Bioengineering and Biotechnology9, 782740 (2022) 13

  56. Sárándi, I., Hermans, A., Leibe, B.: Learning 3D human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2023)

  57. Sárándi, I., Linder, T., Arras, K.O., Leibe, B.: MeTRAbs: Metric-scale truncation-robust heatmaps for absolute 3D human pose estimation. IEEE Transactions on Biometrics, Behavior, and Identity Science 3(1), 16–30 (2020)

  58. Shen, Z., Pi, H., Xia, Y., Cen, Z., Peng, S., Hu, Z., Bao, H., Hu, R., Zhou, X.: World-grounded human motion recovery via gravity-view coordinates. In: SIGGRAPH Asia Conference Proceedings (2024)

  59. Shimada, S., Golyanik, V., Xu, W., Theobalt, C.: PhysCap: Physically plausible monocular 3D motion capture in real time. ACM Transactions on Graphics 39(6) (Dec 2020)

  60. Singh, S., Bible, J., Liu, Z., Zhang, Z., Singapogu, R.: Motion smoothness metrics for cannulation skill assessment: What factors matter? Frontiers in Robotics and AI 8, 625003 (2021)

  61. Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8406–8416 (2025). https://doi.org/10.1109/CVPR52734.2025.00787

  62. Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2602–2611 (2017)

  63. Tang, S., Chen, C., Xie, Q., Chen, M., Wang, Y., Ci, Y., Bai, L., Zhu, F., Yang, H., Yi, L., et al.: HumanBench: Towards general human-centric perception with projector assisted pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21970–21982 (2023)

  64. Tripathi, S., Müller, L., Huang, C.H.P., Taheri, O., Black, M.J., Tzionas, D.: 3D human pose estimation via intuitive physics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2023)

  65. Ugrinovic, N., Pan, B., Pavlakos, G., Paschalidou, D., Shen, B., Sanchez-Riera, J., Moreno-Noguer, F., Guibas, L.: MultiPhys: Multi-person physics-aware 3D motion estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  66. Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges (2019). https://arxiv.org/abs/1812.01717

  67. Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 601–617 (2018)

  68. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  69. Wang, K., Zhang, L., Zhang, J.: Detecting human artifacts from text-to-image models. arXiv preprint arXiv:2411.13842 (2024)

  70. Wang, Y., Sun, Y., Patel, P., Daniilidis, K., Black, M.J., Kocabas, M.: PromptHMR: Promptable human mesh recovery. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1148–1159 (2025)

  71. Wang, Z., Li, P., Liu, H., Deng, Z., Wang, C., Liu, J., Yuan, J., Liu, M.: Recognizing actions from robotic view for natural human-robot interaction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14218–14227 (2025)

  72. Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: HunyuanVideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025)

  73. Xiang, T., Wang, K.C., Heo, J., Adeli, E., Yeung, S., Delp, S., Fei-Fei, L.: NeuHMR: Neural rendering-guided human motion reconstruction. In: 2025 International Conference on 3D Vision (3DV). pp. 1518–1528 (2025). https://doi.org/10.1109/3DV66043.2025.00142

  74. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  75. YouTube: YouTube. https://www.youtube.com/ (2025)

  76. Zhang, Y., Kephart, J.O., Cui, Z., Ji, Q.: PhysPT: Physics-aware pretrained transformer for estimating human dynamics from monocular videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2305–2317 (June 2024)

  77. Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., Shah, M.: Deep learning-based human pose estimation: A survey. ACM Comput. Surv. (Jun 2023). https://doi.org/10.1145/3603618

HumanScore: Benchmarking Human Motions in Generated Videos
Supplementary Material

Table of Contents
A Teaser ..................