pith. machine review for the scientific record.

arxiv: 2604.25361 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords human-centric video evaluation · video generation assessment · human motion quality · coarse-to-fine framework · vision language model · 2D pose verification · 3D motion stability · HuM-Bench benchmark

The pith

HuM-Eval evaluates generated human-motion videos through a coarse-to-fine process that first checks overall quality, then verifies anatomical pose and motion stability, so that its scores match human judgments more closely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video generation models struggle to produce natural human movement, and standard quality metrics overlook the fine details that matter to viewers. The paper proposes HuM-Eval to address this by starting with a broad vision-language model assessment of the entire video, then narrowing to 2D pose checks for correct body structure and 3D motion analysis for smooth movement. This staged approach yields 58.2 percent average correlation with human ratings, higher than prior methods. The work also supplies HuM-Bench, a set of 1,000 varied prompts, to test current text-to-video systems on human-centric quality. If the framework holds, developers gain a more reliable signal for improving human figures in generated clips.

Core claim

HuM-Eval is a human-centric evaluation framework that adopts a coarse-to-fine strategy: it first utilizes a Vision Language Model to perform a coarse assessment of global video quality, then proceeds to a fine-grained analysis using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability.
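
The paper gives no reference implementation, so the sketch below only illustrates how such a coarse-to-fine scorer could be wired together. Every name here (coarse_vlm_score, anatomical_score, stability_score) is a placeholder, and the equal-weight fusion is an assumption made for illustration; the actual combination rule is not disclosed in the paper, as the referee notes below.

```python
import numpy as np

def coarse_vlm_score(frames, prompt):
    """Coarse stage: ask a vision-language model for a global quality
    rating of the whole clip. Placeholder -- a real system would call
    a VLM and parse a numeric rating from its answer."""
    raise NotImplementedError("plug in a VLM call here")

def anatomical_score(pose_2d):
    """Fine stage (2D): score anatomical plausibility from per-frame
    2D keypoints. Placeholder for the paper's pose verification."""
    raise NotImplementedError

def stability_score(motion_3d):
    """Fine stage (3D): score temporal smoothness of recovered 3D
    human motion. Placeholder for the paper's stability check."""
    raise NotImplementedError

def hum_eval_score(frames, prompt, pose_2d, motion_3d,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fuse coarse and fine scores into a single number. The paper does
    not disclose its aggregation rule; a weighted average with equal
    weights is assumed here purely for illustration."""
    scores = np.array([
        coarse_vlm_score(frames, prompt),
        anatomical_score(pose_2d),
        stability_score(motion_3d),
    ])
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, scores) / w.sum())
```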

What carries the argument

The coarse-to-fine pipeline that integrates VLM-based global assessment with 2D anatomical verification and 3D motion stability checks to produce a single quality score aligned with human preference.

If this is right

  • Text-to-video models can be ranked more accurately on human motion quality using the HuM-Bench prompts.
  • Developers receive clearer feedback on anatomical errors and movement jitter that current global metrics miss (a jitter-measure sketch follows this list).
  • Evaluation scores become more predictive of viewer satisfaction for clips centered on people.
  • The framework supports systematic comparison of next-generation human motion generators.
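
One concrete form the "movement jitter" signal could take is mean joint jerk, the third time-derivative of joint positions, a common smoothness proxy in motion-generation work. The sketch below is an illustrative stand-in that assumes 3D joint trajectories are already available; it is not HuM-Eval's actual 3D analysis.

```python
import numpy as np

def mean_joint_jerk(joints_3d, fps=24.0):
    """Smoothness proxy: mean magnitude of the third-order finite
    difference (jerk) of 3D joint positions.

    joints_3d: array of shape (T, J, 3) for T frames and J joints.
    Lower values indicate smoother, more stable motion.
    """
    joints_3d = np.asarray(joints_3d, dtype=float)
    dt = 1.0 / fps
    vel = np.diff(joints_3d, axis=0) / dt   # (T-1, J, 3)
    acc = np.diff(vel, axis=0) / dt         # (T-2, J, 3)
    jerk = np.diff(acc, axis=0) / dt        # (T-3, J, 3)
    return float(np.linalg.norm(jerk, axis=-1).mean())
```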

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same staged checks could be applied to evaluate non-human motion such as animals or vehicles by swapping the pose and motion modules.
  • Adding audio or lighting consistency checks might raise the correlation further if human raters weigh those factors heavily.
  • The benchmark set could expose which video generators systematically fail on specific human actions like dancing or sports.
  • If the method generalizes, it offers a practical way to filter training data for future video models before expensive human review.

Load-bearing premise

The specific mix of broad VLM scoring, 2D pose checks, and 3D motion analysis will track human subjective preferences across many different generated videos without overfitting to the chosen test examples.

What would settle it

A new set of human ratings on videos outside the original test collection that shows HuM-Eval correlation falling below the 58.2 percent mark or below competing baselines.
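
Once such fresh ratings exist, the check itself is simple: compute the rank correlation between HuM-Eval scores and the new human scores, then compare it with the published 58.2 percent figure and with the baselines. A minimal sketch, assuming paired per-video scores and Spearman correlation (the paper does not state which correlation statistic underlies its number):

```python
from scipy.stats import spearmanr

def correlation_with_humans(metric_scores, human_ratings):
    """Rank correlation between an automatic metric and human ratings
    over the same set of videos. Spearman is an assumption here."""
    rho, p_value = spearmanr(metric_scores, human_ratings)
    return rho, p_value

# Hypothetical usage on an external validation set:
# rho, p = correlation_with_humans(hum_eval_scores, fresh_human_ratings)
# print(f"external correlation: {rho:.3f} (p = {p:.3g})")
```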

Figures

Figures reproduced from arXiv: 2604.25361 by Bingzi Zhang, Kaisi Guan, Ruihua Song.

Figure 1. Human video evaluation comparison. Baselines often … · view at source ↗
Figure 2. The Framework Overview. Our proposed Coarse-to-Fine evaluation strategy. The VLM provides a holistic perceptual … · view at source ↗
Figure 3. Fine-grained Performance Comparison. We report … · view at source ↗
Original abstract

Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of 58.2%, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HuM-Eval, a coarse-to-fine human-centric evaluation framework for generated videos. It begins with a Vision Language Model performing coarse global quality assessment, followed by fine-grained analysis using 2D pose estimation to verify anatomical correctness and 3D human motion analysis to evaluate motion stability. The authors also present HuM-Bench, a benchmark of 1,000 diverse prompts, and report that HuM-Eval achieves an average correlation of 58.2% with human judgments, outperforming state-of-the-art baselines.

Significance. If the reported correlation is shown to be robust and generalizable, HuM-Eval would address a clear gap in video generation evaluation by focusing on fine-grained human motion details that global metrics overlook. The introduction of HuM-Bench provides a useful standardized resource for assessing text-to-video models on human-centric aspects.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central claim of 58.2% average human correlation is presented without any description of data splits, held-out sets, cross-validation procedure, number of human raters, or error bars. This directly impacts whether the outperformance over baselines reflects genuine improvement or in-sample optimization of the pipeline components.
  2. [§3 (HuM-Eval Framework)] The combination rule fusing the VLM coarse score, 2D anatomical verification outputs, and 3D motion stability features is not specified (e.g., no equations for weighting, thresholds, or aggregation). Without this, it is impossible to determine if the metric is parameter-free or if thresholds/weights were tuned against human ratings on HuM-Bench.
  3. [§4.2 (Benchmark and Correlation Results)] No external validation set or separate test videos are mentioned for the final 58.2% figure. If the VLM prompts, pose error thresholds, or 3D features were selected by optimizing directly on the 1,000-prompt HuM-Bench, the correlation becomes an in-sample statistic rather than evidence of generalization.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by naming the specific state-of-the-art baselines used for comparison.
  2. [§3] Notation for the 2D pose and 3D motion components could be made more consistent across sections to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our paper. We address the major comments point by point, indicating the changes we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim of 58.2% average human correlation is presented without any description of data splits, held-out sets, cross-validation procedure, number of human raters, or error bars. This directly impacts whether the outperformance over baselines reflects genuine improvement or in-sample optimization of the pipeline components.

    Authors: We acknowledge the lack of detailed description regarding the human evaluation setup in the current manuscript. In the revised version, we will add comprehensive information on the data collection, including the number of human raters, the rating procedure, any cross-validation or split methods used, and error bars for the reported correlations. This will allow readers to better assess the reliability of the 58.2% average correlation and confirm that it reflects genuine improvement rather than optimization artifacts. [revision: yes]

  2. Referee: [§3 (HuM-Eval Framework)] The combination rule fusing the VLM coarse score, 2D anatomical verification outputs, and 3D motion stability features is not specified (e.g., no equations for weighting, thresholds, or aggregation). Without this, it is impossible to determine if the metric is parameter-free or if thresholds/weights were tuned against human ratings on HuM-Bench.

    Authors: We agree that the specific combination rule for fusing the different components of HuM-Eval is not sufficiently detailed in §3. We will revise this section to include the exact equations for weighting and aggregation, as well as the thresholds used for 2D and 3D analyses. We will also explicitly state how these parameters were determined to clarify that they were not tuned against the human ratings on HuM-Bench. [revision: yes]

  3. Referee: [§4.2 (Benchmark and Correlation Results)] No external validation set or separate test videos are mentioned for the final 58.2% figure. If the VLM prompts, pose error thresholds, or 3D features were selected by optimizing directly on the 1,000-prompt HuM-Bench, the correlation becomes an in-sample statistic rather than evidence of generalization.

    Authors: We note the referee's concern about the absence of an external validation set. In the revised manuscript, we will clarify the process by which the VLM prompts, pose thresholds, and 3D features were selected, emphasizing that they were based on general principles and not optimized directly on HuM-Bench. Additionally, we will include results from a separate held-out set of videos to provide evidence of generalization beyond the 1,000-prompt benchmark. [revision: yes]
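
The held-out protocol the authors promise here can be made concrete: any fusion weights or thresholds are chosen on one subset of prompts, and the reported correlation comes only from the remainder. The split below is a hypothetical illustration of that procedure, not the authors' actual protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def holdout_correlation(metric_scores, human_ratings,
                        train_frac=0.5, seed=0):
    """Prompt-level holdout: anything tunable (weights, thresholds)
    would be fitted on the `train` indices only, and the correlation
    is reported on the disjoint `test` indices. Sketch only."""
    scores = np.asarray(metric_scores, dtype=float)
    ratings = np.asarray(human_ratings, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(scores))
    cut = int(train_frac * len(scores))
    train, test = idx[:cut], idx[cut:]
    # ... fit fusion weights / thresholds using `train` here ...
    rho, _ = spearmanr(scores[test], ratings[test])
    return rho
```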

Circularity Check

0 steps flagged

No circularity: framework definition and reported correlation are independent

full rationale

The paper defines HuM-Eval as a fixed coarse-to-fine pipeline (VLM global assessment followed by 2D anatomical verification and 3D stability checks) and separately introduces HuM-Bench with 1,000 prompts. The 58.2% human correlation is presented strictly as an empirical outcome of applying this pre-defined framework to the benchmark. No equations, combination rules, thresholds, or weights are shown to be fitted against the human ratings on the same set, and no self-citation chain or self-definitional step reduces the metric to its own inputs. The result is therefore an out-of-sample-style measurement relative to the framework's construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that existing VLMs, 2D pose estimators, and 3D motion trackers are sufficiently accurate on generated videos; no new physical entities or free parameters are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Vision-language models can perform reliable coarse global video quality assessment
    Invoked in the first stage of the coarse-to-fine pipeline
  • domain assumption: 2D pose estimation accurately verifies anatomical correctness in generated human videos
    Used in the fine-grained analysis step
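
The second assumption implies some rule for turning 2D keypoints into an anatomical-correctness signal. The paper's rule is not reproduced here; one crude, commonly used proxy is checking that the bone lengths implied by the keypoints stay consistent across frames. The sketch below is illustrative only, with a hypothetical COCO-style skeleton; whatever skeleton HuM-Eval actually uses may differ.

```python
import numpy as np

# Hypothetical (parent, child) keypoint pairs for a COCO-style
# 17-point skeleton: arms and legs only, for brevity.
BONES = [(5, 7), (7, 9), (6, 8), (8, 10),
         (11, 13), (13, 15), (12, 14), (14, 16)]

def bone_length_inconsistency(keypoints_2d):
    """Anatomical-plausibility proxy: coefficient of variation of each
    bone's 2D length over time, averaged over bones. Lower is better.
    keypoints_2d: array of shape (T, 17, 2). Illustrative only; real
    foreshortening and camera motion also change 2D bone lengths."""
    kp = np.asarray(keypoints_2d, dtype=float)
    lengths = np.stack(
        [np.linalg.norm(kp[:, a] - kp[:, b], axis=-1) for a, b in BONES],
        axis=1)                                   # (T, num_bones)
    cv = lengths.std(axis=0) / (lengths.mean(axis=0) + 1e-8)
    return float(cv.mean())
```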

pith-pipeline@v0.9.0 · 5484 in / 1320 out tokens · 37613 ms · 2026-05-07T16:49:18.893897+00:00 · methodology

discussion (0)

