pith. machine review for the scientific record.

arxiv: 2605.07354 · v1 · submitted 2026-05-08 · 📡 eess.SP · cs.CV

Recognition: 1 theorem link · Lean Theorem

Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:00 UTC · model grok-4.3

classification 📡 eess.SP cs.CV
keywords: task-oriented communication · human action understanding · edge-cloud co-inference · VQ-VAE · motion tokens · vision-language model · pose estimation · semantic communication

The pith

By turning raw video into a short sequence of discrete motion tokens derived from pose joints at the edge, the system transmits roughly 1 percent of the data of a video codec while letting a cloud vision-language model deliver comparable action understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a task-oriented communication framework that moves human action understanding from full video transmission to a compact token pipeline across edge and cloud. A monocular pose estimator pulls joint coordinates from video at the device, a VQ-VAE turns those coordinates into discrete tokens, and only the short list of codebook indices travels over the network at 9 bits per frame. On the cloud side a lightweight projector maps the tokens into the embedding space of a large vision-language model that has been tuned with instructions to interpret the actions. This design directly attacks the bandwidth, latency, and privacy costs that have blocked scalable edge sensing. Readers interested in real-time smart environments would see immediate value if the accuracy holds, because the same hardware could now run many more simultaneous streams without saturating links or exposing raw footage.
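
A minimal sketch of the edge-side step that carries this pipeline, assuming a 512-entry codebook (so each index fits in the 9 bits the paper cites) and a placeholder encoder; the paper's actual VQ-VAE architecture and codebook size are not given in this summary.

```python
import numpy as np

# Assumed sizes: 512 codewords (log2(512) = 9 bits per index), 17 joints x 3 coords.
CODEBOOK_SIZE = 512
LATENT_DIM = 64

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # learned by the VQ-VAE in practice


def encode_frame(joints: np.ndarray) -> np.ndarray:
    """Placeholder for the trained VQ-VAE encoder: (17, 3) joints -> latent vector."""
    latent = np.zeros(LATENT_DIM)
    flat = joints.reshape(-1)
    latent[: flat.size] = flat
    return latent


def quantize(latent: np.ndarray) -> int:
    """Nearest-codeword lookup; only this index crosses the uplink (9 bits)."""
    distances = np.linalg.norm(codebook - latent, axis=1)
    return int(np.argmin(distances))


frame_joints = rng.normal(size=(17, 3))          # one frame of estimated pose joints
token = quantize(encode_frame(frame_joints))
print(f"transmit index {token} using {(CODEBOOK_SIZE - 1).bit_length()} bits")
```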

Core claim

Converting continuous joint coordinates extracted by a monocular pose estimator into discrete motion tokens through a vector-quantized variational autoencoder, transmitting only the sequence of codebook indices, and aligning them at the cloud through a lightweight projector to an instruction-tuned vision-language model produces action-understanding accuracy comparable to video-codec baselines while cutting transmission payload to approximately 1 percent and end-to-end latency to approximately 20 percent on three standard benchmarks.
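
Back-of-envelope arithmetic behind the payload figure; only the 9 bits per frame comes from the paper, while the frame rate and codec bitrates below are illustrative assumptions.

```python
# Sanity check of the ~1% payload claim under assumed operating points.
FPS = 30
token_bps = 9 * FPS            # 270 bit/s for the motion-token uplink
codec_bps = 27_000             # an aggressively compressed low-rate video stream (assumed)
print(f"token / codec payload ratio: {token_bps / codec_bps:.1%}")   # -> 1.0%
# Against a more typical 500 kbps stream the ratio falls well below 0.1%.
```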

What carries the argument

The vector-quantized variational autoencoder (VQ-VAE) that maps continuous pose joint coordinates into a compact sequence of discrete codebook indices, together with the lightweight projector that aligns those indices to the embedding space of the vision-language model.
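
A sketch of what such a projector could look like, with assumed dimensions (512 codewords, a 4096-wide VLM embedding); the layer sizes and structure here are placeholders rather than the authors' design.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 512    # assumed, consistent with the 9-bit-per-frame indices
VLM_EMBED_DIM = 4096   # assumed token-embedding width of the frozen VLM


class MotionTokenProjector(nn.Module):
    """Maps received codebook indices into the VLM's token-embedding space."""

    def __init__(self, codebook_size: int, vlm_dim: int, hidden: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, hidden)
        self.proj = nn.Sequential(nn.Linear(hidden, vlm_dim), nn.GELU(), nn.Linear(vlm_dim, vlm_dim))

    def forward(self, indices: torch.LongTensor) -> torch.Tensor:
        # indices: (batch, frames) of received codes -> (batch, frames, vlm_dim)
        return self.proj(self.embed(indices))


projector = MotionTokenProjector(CODEBOOK_SIZE, VLM_EMBED_DIM)
received = torch.randint(0, CODEBOOK_SIZE, (1, 120))  # e.g. 4 s of tokens at 30 fps
motion_embeddings = projector(received)               # fed to the frozen VLM alongside the text prompt
print(motion_embeddings.shape)                        # torch.Size([1, 120, 4096])
```

In the paper's framing only lightweight components are trained during instruction tuning while the VLM stays frozen; the sketch reflects that by keeping the VLM out of the trainable module entirely.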

If this is right

  • Uplink payload falls to roughly 1 percent of conventional video-codec streams.
  • End-to-end latency drops to about 20 percent of codec-based pipelines.
  • Action understanding accuracy stays comparable on three public benchmarks.
  • Raw video never leaves the edge device, eliminating the main privacy vector.
  • The projector is trained efficiently through instruction tuning rather than full model retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token pipeline could be tested on related edge tasks such as fall detection or gesture recognition without redesigning the transmission layer.
  • Because the tokens are discrete and short, they may tolerate packet loss better than compressed video frames if simple repetition or forward-error correction is added (a toy repetition sketch follows this list).
  • Domain-specific codebooks trained only on expected actions could shrink the index size even further for narrow deployments.
  • Measuring token robustness under realistic wireless packet errors would show whether the latency gains survive imperfect channels.
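
A toy version of the repetition-and-majority-vote idea raised above, purely illustrative; the channel model and probabilities are assumptions, not measurements from the paper.

```python
from collections import Counter
import random

CODEBOOK_SIZE = 512
rng = random.Random(0)


def noisy_copy(idx: int, loss_prob: float = 0.2, corrupt_prob: float = 0.05):
    """One transmitted copy of a 9-bit index: lost (None), corrupted, or intact."""
    r = rng.random()
    if r < loss_prob:
        return None
    if r < loss_prob + corrupt_prob:
        return rng.randrange(CODEBOOK_SIZE)
    return idx


def send_with_repetition(indices, repeats: int = 3):
    """Repeat each index and majority-vote over whatever arrives at the cloud."""
    recovered = []
    for idx in indices:
        arrived = [c for c in (noisy_copy(idx) for _ in range(repeats)) if c is not None]
        if arrived:
            recovered.append(Counter(arrived).most_common(1)[0][0])
        else:
            recovered.append(recovered[-1] if recovered else 0)   # hold the last token
    return recovered


sent = [17, 402, 402, 88, 256]
print(send_with_repetition(sent))
```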

Load-bearing premise

The discrete motion tokens produced by the VQ-VAE, once aligned by the projector, still contain enough semantic detail for the vision-language model to reach action-understanding accuracy close to what richer video features would allow.

What would settle it

Run the same vision-language model on a benchmark dataset both with the full original video and with only the 9-bit-per-frame token sequence; if recognition accuracy drops by more than a few percentage points under the token-only path, the comparability claim fails.
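
The decisive experiment above can be phrased as a small harness; every name below (the benchmark iterable, the VLM call, the pose tokenizer) is a hypothetical stand-in rather than the paper's evaluation code.

```python
# Hypothetical A/B harness; `benchmark`, `vlm_answer`, and `tokenize_pose` are stand-ins.

def accuracy(preds, truths):
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)


def compare(benchmark, vlm_answer, tokenize_pose):
    """Same VLM, two inputs: full video vs the 9-bit-per-frame token sequence."""
    video_preds, token_preds, labels = [], [], []
    for video, question, label in benchmark:
        labels.append(label)
        video_preds.append(vlm_answer(question, video=video))
        token_preds.append(vlm_answer(question, motion_tokens=tokenize_pose(video)))
    return accuracy(video_preds, labels), accuracy(token_preds, labels)


# Dummy stand-ins so the harness runs end to end; swap in the real model and dataset.
dummy_bench = [("clip_0", "What is the person doing?", "walking"),
               ("clip_1", "What is the person doing?", "jumping")]
dummy_vlm = lambda question, **inputs: "walking"
dummy_tokenizer = lambda video: [17, 402, 88]
print(compare(dummy_bench, dummy_vlm, dummy_tokenizer))   # (0.5, 0.5) for the dummies
```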

Figures

Figures reproduced from arXiv: 2605.07354 by Cheng Yuan, Jiawei Shao, Jingyi Liu, Jun Zhang, Lijun He.

Figure 1: An example of the proposed task-oriented action understanding system […]
Figure 2: System framework of the proposed action understanding pipeline for edge-cloud co-inference. The edge side performs monocular human pose estimation […]
Figure 3: The rate-performance curves of different methods in action understanding benchmarks. Subfigures (a) and (b) show results on Motion-Bench and […]
Figure 4: Average inference latency per user request (the bar plot corresponding to the left y-axis) and benchmark score (the line plot corresponding to the right y-axis) […]
Figure 5: A qualitative comparison between the proposed TOAU method and traditional AV1 compression. Under a low data rate, the AV1-compressed video is […]
Figure 6: A qualitative comparison between the proposed TOAU method and […]
read the original abstract

The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1% and the system latency to around 20% compared to video codec-based solutions, while delivering comparable action understanding accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes TOAU, a task-oriented edge-cloud framework for human action understanding. A monocular pose estimator extracts joint coordinates from raw video; a VQ-VAE quantizes these into discrete motion tokens (reported as 9 bits per frame); only the codebook indices are transmitted. At the cloud, a lightweight projector aligns the token sequence to the embedding space of a vision-language model, which is instruction-tuned for action recognition. On three benchmarks the system is claimed to reduce transmission payload to ~1 % and end-to-end latency to ~20 % of conventional video-codec baselines while preserving comparable accuracy.

Significance. If the accuracy claim is substantiated, the work offers a concrete, privacy-preserving alternative to raw-video or feature transmission in bandwidth-limited edge scenarios. The combination of pose-based VQ-VAE compression with VLM instruction tuning is a timely contribution to task-oriented communication literature and could influence practical deployments in surveillance, robotics, and assisted living.

major comments (3)
  1. [Evaluation section] §4 (or equivalent evaluation section): the headline claim of 'comparable action understanding accuracy' is not supported by the reported evidence. No ablation tables or figures quantify accuracy versus VQ-VAE codebook size, tokens per frame, or reconstruction error on held-out actions; likewise, no comparison is shown between the projector-aligned tokens and either raw continuous pose or richer video features. Without these controls the central performance assertion cannot be evaluated.
  2. [Methods / VQ-VAE subsection] §3.2 (VQ-VAE description): the manuscript states that 9-bit-per-frame indices suffice, yet provides neither the codebook size nor the bit-allocation scheme used to reach this figure, nor any quantitative reconstruction metric (e.g., MPJPE or joint-velocity error) on the three target benchmarks. This information is load-bearing for the claim that kinematic semantics are preserved.
  3. [Evaluation / Baseline subsection] §4 (baseline comparison): the latency and payload reductions are reported relative to 'video codec-based solutions,' but the exact codecs, encoding parameters, and transmission protocols are not specified, nor are statistical error bars or multiple-run averages provided. These omissions prevent verification of the 1 % / 20 % figures.
minor comments (3)
  1. [Abstract and §1] The abstract and introduction refer to 'three benchmarks' without naming them; the first paragraph of the evaluation section should list the datasets explicitly.
  2. [§3.3] Notation for the projector alignment (e.g., the mapping from discrete indices to VLM token embeddings) is introduced without an equation number; adding a numbered equation would improve traceability.
  3. [Figure 1] Figure 1 (system diagram) would benefit from explicit bit-rate annotations on the uplink arrow to match the 9-bit-per-frame claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have addressed each major point below and will revise the manuscript to incorporate the requested clarifications and additional results.

read point-by-point responses
  1. Referee: [Evaluation section] §4 (or equivalent evaluation section): the headline claim of 'comparable action understanding accuracy' is not supported by the reported evidence. No ablation tables or figures quantify accuracy versus VQ-VAE codebook size, tokens per frame, or reconstruction error on held-out actions; likewise, no comparison is shown between the projector-aligned tokens and either raw continuous pose or richer video features. Without these controls the central performance assertion cannot be evaluated.

    Authors: We agree that additional ablation studies and controls are needed to fully substantiate the claim of comparable accuracy. In the revised manuscript we will add tables and figures that report action-understanding accuracy versus VQ-VAE codebook size and number of tokens per frame, together with reconstruction error on held-out actions. We will also include direct comparisons of the projector-aligned tokens against both raw continuous pose coordinates and richer video features extracted from standard encoders. These additions will allow readers to evaluate the performance trade-offs more rigorously. revision: yes

  2. Referee: [Methods / VQ-VAE subsection] §3.2 (VQ-VAE description): the manuscript states that 9-bit-per-frame indices suffice, yet provides neither the codebook size nor the bit-allocation scheme used to reach this figure, nor any quantitative reconstruction metric (e.g., MPJPE or joint-velocity error) on the three target benchmarks. This information is load-bearing for the claim that kinematic semantics are preserved.

    Authors: We thank the referee for highlighting this omission. The revised §3.2 will explicitly state the codebook size and the bit-allocation scheme that yields the reported 9 bits per frame. We will also add quantitative reconstruction metrics (MPJPE and joint-velocity error) evaluated on the three target benchmarks to demonstrate that kinematic semantics are preserved after quantization. revision: yes

  3. Referee: [Evaluation / Baseline subsection] §4 (baseline comparison): the latency and payload reductions are reported relative to 'video codec-based solutions,' but the exact codecs, encoding parameters, and transmission protocols are not specified, nor are statistical error bars or multiple-run averages provided. These omissions prevent verification of the 1 % / 20 % figures.

    Authors: We acknowledge that precise specification of the baselines is required for verification. In the revised manuscript we will detail the exact video codecs employed, their encoding parameters, and the transmission protocols assumed. We will also report statistical error bars and averages computed over multiple runs to support the stated 1 % payload and 20 % latency reductions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system design with independent benchmark validation

full rationale

The paper presents a practical task-oriented communication pipeline (pose estimation → VQ-VAE tokenization → projector alignment → VLM inference) evaluated on three external benchmarks. No equations, uniqueness theorems, or predictions are shown that reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations. All performance claims (payload/latency reduction, comparable accuracy) rest on direct empirical measurements rather than tautological renaming or ansatz smuggling. The derivation chain is a sequence of standard ML components whose outputs are independently verifiable against held-out data.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central performance claims rest on the effectiveness of standard pose estimation and VQ-VAE compression for preserving action semantics, plus successful alignment of the resulting tokens to an existing VLM; no new entities are postulated and free parameters such as codebook size are chosen rather than derived.

free parameters (1)
  • VQ-VAE codebook size and bit allocation
    Determines the discrete token vocabulary and per-frame bit rate (stated as 9 bits); chosen to trade off compression against information retention for downstream action tasks (a small enumeration of this trade-off follows the ledger).
axioms (2)
  • domain assumption A monocular pose estimator can reliably extract continuous joint coordinates from raw video frames at the edge
    Invoked as the first processing step to avoid transmitting raw video pixels.
  • domain assumption The discrete motion tokens retain sufficient information for the cloud VLM to perform complex action understanding after projector alignment
    This underpins the claim that accuracy remains comparable despite extreme compression.
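
A tiny enumeration of the trade-off flagged in the free-parameter entry; the token-per-frame count and codebook sizes are assumptions (the paper states only the 9-bit figure, which at one token per frame would correspond to a 512-entry codebook).

```python
import math

# Illustrative only: how codebook size K sets the per-frame bit rate.
TOKENS_PER_FRAME = 1  # assumed
for codebook_size in (64, 128, 256, 512, 1024, 4096):
    bits = TOKENS_PER_FRAME * math.ceil(math.log2(codebook_size))
    print(f"K={codebook_size:>4}  ->  {bits:>2} bits/frame")
```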

pith-pipeline@v0.9.0 · 5517 in / 1677 out tokens · 48984 ms · 2026-05-11T02:00:23.575788+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

