pith. machine review for the scientific record.

arxiv: 2605.07354 · v1 · submitted 2026-05-08 · 📡 eess.SP · cs.CV

Recognition: 1 theorem link · Lean Theorem

Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:00 UTC · model grok-4.3

classification 📡 eess.SP cs.CV
keywords: task-oriented communication · human action understanding · edge-cloud co-inference · VQ-VAE · motion tokens · vision-language model · pose estimation · semantic communication

The pith

By turning raw video into a short sequence of discrete motion tokens derived from pose joints at the edge, the system transmits roughly 1 percent of the data of a video codec while letting a cloud vision-language model deliver comparable action understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a task-oriented communication framework that moves human action understanding from full video transmission to a compact token pipeline across edge and cloud. A monocular pose estimator pulls joint coordinates from video at the device, a VQ-VAE turns those coordinates into discrete tokens, and only the short list of codebook indices travels over the network at 9 bits per frame. On the cloud side a lightweight projector maps the tokens into the embedding space of a large vision-language model that has been tuned with instructions to interpret the actions. This design directly attacks the bandwidth, latency, and privacy costs that have blocked scalable edge sensing. Readers interested in real-time smart environments would see immediate value if the accuracy holds, because the same hardware could now run many more simultaneous streams without saturating links or exposing raw footage.
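
A minimal sketch of the edge-side step that carries this pipeline, assuming a 512-entry codebook (so each index fits in the 9 bits the paper cites) and a placeholder encoder; the paper's actual VQ-VAE architecture and codebook size are not given in this summary.

```python
import numpy as np

# Assumed sizes: 512 codewords (log2(512) = 9 bits per index), 17 joints x 3 coords.
CODEBOOK_SIZE = 512
LATENT_DIM = 64

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))  # learned by the VQ-VAE in practice


def encode_frame(joints: np.ndarray) -> np.ndarray:
    """Placeholder for the trained VQ-VAE encoder: (17, 3) joints -> latent vector."""
    latent = np.zeros(LATENT_DIM)
    flat = joints.reshape(-1)
    latent[: flat.size] = flat
    return latent


def quantize(latent: np.ndarray) -> int:
    """Nearest-codeword lookup; only this index crosses the uplink (9 bits)."""
    distances = np.linalg.norm(codebook - latent, axis=1)
    return int(np.argmin(distances))


frame_joints = rng.normal(size=(17, 3))          # one frame of estimated pose joints
token = quantize(encode_frame(frame_joints))
print(f"transmit index {token} using {(CODEBOOK_SIZE - 1).bit_length()} bits")
```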

Core claim

Converting continuous joint coordinates extracted by a monocular pose estimator into discrete motion tokens through a vector-quantized variational autoencoder, transmitting only the sequence of codebook indices, and aligning them at the cloud through a lightweight projector to an instruction-tuned vision-language model produces action-understanding accuracy comparable to video-codec baselines while cutting transmission payload to approximately 1 percent and end-to-end latency to approximately 20 percent on three standard benchmarks.
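
Back-of-envelope arithmetic behind the payload figure; only the 9 bits per frame comes from the paper, while the frame rate and codec bitrates below are illustrative assumptions.

```python
# Sanity check of the ~1% payload claim under assumed operating points.
FPS = 30
token_bps = 9 * FPS            # 270 bit/s for the motion-token uplink
codec_bps = 27_000             # an aggressively compressed low-rate video stream (assumed)
print(f"token / codec payload ratio: {token_bps / codec_bps:.1%}")   # -> 1.0%
# Against a more typical 500 kbps stream the ratio falls well below 0.1%.
```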

What carries the argument

The vector-quantized variational autoencoder (VQ-VAE) that maps continuous pose joint coordinates into a compact sequence of discrete codebook indices, together with the lightweight projector that aligns those indices to the embedding space of the vision-language model.
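
A sketch of what such a projector could look like, with assumed dimensions (512 codewords, a 4096-wide VLM embedding); the layer sizes and structure here are placeholders rather than the authors' design.

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 512    # assumed, consistent with the 9-bit-per-frame indices
VLM_EMBED_DIM = 4096   # assumed token-embedding width of the frozen VLM


class MotionTokenProjector(nn.Module):
    """Maps received codebook indices into the VLM's token-embedding space."""

    def __init__(self, codebook_size: int, vlm_dim: int, hidden: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, hidden)
        self.proj = nn.Sequential(nn.Linear(hidden, vlm_dim), nn.GELU(), nn.Linear(vlm_dim, vlm_dim))

    def forward(self, indices: torch.LongTensor) -> torch.Tensor:
        # indices: (batch, frames) of received codes -> (batch, frames, vlm_dim)
        return self.proj(self.embed(indices))


projector = MotionTokenProjector(CODEBOOK_SIZE, VLM_EMBED_DIM)
received = torch.randint(0, CODEBOOK_SIZE, (1, 120))  # e.g. 4 s of tokens at 30 fps
motion_embeddings = projector(received)               # fed to the frozen VLM alongside the text prompt
print(motion_embeddings.shape)                        # torch.Size([1, 120, 4096])
```

In the paper's framing only lightweight components are trained during instruction tuning while the VLM stays frozen; the sketch reflects that by keeping the VLM out of the trainable module entirely.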

If this is right

  • Uplink payload falls to roughly 1 percent of conventional video-codec streams.
  • End-to-end latency drops to about 20 percent of codec-based pipelines.
  • Action understanding accuracy stays comparable on three public benchmarks.
  • Raw video never leaves the edge device, eliminating the main privacy vector.
  • The projector is trained efficiently through instruction tuning rather than full model retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token pipeline could be tested on related edge tasks such as fall detection or gesture recognition without redesigning the transmission layer.
  • Because the tokens are discrete and short, they may tolerate packet loss better than compressed video frames if simple repetition or forward-error correction is added (a toy repetition sketch follows this list).
  • Domain-specific codebooks trained only on expected actions could shrink the index size even further for narrow deployments.
  • Measuring token robustness under realistic wireless packet errors would show whether the latency gains survive imperfect channels.
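
A toy version of the repetition-and-majority-vote idea raised above, purely illustrative; the channel model and probabilities are assumptions, not measurements from the paper.

```python
from collections import Counter
import random

CODEBOOK_SIZE = 512
rng = random.Random(0)


def noisy_copy(idx: int, loss_prob: float = 0.2, corrupt_prob: float = 0.05):
    """One transmitted copy of a 9-bit index: lost (None), corrupted, or intact."""
    r = rng.random()
    if r < loss_prob:
        return None
    if r < loss_prob + corrupt_prob:
        return rng.randrange(CODEBOOK_SIZE)
    return idx


def send_with_repetition(indices, repeats: int = 3):
    """Repeat each index and majority-vote over whatever arrives at the cloud."""
    recovered = []
    for idx in indices:
        arrived = [c for c in (noisy_copy(idx) for _ in range(repeats)) if c is not None]
        if arrived:
            recovered.append(Counter(arrived).most_common(1)[0][0])
        else:
            recovered.append(recovered[-1] if recovered else 0)   # hold the last token
    return recovered


sent = [17, 402, 402, 88, 256]
print(send_with_repetition(sent))
```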

Load-bearing premise

The discrete motion tokens produced by the VQ-VAE, once aligned by the projector, still contain enough semantic detail for the vision-language model to reach action-understanding accuracy close to what richer video features would allow.

What would settle it

Run the same vision-language model on a benchmark dataset both with the full original video and with only the 9-bit-per-frame token sequence; if recognition accuracy drops by more than a few percentage points under the token-only path, the comparability claim fails.
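
The decisive experiment above can be phrased as a small harness; every name below (the benchmark iterable, the VLM call, the pose tokenizer) is a hypothetical stand-in rather than the paper's evaluation code.

```python
# Hypothetical A/B harness; `benchmark`, `vlm_answer`, and `tokenize_pose` are stand-ins.

def accuracy(preds, truths):
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)


def compare(benchmark, vlm_answer, tokenize_pose):
    """Same VLM, two inputs: full video vs the 9-bit-per-frame token sequence."""
    video_preds, token_preds, labels = [], [], []
    for video, question, label in benchmark:
        labels.append(label)
        video_preds.append(vlm_answer(question, video=video))
        token_preds.append(vlm_answer(question, motion_tokens=tokenize_pose(video)))
    return accuracy(video_preds, labels), accuracy(token_preds, labels)


# Dummy stand-ins so the harness runs end to end; swap in the real model and dataset.
dummy_bench = [("clip_0", "What is the person doing?", "walking"),
               ("clip_1", "What is the person doing?", "jumping")]
dummy_vlm = lambda question, **inputs: "walking"
dummy_tokenizer = lambda video: [17, 402, 88]
print(compare(dummy_bench, dummy_vlm, dummy_tokenizer))   # (0.5, 0.5) for the dummies
```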

Figures

Figures reproduced from arXiv: 2605.07354 by Cheng Yuan, Jiawei Shao, Jingyi Liu, Jun Zhang, Lijun He.

Figure 1: An example of the proposed task-oriented action understanding system […]
Figure 2: System framework of the proposed action understanding pipeline for edge-cloud co-inference. The edge side performs monocular human pose estimation […]
Figure 3: The rate-performance curves of different methods in action understanding benchmarks. Subfigures (a) and (b) show results on Motion-Bench and […]
Figure 4: Average inference latency per user request (the bar plot corresponding to the left y-axis) and benchmark score (the line plot corresponding to the right y-axis) […]
Figure 5: A qualitative comparison between the proposed TOAU method and traditional AV1 compression. Under a low data rate, the AV1-compressed video is […]
Figure 6: A qualitative comparison between the proposed TOAU method and […]
read the original abstract

The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1% and the system latency to around 20% compared to video codec-based solutions, while delivering comparable action understanding accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes TOAU, a task-oriented edge-cloud framework for human action understanding. A monocular pose estimator extracts joint coordinates from raw video; a VQ-VAE quantizes these into discrete motion tokens (reported as 9 bits per frame); only the codebook indices are transmitted. At the cloud, a lightweight projector aligns the token sequence to the embedding space of a vision-language model, which is instruction-tuned for action recognition. On three benchmarks the system is claimed to reduce transmission payload to ~1 % and end-to-end latency to ~20 % of conventional video-codec baselines while preserving comparable accuracy.

Significance. If the accuracy claim is substantiated, the work offers a concrete, privacy-preserving alternative to raw-video or feature transmission in bandwidth-limited edge scenarios. The combination of pose-based VQ-VAE compression with VLM instruction tuning is a timely contribution to task-oriented communication literature and could influence practical deployments in surveillance, robotics, and assisted living.

major comments (3)
  1. [Evaluation section] §4 (or equivalent evaluation section): the headline claim of 'comparable action understanding accuracy' is not supported by the reported evidence. No ablation tables or figures quantify accuracy versus VQ-VAE codebook size, tokens per frame, or reconstruction error on held-out actions; likewise, no comparison is shown between the projector-aligned tokens and either raw continuous pose or richer video features. Without these controls the central performance assertion cannot be evaluated.
  2. [Methods / VQ-VAE subsection] §3.2 (VQ-VAE description): the manuscript states that 9-bit-per-frame indices suffice, yet provides neither the codebook size nor the bit-allocation scheme used to reach this figure, nor any quantitative reconstruction metric (e.g., MPJPE or joint-velocity error) on the three target benchmarks. This information is load-bearing for the claim that kinematic semantics are preserved.
  3. [Evaluation / Baseline subsection] §4 (baseline comparison): the latency and payload reductions are reported relative to 'video codec-based solutions,' but the exact codecs, encoding parameters, and transmission protocols are not specified, nor are statistical error bars or multiple-run averages provided. These omissions prevent verification of the 1 % / 20 % figures.
minor comments (3)
  1. [Abstract and §1] The abstract and introduction refer to 'three benchmarks' without naming them; the first paragraph of the evaluation section should list the datasets explicitly.
  2. [§3.3] Notation for the projector alignment (e.g., the mapping from discrete indices to VLM token embeddings) is introduced without an equation number; adding a numbered equation would improve traceability.
  3. [Figure 1] Figure 1 (system diagram) would benefit from explicit bit-rate annotations on the uplink arrow to match the 9-bit-per-frame claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have addressed each major point below and will revise the manuscript to incorporate the requested clarifications and additional results.

read point-by-point responses
  1. Referee: [Evaluation section] §4 (or equivalent evaluation section): the headline claim of 'comparable action understanding accuracy' is not supported by the reported evidence. No ablation tables or figures quantify accuracy versus VQ-VAE codebook size, tokens per frame, or reconstruction error on held-out actions; likewise, no comparison is shown between the projector-aligned tokens and either raw continuous pose or richer video features. Without these controls the central performance assertion cannot be evaluated.

    Authors: We agree that additional ablation studies and controls are needed to fully substantiate the claim of comparable accuracy. In the revised manuscript we will add tables and figures that report action-understanding accuracy versus VQ-VAE codebook size and number of tokens per frame, together with reconstruction error on held-out actions. We will also include direct comparisons of the projector-aligned tokens against both raw continuous pose coordinates and richer video features extracted from standard encoders. These additions will allow readers to evaluate the performance trade-offs more rigorously. revision: yes

  2. Referee: [Methods / VQ-VAE subsection] §3.2 (VQ-VAE description): the manuscript states that 9-bit-per-frame indices suffice, yet provides neither the codebook size nor the bit-allocation scheme used to reach this figure, nor any quantitative reconstruction metric (e.g., MPJPE or joint-velocity error) on the three target benchmarks. This information is load-bearing for the claim that kinematic semantics are preserved.

    Authors: We thank the referee for highlighting this omission. The revised §3.2 will explicitly state the codebook size and the bit-allocation scheme that yields the reported 9 bits per frame. We will also add quantitative reconstruction metrics (MPJPE and joint-velocity error) evaluated on the three target benchmarks to demonstrate that kinematic semantics are preserved after quantization. revision: yes

  3. Referee: [Evaluation / Baseline subsection] §4 (baseline comparison): the latency and payload reductions are reported relative to 'video codec-based solutions,' but the exact codecs, encoding parameters, and transmission protocols are not specified, nor are statistical error bars or multiple-run averages provided. These omissions prevent verification of the 1 % / 20 % figures.

    Authors: We acknowledge that precise specification of the baselines is required for verification. In the revised manuscript we will detail the exact video codecs employed, their encoding parameters, and the transmission protocols assumed. We will also report statistical error bars and averages computed over multiple runs to support the stated 1 % payload and 20 % latency reductions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system design with independent benchmark validation

full rationale

The paper presents a practical task-oriented communication pipeline (pose estimation → VQ-VAE tokenization → projector alignment → VLM inference) evaluated on three external benchmarks. No equations, uniqueness theorems, or predictions are shown that reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations. All performance claims (payload/latency reduction, comparable accuracy) rest on direct empirical measurements rather than tautological renaming or ansatz smuggling. The derivation chain is a sequence of standard ML components whose outputs are independently verifiable against held-out data.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central performance claims rest on the effectiveness of standard pose estimation and VQ-VAE compression for preserving action semantics, plus successful alignment of the resulting tokens to an existing VLM; no new entities are postulated and free parameters such as codebook size are chosen rather than derived.

free parameters (1)
  • VQ-VAE codebook size and bit allocation
    Determines the discrete token vocabulary and per-frame bit rate (stated as 9 bits); chosen to trade off compression against information retention for downstream action tasks (a small enumeration of this trade-off follows the ledger).
axioms (2)
  • domain assumption A monocular pose estimator can reliably extract continuous joint coordinates from raw video frames at the edge
    Invoked as the first processing step to avoid transmitting raw video pixels.
  • domain assumption The discrete motion tokens retain sufficient information for the cloud VLM to perform complex action understanding after projector alignment
    This underpins the claim that accuracy remains comparable despite extreme compression.
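
A tiny enumeration of the trade-off flagged in the free-parameter entry; the token-per-frame count and codebook sizes are assumptions (the paper states only the 9-bit figure, which at one token per frame would correspond to a 512-entry codebook).

```python
import math

# Illustrative only: how codebook size K sets the per-frame bit rate.
TOKENS_PER_FRAME = 1  # assumed
for codebook_size in (64, 128, 256, 512, 1024, 4096):
    bits = TOKENS_PER_FRAME * math.ceil(math.log2(codebook_size))
    print(f"K={codebook_size:>4}  ->  {bits:>2} bits/frame")
```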

pith-pipeline@v0.9.0 · 5517 in / 1677 out tokens · 48984 ms · 2026-05-11T02:00:23.575788+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

