arxiv: 2512.14234 · v2 · submitted 2025-12-16 · 💻 cs.CV

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Pith reviewed 2026-05-16 22:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords conversational agent · 3D virtual body · multimodal generation · co-speech motion · mixture of experts · speech-language-behavior model · agentic interaction

The pith

ViBES builds a 3D conversational agent that jointly plans language, prosody, and body movements from speech or text inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViBES as a speech-language-behavior model that moves past isolated text-to-motion or co-speech gesture generation. It uses a mixture-of-modality-experts transformer where speech, facial, and body experts process interleaved tokens with hard routing per modality and cross-expert attention to share context. This design lets the agent decide when to move, adapt across dialogue turns, and respond to mixed user inputs like spoken words, typed text, or mid-conversation body directives. The result is measured through improved alignment between generated dialogue and 3D actions on multi-turn benchmarks.
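As a rough illustration of what an "interleaved token stream" could look like, the sketch below merges three per-modality token streams into one time-ordered sequence. The `Token` fields, the timestamp-based ordering, and the tie-breaking rule are assumptions made for this example; the summary above does not say how ViBES orders its tokens.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    time: float     # start time in seconds (assumed timestamping scheme)
    modality: str   # "speech", "face", or "body"
    token_id: int   # index into that modality's codebook (hypothetical)


def interleave(*streams: Iterable[Token]) -> List[Token]:
    """Merge per-modality streams, each already sorted by time, into one sequence.

    Tokens with equal timestamps keep the order in which their streams were passed.
    """
    return list(heapq.merge(*streams, key=lambda t: t.time))


speech = [Token(0.0, "speech", 11), Token(0.5, "speech", 12)]
face = [Token(0.0, "face", 3), Token(0.25, "face", 4)]
body = [Token(0.25, "body", 7)]

sequence = interleave(speech, face, body)
for tok in sequence:
    print(tok.time, tok.modality, tok.token_id)
```

Each token keeps its modality tag after merging, which is the information a hard-routing model would need in order to dispatch the token to the matching expert.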

Core claim

ViBES jointly generates language and 3D body actions by processing interleaved multimodal token streams through modality-partitioned transformer experts connected by cross-expert attention, enabling agentic planning of when and how to act during conversation rather than mapping fixed utterances to motion clips.

What carries the argument

Mixture-of-modality-experts (MoME) backbone that applies hard routing by modality to separate transformer experts for speech, facial expression, and body motion while sharing information via cross-expert attention on interleaved token streams.
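The mechanism described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the authors' implementation: it assumes each expert owns its own query, key, value, and feed-forward weights, and that "cross-expert attention" means a single causal attention pass over the whole interleaved sequence. The class name `ToyMoMELayer`, the hidden size, and the random weights are all hypothetical.

```python
import math
import random

MODALITIES = ("speech", "face", "body")  # expert names assumed from the summary
DIM = 4  # toy hidden size, for illustration only


def make_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMoMELayer:
    """One toy layer: per-modality weights plus one joint causal attention."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Hard routing: every modality owns a separate parameter set; the
        # modality tag selects the expert, with no learned gate.
        self.params = {
            m: {name: make_matrix(rng, DIM, DIM) for name in ("q", "k", "v", "ffn")}
            for m in MODALITIES
        }

    def forward(self, tokens):
        """tokens: list of (modality, vector) pairs in interleaved order."""
        q = [matvec(self.params[m]["q"], x) for m, x in tokens]
        k = [matvec(self.params[m]["k"], x) for m, x in tokens]
        v = [matvec(self.params[m]["v"], x) for m, x in tokens]
        outputs = []
        for i, (m, x) in enumerate(tokens):
            # Shared attention: token i attends to itself and all earlier
            # tokens, whichever expert produced their keys and values.
            scores = [
                sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(DIM)
                for j in range(i + 1)
            ]
            weights = softmax(scores)
            mixed = [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(DIM)]
            hidden = [a + b for a, b in zip(x, mixed)]  # residual connection
            ffn_out = matvec(self.params[m]["ffn"], hidden)  # expert-specific feed-forward
            outputs.append((m, [h + f for h, f in zip(hidden, ffn_out)]))
        return outputs


layer = ToyMoMELayer(seed=0)
tokens = [("speech", [1, 0, 0, 0]), ("body", [0, 1, 0, 0]), ("face", [0, 0, 1, 0])]
outputs = layer.forward(tokens)
```

In this toy, the only path by which a speech token can influence a body token is the shared attention step, which is the dependency the load-bearing premise below turns on.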

Load-bearing premise

Hard routing by modality plus cross-expert attention on interleaved tokens is enough to keep language and body actions coherent across multiple dialogue turns without losing cross-modal context.

What would settle it

A multi-turn dialogue test where the agent produces body motions that contradict the spoken content or timing after three or more turns, showing loss of joint planning.

Figures

Figures reproduced from arXiv: 2512.14234 by Ali Sartaz Khan, Changan Chen, Ehsan Adeli, Heng Yu, Juze Zhang, Shrinidhi K. Lakshmikanth, Tiange Xiang, Xin Chen.

Figure 1: We present a novel speech–language–behavior (SLB) model with a mixture–of–modality–experts (MoME) architecture that
Figure 2: Model overview. The model adopts an autoregressive structure that converts all modalities into a unified token space. It consists
Figure 3: Qualitative examples of conversational behavior. We
Figure 4: Qualitative comparison with prior methods on the
Figure 5: Qualitative comparison for text-to-motion. While
Figure 6: Overview of our YouTube data processing pipeline. The
Figure 7: Application: driving video generation with ViBES. We
Figure 8: Word cloud visualization of our Converse3D data from
Figure 9: Additional qualitative comparisons with prior methods
Figure 10: Additional qualitative examples for text-to-motion generation. Given a text caption, we compare the 3D motion generated by our
Figure 11: System prompt for motion-conditioned conversational answer generation. We instruct the model to generate answers as if the
Figure 12: System Prompt for Conversational Agent Behavior Evaluation. We instruct the model to assess semantic alignment, content–motion match, and social appropriateness on videos rendered from motion sequences.
Original abstract

Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task (co-speech gesture or text-to-motion) that maps a fixed utterance to motion clips, without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViBES, a conversational 3D agent based on a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone. Modality-partitioned transformer experts handle speech, facial expression, and body motion on interleaved token streams using hard routing by modality and cross-expert attention. The system jointly plans language and movement for multi-turn dialogue, supports mixed-initiative input, and claims consistent gains over co-speech gesture and text-to-motion baselines on dialogue-motion alignment and behavior quality metrics, advancing beyond isolated translation tasks toward agentic virtual bodies.

Significance. If the empirical results hold, the work would advance integrated multimodal conversational agents by combining pretrained speech-language components with controllable 3D behavior generation, addressing brittle timing and fragmented modality stacks in prior systems.

major comments (2)
  1. [Abstract] Abstract: the claim of 'consistent gains over strong co-speech and text-to-motion baselines' on dialogue-motion alignment metrics is unsupported by any numerical values, error bars, data-split details, or baseline implementation descriptions, which is load-bearing for the central superiority claim.
  2. [Model description] Model section (MoME backbone): hard routing splits parameters per expert while cross-expert attention is the sole sharing mechanism; no ablations on routing or long-horizon multi-turn coherence metrics are reported, leaving unverified whether this suffices for joint language-body planning without context loss.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'Code and data will be made available' should include a specific repository URL or DOI for reproducibility.
  2. [Introduction] The terms 'controllable behavior hooks' and 'streaming responses' are introduced without precise definitions or interface specifications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the focus on strengthening the empirical claims and model analysis. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent gains over strong co-speech and text-to-motion baselines' on dialogue-motion alignment metrics is unsupported by any numerical values, error bars, data-split details, or baseline implementation descriptions, which is load-bearing for the central superiority claim.

    Authors: We agree that the abstract should explicitly support the superiority claim with quantitative details. In the revised manuscript, we will update the abstract to include specific numerical gains on dialogue-motion alignment metrics (drawn from the results in Section 4), along with error bars, data-split information, and pointers to the baseline implementation details provided in the supplementary material. The full experimental comparisons, including all metrics and baselines, remain unchanged in the body of the paper.
    Revision: yes

  2. Referee: [Model description] Model section (MoME backbone): hard routing splits parameters per expert while cross-expert attention is the sole sharing mechanism; no ablations on routing or long-horizon multi-turn coherence metrics are reported, leaving unverified whether this suffices for joint language-body planning without context loss.

    Authors: The MoME backbone employs hard routing by modality to partition parameters for efficiency while relying on cross-expert attention for inter-modality information sharing during joint language-body planning. We acknowledge the value of ablations; however, the current work prioritizes end-to-end system evaluation over isolated routing studies. We will expand the model section with additional justification for the design and include any long-horizon coherence metrics already computed as part of our multi-turn dialogue experiments. Comprehensive routing ablations are not added at this stage due to computational scope.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity in architectural description or empirical evaluation

full rationale

The paper describes a multimodal SLB model with MoME backbone built from pretrained speech-language components, using hard routing and cross-expert attention for interleaved tokens. All claims rest on empirical benchmarks for dialogue-motion alignment rather than any mathematical derivations, fitted parameters renamed as predictions, or self-citation chains. No equations appear that reduce outputs to inputs by construction, and the architecture is presented as an engineering composition evaluated externally. This matches the default expectation of a self-contained system description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract relies on the existence of strong pretrained speech-language models and assumes that modality-partitioned experts with cross-attention can integrate interleaved streams; no explicit free parameters, new axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5643 in / 1149 out tokens · 67431 ms · 2026-05-16T22:00:49.073235+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:
  matches: The paper's claim is directly supported by a theorem in the formal canon.
  supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  uses: The paper appears to rely on the theorem as machinery.
  contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control

    cs.GR 2026-05 unverdicted novelty 6.0

    A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.

  2. IAM: Identity-Aware Human Motion and Shape Joint Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

  3. PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

    cs.RO 2026-01 unverdicted novelty 6.0

    PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 3 Pith papers · 26 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716–23736, 2022

  3. [3]

    Vlmo: Unified vision- language pre-training with mixture-of-modality-experts

    Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision- language pre-training with mixture-of-modality-experts. Advances in neural information processing systems, 35: 32897–32912, 2022

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision- language-action flow model for general robot control. corr, abs/2410.24164, 2024. doi: 10.48550.arXiv preprint ARXIV .2410.24164

  5. [5]

    Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025

    Boson AI. Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025. GitHub repository. Release blog available athttps://www.boson.ai/blog/ higgs-audio-v2

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale.arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020

  8. [8]

    Digital life project: Au- tonomous 3d characters with social intelligence

    Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, et al. Digital life project: Au- tonomous 3d characters with social intelligence. InCVPR, pages 582–592, 2024

  9. [9]

    Enabling synergistic full-body control in prompt-based co-speech motion generation

    Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, and Kun Zhou. Enabling synergistic full-body control in prompt-based co-speech motion generation. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6774–6783, 2024

  10. [10]

    The language of motion: Unifying verbal and non-verbal language of 3d human motion

    Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 6200–6211, 2025

  11. [11]

    Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025

    Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, et al. Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025

  12. [12]

    Rapverse: Coherent vocals and whole-body motion generation from text

    Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Zixin Wang, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, and Chuang Gan. Rapverse: Coherent vocals and whole-body motion generation from text. InICCV, pages 10097–10107, 2025

  13. [13]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InCVPR, pages 18000–18010, 2023

  14. [14]

    Artalk: Speech- driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323,

    Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech-driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323, 2025

  15. [15]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

  16. [16]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  17. [17]

    Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar

    Peishan Cong, Yiteng Xu, Yiming Ren, Juze Zhang, Lan Xu, Jingya Wang, Jingyi Yu, and Yuexin Ma. Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 461–469, 2023

  18. [18]

    Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025

    Radek Dan ˇeˇcek, Carolin Schmitt, Senya Polikovsky, and Michael J Black. Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025

  19. [19]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre D ´efossez, Laurent Mazar´e, Manu Orsini, Am´elie Royer, Patrick P ´erez, Herv ´e J ´egou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model 9 for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

  20. [20]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multi- modal pretraining.arXiv preprint arXiv:2505.14683, 2025

  21. [21]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model. 2023

  22. [22]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text- to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024

  23. [23]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xi- ang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable stream- ing speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024

  24. [24]

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, et al. Cosyvoice 3: Towards in-the-wild speech gen- eration via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025

  25. [25]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  26. [26]

    Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model

    Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model. InECCV, pages 204–221. Springer, 2024

  27. [27]

    Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos

    Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos, 2022

  28. [28]

    Zeroeggs: Zero-shot example-based gesture generation from speech

    Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023

  29. [29]

    Duetgen: Music driven two-person dance generation via hierarchical masked modeling

    Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. Duetgen: Music driven two-person dance generation via hierarchical masked modeling. InProceed- ings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

  30. [30]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

  31. [31]

    Humans in 4D: Reconstructing and tracking humans with transformers

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023

  32. [32]

    Generating diverse and natural 3d human motions from text

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InCVPR, pages 5152–5161, 2022

  33. [33]

    Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts

    Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts. InECCV, pages 580–597. Springer, 2022

  34. [34]

    Momask: Generative masked mod- eling of 3d human motions

    Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked mod- eling of 3d human motions. InCVPR, pages 1900–1910, 2024

  35. [35]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Live- portrait: Efficient portrait animation with stitching and re- targeting control.arXiv preprint arXiv:2407.03168, 2024

  36. [36]

    Learning speech-driven 3d conversational gestures from video

    Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed El- gharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent vir- tual agents, pages 101–108, 2021

  37. [37]

    Video-bench: Human-aligned video gen- eration benchmark

    Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video gen- eration benchmark. InCVPR, pages 18858–18868, 2025

  38. [38]

    Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025

    Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, and Shiguang Shan. Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025

  39. [39]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

  40. [40]

    Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

    Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Min- grui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction.arXiv preprint arXiv:2502.11946, 2025

  41. [41]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, pages 21807–21818, 2024

  42. [42]

    Beat-it: Beat- synchronized multi-condition 3d dance generation

    Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, and Shengfeng He. Beat-it: Beat- synchronized multi-condition 3d dance generation. InEu- ropean conference on computer vision, pages 273–290. Springer, 2024

  43. [43]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 10

  44. [44]

    Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023

  45. [45]

    Loopy: Taming audio- driven portrait avatar with long-term motion dependency

    Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio- driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024

  46. [46]

    Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters

    Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InCVPR, 2025

  47. [47]

    Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

    Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, and Mingyuan Gao. Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

  48. [48]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  49. [49]

    Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis

    Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772, 2019

  50. [50]

    Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders

    Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, and Linchao Bao. Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders. InICCV, pages 11293–11302, 2021

  51. [51]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023

  52. [52]

    Genmo: A GENer- alist model for human MOtion

    Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A GENer- alist model for human MOtion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  53. [53]

    Ross, and Angjoo Kanazawa

    Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021

  54. [54]

    Finedance: A fine-grained choreography dataset for 3d full body dance generation

    Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. In ICCV, pages 10234–10243, 2023

  55. [55]

    Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017

  56. [56]

    Infinityhuman: Towards long-term audio-driven human

    Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, and Zehuan Yuan. Infinityhuman: Towards long-term audio-driven human. arXiv preprint arXiv:2508.20210, 2025

  57. [57]

    Llava-slt: Visual language tuning for sign language translation

    Han Liang, Chengyu Huang, Yuecheng Xu, Cheng Tang, Weicai Ye, Juze Zhang, Xin Chen, Jingyi Yu, and Lan Xu. Llava-slt: Visual language tuning for sign language translation. arXiv preprint arXiv:2412.16524, 2024

  58. [58]

    Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

  59. [59]

    Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models

    Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13847–13858, 2025

  60. [60]

    The quest for generalizable motion generation: Data, model, and evaluation

    Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, et al. The quest for generalizable motion generation: Data, model, and evaluation. arXiv preprint arXiv:2510.26794, 2025

  61. [61]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  62. [62]

    Disco: Disentangled implicit content and rhythm learning for diverse co-speech gestures synthesis

    Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Disco: Disentangled implicit content and rhythm learning for diverse co-speech gestures synthesis. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3764–3773, 2022

  63. [63]

    Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In ECCV, pages 612–630. Springer, 2022

  64. [64]

    Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. In ECCV, 2022

  65. [65]

    Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. In CVPR, 2024

  66. [66]

    Mimicparts: Part-aware style injection for speech-driven 3d motion generation

    Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, and Xiangmin Xu. Mimicparts: Part-aware style injection for speech-driven 3d motion generation. arXiv preprint arXiv:2510.13208, 2025

  67. [67]

    Mosa: Motion generation with scalable autoregressive modeling

    Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui-Bin Bian, and Hong Liu. Mosa: Motion generation with scalable autoregressive modeling. arXiv preprint arXiv:2511.01200, 2025

  68. [68]

    Learning hierarchical cross-modal association for co-speech gesture generation

    Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and Bolei Zhou. Learning hierarchical cross-modal association for co-speech gesture generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10462–10472, 2022

  69. [69]

    GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

    Xinran Liu, Xu Dong, Diptesh Kanojia, Wenwu Wang, and Zhenhua Feng. Gcdance: Genre-controlled 3d full body dance generation driven by music. arXiv preprint arXiv:2502.18309, 2025

  70. [70]

    Dgfm: Full body dance generation driven by music foundation models

    Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. Dgfm: Full body dance generation driven by music foundation models. arXiv preprint arXiv:2502.20176, 2025

  71. [71]

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015

  72. [72]

    Diversemotion: Towards diverse human motion generation via discrete diffusion

    Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, and Yi Yang. Diversemotion: Towards diverse human motion generation via discrete diffusion. arXiv preprint arXiv:2309.01372, 2023

  73. [73]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019

  74. [74]

    Build llm-based zero-shot streaming tts system with cosyvoice

    Xiang Lyu, Yuxuan Wang, Tianyu Zhao, Hao Wang, Huadai Liu, and Zhihao Du. Build llm-based zero-shot streaming tts system with cosyvoice. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2025

  75. [75]

    Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation

    Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your-emoji-faster: Towards efficient, fine-controllable, and expressive freestyle portrait animation. arXiv preprint arXiv:2509.16630, 2025

  76. [76]

    AMASS: Archive of motion capture as surface shapes

    Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, 2019

  77. [77]

    Embody 3d: A large-scale multimodal motion and behavior dataset

    Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset. arXiv preprint arXiv:2510.16258, 2025

  78. [78]

    Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis

    Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational diffusion for co-speech gesture synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1388–1398, 2024

  79. [79]

    Multimodal contrastive learning with limoe: the language-image mixture of experts

    Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems, 35:9564–9576, 2022

  80. [80]

    Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

    Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, et al. Gesture generation (still) needs improved human evaluation practices: Insights from a community-driven state-of-the-art benchmark. arXiv preprint arXiv:2511.01233, 2025

Showing first 80 references.