ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Pith reviewed 2026-05-16 22:00 UTC · model grok-4.3
The pith
ViBES builds a 3D conversational agent that jointly plans language, prosody, and body movements from speech or text inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViBES jointly generates language and 3D body actions by processing interleaved multimodal token streams through modality-partitioned transformer experts connected by cross-expert attention, enabling agentic planning of when and how to act during conversation rather than mapping fixed utterances to motion clips.
What carries the argument
Mixture-of-modality-experts (MoME) backbone that applies hard routing by modality to separate transformer experts for speech, facial expression, and body motion while sharing information via cross-expert attention on interleaved token streams.
Load-bearing premise
Hard routing by modality plus cross-expert attention on interleaved tokens is enough to keep language and body actions coherent across multiple dialogue turns without losing cross-modal context.
What would settle it
A multi-turn dialogue test where the agent produces body motions that contradict the spoken content or timing after three or more turns, showing loss of joint planning.
Figures
read the original abstract
Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task co-speech gesture or text-to-motion that maps a fixed utterance to motion clips-without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ViBES, a conversational 3D agent based on a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone. Modality-partitioned transformer experts handle speech, facial expression, and body motion on interleaved token streams using hard routing by modality and cross-expert attention. The system jointly plans language and movement for multi-turn dialogue, supports mixed-initiative input, and claims consistent gains over co-speech gesture and text-to-motion baselines on dialogue-motion alignment and behavior quality metrics, advancing beyond isolated translation tasks toward agentic virtual bodies.
Significance. If the empirical results hold, the work would advance integrated multimodal conversational agents by combining pretrained speech-language components with controllable 3D behavior generation, addressing brittle timing and fragmented modality stacks in prior systems.
major comments (2)
- [Abstract] Abstract: the claim of 'consistent gains over strong co-speech and text-to-motion baselines' on dialogue-motion alignment metrics is unsupported by any numerical values, error bars, data-split details, or baseline implementation descriptions, which is load-bearing for the central superiority claim.
- [Model description] Model section (MoME backbone): hard routing splits parameters per expert while cross-expert attention is the sole sharing mechanism; no ablations on routing or long-horizon multi-turn coherence metrics are reported, leaving unverified whether this suffices for joint language-body planning without context loss.
minor comments (2)
- [Abstract] Abstract: the phrase 'Code and data will be made available' should include a specific repository URL or DOI for reproducibility.
- [Introduction] The terms 'controllable behavior hooks' and 'streaming responses' are introduced without precise definitions or interface specifications.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We appreciate the focus on strengthening the empirical claims and model analysis. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'consistent gains over strong co-speech and text-to-motion baselines' on dialogue-motion alignment metrics is unsupported by any numerical values, error bars, data-split details, or baseline implementation descriptions, which is load-bearing for the central superiority claim.
Authors: We agree that the abstract should explicitly support the superiority claim with quantitative details. In the revised manuscript, we will update the abstract to include specific numerical gains on dialogue-motion alignment metrics (drawn from the results in Section 4), along with error bars, data-split information, and pointers to the baseline implementation details provided in the supplementary material. The full experimental comparisons, including all metrics and baselines, remain unchanged in the body of the paper. revision: yes
-
Referee: [Model description] Model section (MoME backbone): hard routing splits parameters per expert while cross-expert attention is the sole sharing mechanism; no ablations on routing or long-horizon multi-turn coherence metrics are reported, leaving unverified whether this suffices for joint language-body planning without context loss.
Authors: The MoME backbone employs hard routing by modality to partition parameters for efficiency while relying on cross-expert attention for inter-modality information sharing during joint language-body planning. We acknowledge the value of ablations; however, the current work prioritizes end-to-end system evaluation over isolated routing studies. We will expand the model section with additional justification for the design and include any long-horizon coherence metrics already computed as part of our multi-turn dialogue experiments. Comprehensive routing ablations are not added at this stage due to computational scope. revision: partial
Circularity Check
No significant circularity in architectural description or empirical evaluation
full rationale
The paper describes a multimodal SLB model with MoME backbone built from pretrained speech-language components, using hard routing and cross-expert attention for interleaved tokens. All claims rest on empirical benchmarks for dialogue-motion alignment rather than any mathematical derivations, fitted parameters renamed as predictions, or self-citation chains. No equations appear that reduce outputs to inputs by construction, and the architecture is presented as an engineering composition evaluated externally. This matches the default expectation of a self-contained system description.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ViBES is a speech–language–behavior (SLB) model with a mixture–of–modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_strictMono_of_one_lt unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We standardize on a 25 fps universal clock... fractional index... Rotary positional encoding.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.
-
IAM: Identity-Aware Human Motion and Shape Joint Generation
IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.
-
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716–23736, 2022
work page 2022
-
[3]
Vlmo: Unified vision- language pre-training with mixture-of-modality-experts
Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision- language pre-training with mixture-of-modality-experts. Advances in neural information processing systems, 35: 32897–32912, 2022
work page 2022
-
[4]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision- language-action flow model for general robot control. corr, abs/2410.24164, 2024. doi: 10.48550.arXiv preprint ARXIV .2410.24164
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Boson AI. Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025. GitHub repository. Release blog available athttps://www.boson.ai/blog/ higgs-audio-v2
work page 2025
-
[6]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale.arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020
work page 1901
-
[8]
Digital life project: Au- tonomous 3d characters with social intelligence
Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, et al. Digital life project: Au- tonomous 3d characters with social intelligence. InCVPR, pages 582–592, 2024
work page 2024
-
[9]
Enabling synergistic full-body control in prompt-based co-speech motion generation
Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, and Kun Zhou. Enabling synergistic full-body control in prompt-based co-speech motion generation. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6774–6783, 2024
work page 2024
-
[10]
The language of motion: Unifying verbal and non-verbal language of 3d human motion
Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 6200–6211, 2025
work page 2025
-
[11]
Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, et al. Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025
-
[12]
Rapverse: Coherent vocals and whole-body motion generation from text
Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Zixin Wang, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, and Chuang Gan. Rapverse: Coherent vocals and whole-body motion generation from text. InICCV, pages 10097–10107, 2025
work page 2025
-
[13]
Executing your commands via motion diffusion in latent space
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InCVPR, pages 18000–18010, 2023
work page 2023
-
[14]
Artalk: Speech- driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323,
Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech-driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323, 2025
-
[15]
Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Peishan Cong, Yiteng Xu, Yiming Ren, Juze Zhang, Lan Xu, Jingya Wang, Jingyi Yu, and Yuexin Ma. Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 461–469, 2023
work page 2023
-
[18]
Radek Dan ˇeˇcek, Carolin Schmitt, Senya Polikovsky, and Michael J Black. Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025
-
[19]
Moshi: a speech-text foundation model for real-time dialogue
Alexandre D ´efossez, Laurent Mazar´e, Manu Orsini, Am´elie Royer, Patrick P ´erez, Herv ´e J ´egou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model 9 for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Emerging Properties in Unified Multimodal Pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multi- modal pretraining.arXiv preprint arXiv:2505.14683, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Palm-e: An embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model. 2023
work page 2023
-
[22]
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text- to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xi- ang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable stream- ing speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, et al. Cosyvoice 3: Towards in-the-wild speech gen- eration via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024
work page 2024
-
[26]
Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model
Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model. InECCV, pages 204–221. Springer, 2024
work page 2024
-
[27]
Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos, 2022
work page 2022
-
[28]
Zeroeggs: Zero-shot example-based gesture generation from speech
Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023
work page 2023
-
[29]
Duetgen: Music driven two-person dance generation via hierarchical masked modeling
Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. Duetgen: Music driven two-person dance generation via hierarchical masked modeling. InProceed- ings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025
work page 2025
-
[30]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Humans in 4D: Reconstructing and tracking humans with transformers
Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023
work page 2023
-
[32]
Generating diverse and natural 3d human motions from text
Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InCVPR, pages 5152–5161, 2022
work page 2022
-
[33]
Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts. InECCV, pages 580–597. Springer, 2022
work page 2022
-
[34]
Momask: Generative masked mod- eling of 3d human motions
Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked mod- eling of 3d human motions. InCVPR, pages 1900–1910, 2024
work page 1900
-
[35]
Liveportrait: Efficient portrait animation with stitching and retargeting control
Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Live- portrait: Efficient portrait animation with stitching and re- targeting control.arXiv preprint arXiv:2407.03168, 2024
-
[36]
Learning speech-driven 3d conversational gestures from video
Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed El- gharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent vir- tual agents, pages 101–108, 2021
work page 2021
-
[37]
Video-bench: Human-aligned video gen- eration benchmark
Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video gen- eration benchmark. InCVPR, pages 18858–18868, 2025
work page 2025
-
[38]
Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, and Shiguang Shan. Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025
-
[39]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021
work page 2021
-
[40]
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Min- grui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction.arXiv preprint arXiv:2502.11946, 2025
work page internal anchor Pith review arXiv 2025
-
[41]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, pages 21807–21818, 2024
work page 2024
-
[42]
Beat-it: Beat- synchronized multi-condition 3d dance generation
Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, and Shengfeng He. Beat-it: Beat- synchronized multi-condition 3d dance generation. InEu- ropean conference on computer vision, pages 273–290. Springer, 2024
work page 2024
-
[43]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023
Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023
work page 2023
-
[45]
Loopy: Taming audio- driven portrait avatar with long-term motion dependency
Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio- driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024
-
[46]
Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InCVPR, 2025
work page 2025
-
[47]
Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, and Mingyuan Gao. Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025
-
[48]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772, 2019
work page 2019
-
[50]
Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, and Linchao Bao. Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders. InICCV, pages 11293–11302, 2021
work page 2021
-
[51]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023
work page 2023
-
[52]
Genmo: A GENer- alist model for human MOtion
Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A GENer- alist model for human MOtion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
work page 2025
-
[53]
Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021
work page 2021
-
[54]
Finedance: A fine-grained choreography dataset for 3d full body dance generation
Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InICCV, pages 10234–10243, 2023
work page 2023
-
[55]
Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017
work page 2017
-
[56]
Infinityhuman: Towards long-term audio-driven hu- man.arXiv preprint arXiv:2508.20210, 2025
Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, and Zehuan Yuan. Infinityhuman: Towards long-term audio-driven hu- man.arXiv preprint arXiv:2508.20210, 2025
-
[57]
Han Liang, Chengyu Huang, Yuecheng Xu, Cheng Tang, Weicai Ye, Juze Zhang, Xin Chen, Jingyi Yu, and Lan Xu. Llava-slt: Visual language tuning for sign language transla- tion.arXiv preprint arXiv:2412.16524, 2024
-
[58]
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024
work page internal anchor Pith review arXiv 2024
-
[59]
Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models
Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 13847– 13858, 2025
work page 2025
-
[60]
arXiv preprint arXiv:2510.26794 , year=
Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qing- ping Sun, et al. The quest for generalizable motion gen- eration: Data, model, and evaluation.arXiv preprint arXiv:2510.26794, 2025
-
[61]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[62]
Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Disco: Dis- entangled implicit content and rhythm learning for di- verse co-speech gestures synthesis. InProceedings of the 30th ACM International Conference on Multimedia, pages 3764–3773, 2022
work page 2022
-
[63]
Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis
Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, pages 612–630. Springer, 2022
work page 2022
-
[64]
Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis
Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, 2022
work page 2022
-
[65]
Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InCVPR, 2024
work page 2024
-
[66]
Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, and Xiangmin Xu. Mimicparts: Part-aware style injec- tion for speech-driven 3d motion generation.arXiv preprint arXiv:2510.13208, 2025
-
[67]
Mosa: Motion generation with scalable autoregressive modeling.arXiv preprint arXiv:2511.01200, 2025
Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui- Bin Bian, and Hong Liu. Mosa: Motion generation with scalable autoregressive modeling.arXiv preprint arXiv:2511.01200, 2025
-
[68]
Learning hierarchical cross-modal associa- tion for co-speech gesture generation
Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and 11 Bolei Zhou. Learning hierarchical cross-modal associa- tion for co-speech gesture generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10462–10472, 2022
work page 2022
-
[69]
GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation
Xinran Liu, Xu Dong, Diptesh Kanojia, Wenwu Wang, and Zhenhua Feng. Gcdance: Genre-controlled 3d full body dance generation driven by music.arXiv preprint arXiv:2502.18309, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[70]
Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. Dgfm: Full body dance generation driven by mu- sic foundation models.arXiv preprint arXiv:2502.20176, 2025
-
[71]
Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015
work page 2015
-
[72]
Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, and Yi Yang. Diversemotion: Towards diverse human motion generation via discrete diffusion.arXiv preprint arXiv:2309.01372, 2023
-
[73]
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vil- bert: Pretraining task-agnostic visiolinguistic representa- tions for vision-and-language tasks.Advances in neural information processing systems, 32, 2019
work page 2019
-
[74]
Build llm-based zero-shot streaming tts system with cosyvoice
Xiang Lyu, Yuxuan Wang, Tianyu Zhao, Hao Wang, Huadai Liu, and Zhihao Du. Build llm-based zero-shot streaming tts system with cosyvoice. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2025
work page 2025
-
[75]
Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your- emoji-faster: Towards efficient, fine-controllable, and ex- pressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025
-
[76]
Troje, Gerard Pons-Moll, and Michael J
Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Con- ference on Computer Vision, pages 5442–5451, 2019
work page 2019
-
[77]
arXiv preprint arXiv:2510.16258 , year=
Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025
-
[78]
Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis
Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1388–1398, 2024
work page 2024
-
[79]
Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal con- trastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022
work page 2022
-
[80]
Rajmund Nagy, Hendric V oss, Thanh Hoang-Minh, Mi- hail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, et al. Ges- ture generation (still) needs improved human evaluation practices: Insights from a community-driven state-of-the- art benchmark.arXiv preprint arXiv:2511.01233, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.