ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Ali Sartaz Khan; Changan Chen; Ehsan Adeli; Heng Yu; Juze Zhang; Shrinidhi K. Lakshmikanth; Tiange Xiang; Xin Chen

arxiv: 2512.14234 · v2 · submitted 2025-12-16 · 💻 cs.CV

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Juze Zhang , Changan Chen , Xin Chen , Heng Yu , Tiange Xiang , Ali Sartaz Khan , Shrinidhi K. Lakshmikanth , Ehsan Adeli This is my paper

Pith reviewed 2026-05-16 22:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords conversational agent3D virtual bodymultimodal generationco-speech motionmixture of expertsspeech-language-behavior modelagentic interaction

0 comments

The pith

ViBES builds a 3D conversational agent that jointly plans language, prosody, and body movements from speech or text inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViBES as a speech-language-behavior model that moves past isolated text-to-motion or co-speech gesture generation. It uses a mixture-of-modality-experts transformer where speech, facial, and body experts process interleaved tokens with hard routing per modality and cross-expert attention to share context. This design lets the agent decide when to move, adapt across dialogue turns, and respond to mixed user inputs like spoken words, typed text, or mid-conversation body directives. The result is measured through improved alignment between generated dialogue and 3D actions on multi-turn benchmarks.

Core claim

ViBES jointly generates language and 3D body actions by processing interleaved multimodal token streams through modality-partitioned transformer experts connected by cross-expert attention, enabling agentic planning of when and how to act during conversation rather than mapping fixed utterances to motion clips.

What carries the argument

Mixture-of-modality-experts (MoME) backbone that applies hard routing by modality to separate transformer experts for speech, facial expression, and body motion while sharing information via cross-expert attention on interleaved token streams.

Load-bearing premise

Hard routing by modality plus cross-expert attention on interleaved tokens is enough to keep language and body actions coherent across multiple dialogue turns without losing cross-modal context.

What would settle it

A multi-turn dialogue test where the agent produces body motions that contradict the spoken content or timing after three or more turns, showing loss of joint planning.

Figures

Figures reproduced from arXiv: 2512.14234 by Ali Sartaz Khan, Changan Chen, Ehsan Adeli, Heng Yu, Juze Zhang, Shrinidhi K. Lakshmikanth, Tiange Xiang, Xin Chen.

**Figure 1.** Figure 1: We present a novel speech–language–behavior (SLB) model with a mixture–of–modality–experts (MoME) architecture that [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Model overview. The model adopts an autoregressive structure that converts all modalities into a unified token space. It consists [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative examples of conversational behavior. We [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison with prior methods on the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison for text-to-motion. While [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of our YouTube data processing pipeline. The [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Application: driving video generation with ViBES. We [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Word cloud visualization of our Converse3D data from [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Additional qualitative comparisons with prior methods [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Additional qualitative examples for text-to-motion generation. Given a text caption, we compare the 3D motion generated by our [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: System prompt for motion-conditioned conversational answer generation. We instruct the model to generate answers as if the [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: System Prompt for Conversational Agent Behavior Evaluation. We instruct the model to assess semantic alignment, content–motion match, and social appropriateness on videos rendered from motion sequences. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

read the original abstract

Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task co-speech gesture or text-to-motion that maps a fixed utterance to motion clips-without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue-motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction. Code and data will be made available at: ai.stanford.edu/~juze/ViBES/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ViBES tries joint language-and-body planning in a 3D conversational agent via modality-routed experts, but the abstract gives no numbers to show it works.

read the letter

The one thing to know is that this paper shifts from treating body motion as a separate translation step after speech or text to making the agent plan both together in an ongoing dialogue. It uses a mixture-of-modality-experts backbone with separate transformers for speech, face, and body, hard-routed on interleaved tokens and linked only by cross-attention, all sitting on top of pretrained speech-language models. That lets users mix speech, typing, and direct body commands while the system streams responses with controllable hooks for timing and style.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViBES, a conversational 3D agent based on a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone. Modality-partitioned transformer experts handle speech, facial expression, and body motion on interleaved token streams using hard routing by modality and cross-expert attention. The system jointly plans language and movement for multi-turn dialogue, supports mixed-initiative input, and claims consistent gains over co-speech gesture and text-to-motion baselines on dialogue-motion alignment and behavior quality metrics, advancing beyond isolated translation tasks toward agentic virtual bodies.

Significance. If the empirical results hold, the work would advance integrated multimodal conversational agents by combining pretrained speech-language components with controllable 3D behavior generation, addressing brittle timing and fragmented modality stacks in prior systems.

major comments (2)

[Abstract] Abstract: the claim of 'consistent gains over strong co-speech and text-to-motion baselines' on dialogue-motion alignment metrics is unsupported by any numerical values, error bars, data-split details, or baseline implementation descriptions, which is load-bearing for the central superiority claim.
[Model description] Model section (MoME backbone): hard routing splits parameters per expert while cross-expert attention is the sole sharing mechanism; no ablations on routing or long-horizon multi-turn coherence metrics are reported, leaving unverified whether this suffices for joint language-body planning without context loss.

minor comments (2)

[Abstract] Abstract: the phrase 'Code and data will be made available' should include a specific repository URL or DOI for reproducibility.
[Introduction] The terms 'controllable behavior hooks' and 'streaming responses' are introduced without precise definitions or interface specifications.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the focus on strengthening the empirical claims and model analysis. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where feasible.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'consistent gains over strong co-speech and text-to-motion baselines' on dialogue-motion alignment metrics is unsupported by any numerical values, error bars, data-split details, or baseline implementation descriptions, which is load-bearing for the central superiority claim.

Authors: We agree that the abstract should explicitly support the superiority claim with quantitative details. In the revised manuscript, we will update the abstract to include specific numerical gains on dialogue-motion alignment metrics (drawn from the results in Section 4), along with error bars, data-split information, and pointers to the baseline implementation details provided in the supplementary material. The full experimental comparisons, including all metrics and baselines, remain unchanged in the body of the paper. revision: yes
Referee: [Model description] Model section (MoME backbone): hard routing splits parameters per expert while cross-expert attention is the sole sharing mechanism; no ablations on routing or long-horizon multi-turn coherence metrics are reported, leaving unverified whether this suffices for joint language-body planning without context loss.

Authors: The MoME backbone employs hard routing by modality to partition parameters for efficiency while relying on cross-expert attention for inter-modality information sharing during joint language-body planning. We acknowledge the value of ablations; however, the current work prioritizes end-to-end system evaluation over isolated routing studies. We will expand the model section with additional justification for the design and include any long-horizon coherence metrics already computed as part of our multi-turn dialogue experiments. Comprehensive routing ablations are not added at this stage due to computational scope. revision: partial

Circularity Check

0 steps flagged

No significant circularity in architectural description or empirical evaluation

full rationale

The paper describes a multimodal SLB model with MoME backbone built from pretrained speech-language components, using hard routing and cross-expert attention for interleaved tokens. All claims rest on empirical benchmarks for dialogue-motion alignment rather than any mathematical derivations, fitted parameters renamed as predictions, or self-citation chains. No equations appear that reduce outputs to inputs by construction, and the architecture is presented as an engineering composition evaluated externally. This matches the default expectation of a self-contained system description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract relies on the existence of strong pretrained speech-language models and assumes that modality-partitioned experts with cross-attention can integrate interleaved streams; no explicit free parameters, new axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5643 in / 1149 out tokens · 67431 ms · 2026-05-16T22:00:49.073235+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ViBES is a speech–language–behavior (SLB) model with a mixture–of–modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We standardize on a 25 fps universal clock... fractional index... Rotary positional encoding.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control
cs.GR 2026-05 unverdicted novelty 6.0

A new diffusion transformer policy with joint attention over actions, states, and text plus RL post-training outperforms prior methods on language alignment and motion quality for humanoid control.
IAM: Identity-Aware Human Motion and Shape Joint Generation
cs.CV 2026-04 unverdicted novelty 6.0

IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 3 Pith papers · 26 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716–23736, 2022

work page 2022
[3]

Vlmo: Unified vision- language pre-training with mixture-of-modality-experts

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision- language pre-training with mixture-of-modality-experts. Advances in neural information processing systems, 35: 32897–32912, 2022

work page 2022
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision- language-action flow model for general robot control. corr, abs/2410.24164, 2024. doi: 10.48550.arXiv preprint ARXIV .2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025

Boson AI. Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025. GitHub repository. Release blog available athttps://www.boson.ai/blog/ higgs-audio-v2

work page 2025
[6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020

work page 1901
[8]

Digital life project: Au- tonomous 3d characters with social intelligence

Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, et al. Digital life project: Au- tonomous 3d characters with social intelligence. InCVPR, pages 582–592, 2024

work page 2024
[9]

Enabling synergistic full-body control in prompt-based co-speech motion generation

Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, and Kun Zhou. Enabling synergistic full-body control in prompt-based co-speech motion generation. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6774–6783, 2024

work page 2024
[10]

The language of motion: Unifying verbal and non-verbal language of 3d human motion

Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 6200–6211, 2025

work page 2025
[11]

Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025

Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, et al. Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025

work page arXiv 2025
[12]

Rapverse: Coherent vocals and whole-body motion generation from text

Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Zixin Wang, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, and Chuang Gan. Rapverse: Coherent vocals and whole-body motion generation from text. InICCV, pages 10097–10107, 2025

work page 2025
[13]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InCVPR, pages 18000–18010, 2023

work page 2023
[14]

Artalk: Speech- driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323,

Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech-driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323, 2025

work page arXiv 2025
[15]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar

Peishan Cong, Yiteng Xu, Yiming Ren, Juze Zhang, Lan Xu, Jingya Wang, Jingyi Yu, and Yuexin Ma. Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 461–469, 2023

work page 2023
[18]

Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025

Radek Dan ˇeˇcek, Carolin Schmitt, Senya Polikovsky, and Michael J Black. Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025

work page arXiv 2025
[19]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre D ´efossez, Laurent Mazar´e, Manu Orsini, Am´elie Royer, Patrick P ´erez, Herv ´e J ´egou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model 9 for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multi- modal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model. 2023

work page 2023
[22]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text- to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xi- ang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable stream- ing speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, et al. Cosyvoice 3: Towards in-the-wild speech gen- eration via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

work page 2024
[26]

Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model

Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model. InECCV, pages 204–221. Springer, 2024

work page 2024
[27]

Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos

Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos, 2022

work page 2022
[28]

Zeroeggs: Zero-shot example-based gesture generation from speech

Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023

work page 2023
[29]

Duetgen: Music driven two-person dance generation via hierarchical masked modeling

Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. Duetgen: Music driven two-person dance generation via hierarchical masked modeling. InProceed- ings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

work page 2025
[30]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Humans in 4D: Reconstructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023

work page 2023
[32]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InCVPR, pages 5152–5161, 2022

work page 2022
[33]

Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts

Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts. InECCV, pages 580–597. Springer, 2022

work page 2022
[34]

Momask: Generative masked mod- eling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked mod- eling of 3d human motions. InCVPR, pages 1900–1910, 2024

work page 1900
[35]

Liveportrait: Efficient portrait animation with stitching and retargeting control

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Live- portrait: Efficient portrait animation with stitching and re- targeting control.arXiv preprint arXiv:2407.03168, 2024

work page arXiv 2024
[36]

Learning speech-driven 3d conversational gestures from video

Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed El- gharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent vir- tual agents, pages 101–108, 2021

work page 2021
[37]

Video-bench: Human-aligned video gen- eration benchmark

Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video gen- eration benchmark. InCVPR, pages 18858–18868, 2025

work page 2025
[38]

Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025

Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, and Shiguang Shan. Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025

work page arXiv 2025
[39]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

work page 2021
[40]

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Min- grui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction.arXiv preprint arXiv:2502.11946, 2025

work page internal anchor Pith review arXiv 2025
[41]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, pages 21807–21818, 2024

work page 2024
[42]

Beat-it: Beat- synchronized multi-condition 3d dance generation

Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, and Shengfeng He. Beat-it: Beat- synchronized multi-condition 3d dance generation. InEu- ropean conference on computer vision, pages 273–290. Springer, 2024

work page 2024
[43]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023

work page 2023
[45]

Loopy: Taming audio- driven portrait avatar with long-term motion dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio- driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024

work page arXiv 2024
[46]

Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters

Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InCVPR, 2025

work page 2025
[47]

Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, and Mingyuan Gao. Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

work page arXiv 2025
[48]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis

Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772, 2019

work page 2019
[50]

Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders

Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, and Linchao Bao. Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders. InICCV, pages 11293–11302, 2021

work page 2021
[51]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023

work page 2023
[52]

Genmo: A GENer- alist model for human MOtion

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A GENer- alist model for human MOtion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025
[53]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021

work page 2021
[54]

Finedance: A fine-grained choreography dataset for 3d full body dance generation

Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InICCV, pages 10234–10243, 2023

work page 2023
[55]

Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017

work page 2017
[56]

Infinityhuman: Towards long-term audio-driven hu- man.arXiv preprint arXiv:2508.20210, 2025

Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, and Zehuan Yuan. Infinityhuman: Towards long-term audio-driven hu- man.arXiv preprint arXiv:2508.20210, 2025

work page arXiv 2025
[57]

Llava-slt: Visual language tuning for sign language transla- tion.arXiv preprint arXiv:2412.16524, 2024

Han Liang, Chengyu Huang, Yuecheng Xu, Cheng Tang, Weicai Ye, Juze Zhang, Xin Chen, Jingyi Yu, and Lan Xu. Llava-slt: Visual language tuning for sign language transla- tion.arXiv preprint arXiv:2412.16524, 2024

work page arXiv 2024
[58]

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

work page internal anchor Pith review arXiv 2024
[59]

Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models

Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 13847– 13858, 2025

work page 2025
[60]

arXiv preprint arXiv:2510.26794 , year=

Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qing- ping Sun, et al. The quest for generalizable motion gen- eration: Data, model, and evaluation.arXiv preprint arXiv:2510.26794, 2025

work page arXiv 2025
[61]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Disco: Dis- entangled implicit content and rhythm learning for di- verse co-speech gestures synthesis

Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Disco: Dis- entangled implicit content and rhythm learning for di- verse co-speech gestures synthesis. InProceedings of the 30th ACM International Conference on Multimedia, pages 3764–3773, 2022

work page 2022
[63]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, pages 612–630. Springer, 2022

work page 2022
[64]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, 2022

work page 2022
[65]

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InCVPR, 2024

work page 2024
[66]

Mimicparts: Part-aware style injec- tion for speech-driven 3d motion generation.arXiv preprint arXiv:2510.13208, 2025

Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, and Xiangmin Xu. Mimicparts: Part-aware style injec- tion for speech-driven 3d motion generation.arXiv preprint arXiv:2510.13208, 2025

work page arXiv 2025
[67]

Mosa: Motion generation with scalable autoregressive modeling.arXiv preprint arXiv:2511.01200, 2025

Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui- Bin Bian, and Hong Liu. Mosa: Motion generation with scalable autoregressive modeling.arXiv preprint arXiv:2511.01200, 2025

work page arXiv 2025
[68]

Learning hierarchical cross-modal associa- tion for co-speech gesture generation

Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and 11 Bolei Zhou. Learning hierarchical cross-modal associa- tion for co-speech gesture generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10462–10472, 2022

work page 2022
[69]

GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

Xinran Liu, Xu Dong, Diptesh Kanojia, Wenwu Wang, and Zhenhua Feng. Gcdance: Genre-controlled 3d full body dance generation driven by music.arXiv preprint arXiv:2502.18309, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

Dgfm: Full body dance generation driven by mu- sic foundation models.arXiv preprint arXiv:2502.20176, 2025

Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. Dgfm: Full body dance generation driven by mu- sic foundation models.arXiv preprint arXiv:2502.20176, 2025

work page arXiv 2025
[71]

Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015

work page 2015
[72]

Diversemotion: Towards diverse human motion generation via discrete diffusion.arXiv preprint arXiv:2309.01372, 2023

Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, and Yi Yang. Diversemotion: Towards diverse human motion generation via discrete diffusion.arXiv preprint arXiv:2309.01372, 2023

work page arXiv 2023
[73]

Vil- bert: Pretraining task-agnostic visiolinguistic representa- tions for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vil- bert: Pretraining task-agnostic visiolinguistic representa- tions for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

work page 2019
[74]

Build llm-based zero-shot streaming tts system with cosyvoice

Xiang Lyu, Yuxuan Wang, Tianyu Zhao, Hao Wang, Huadai Liu, and Zhihao Du. Build llm-based zero-shot streaming tts system with cosyvoice. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2025

work page 2025
[75]

Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025

Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your- emoji-faster: Towards efficient, fine-controllable, and ex- pressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025

work page arXiv 2025
[76]

Troje, Gerard Pons-Moll, and Michael J

Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Con- ference on Computer Vision, pages 5442–5451, 2019

work page 2019
[77]

arXiv preprint arXiv:2510.16258 , year=

Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025

work page arXiv 2025
[78]

Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis

Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1388–1398, 2024

work page 2024
[79]

Multimodal con- trastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal con- trastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

work page 2022
[80]

Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Rajmund Nagy, Hendric V oss, Thanh Hoang-Minh, Mi- hail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, et al. Ges- ture generation (still) needs improved human evaluation practices: Insights from a community-driven state-of-the- art benchmark.arXiv preprint arXiv:2511.01233, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716–23736, 2022

work page 2022

[3] [3]

Vlmo: Unified vision- language pre-training with mixture-of-modality-experts

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision- language pre-training with mixture-of-modality-experts. Advances in neural information processing systems, 35: 32897–32912, 2022

work page 2022

[4] [4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision- language-action flow model for general robot control. corr, abs/2410.24164, 2024. doi: 10.48550.arXiv preprint ARXIV .2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025

Boson AI. Higgs Audio V2: Redefining Expressiveness in Audio Generation.https://github.com/boson- ai/higgs- audio, 2025. GitHub repository. Release blog available athttps://www.boson.ai/blog/ higgs-audio-v2

work page 2025

[6] [6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world con- trol at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neu- ral information processing systems, 33:1877–1901, 2020

work page 1901

[8] [8]

Digital life project: Au- tonomous 3d characters with social intelligence

Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, et al. Digital life project: Au- tonomous 3d characters with social intelligence. InCVPR, pages 582–592, 2024

work page 2024

[9] [9]

Enabling synergistic full-body control in prompt-based co-speech motion generation

Bohong Chen, Yumeng Li, Yao-Xiang Ding, Tianjia Shao, and Kun Zhou. Enabling synergistic full-body control in prompt-based co-speech motion generation. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6774–6783, 2024

work page 2024

[10] [10]

The language of motion: Unifying verbal and non-verbal language of 3d human motion

Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3d human motion. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 6200–6211, 2025

work page 2025

[11] [11]

Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025

Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, et al. Talkcuts: A large-scale dataset for multi-shot human speech video generation.arXiv preprint arXiv:2510.07249, 2025

work page arXiv 2025

[12] [12]

Rapverse: Coherent vocals and whole-body motion generation from text

Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Zixin Wang, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, and Chuang Gan. Rapverse: Coherent vocals and whole-body motion generation from text. InICCV, pages 10097–10107, 2025

work page 2025

[13] [13]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InCVPR, pages 18000–18010, 2023

work page 2023

[14] [14]

Artalk: Speech- driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323,

Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech-driven 3d head animation via autoregressive model.arXiv preprint arXiv:2502.20323, 2025

work page arXiv 2025

[15] [15]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar

Peishan Cong, Yiteng Xu, Yiming Ren, Juze Zhang, Lan Xu, Jingya Wang, Jingyi Yu, and Yuexin Ma. Weakly su- pervised 3d multi-person pose estimation for large-scale scenes based on monocular camera and single lidar. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 461–469, 2023

work page 2023

[18] [18]

Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025

Radek Dan ˇeˇcek, Carolin Schmitt, Senya Polikovsky, and Michael J Black. Supervising 3d talking head avatars with analysis-by-audio-synthesis.arXiv preprint arXiv:2504.13386, 2025

work page arXiv 2025

[19] [19]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre D ´efossez, Laurent Mazar´e, Manu Orsini, Am´elie Royer, Patrick P ´erez, Herv ´e J ´egou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model 9 for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multi- modal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Palm-e: An embodied multimodal language model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model. 2023

work page 2023

[22] [22]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text- to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xi- ang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al. Cosyvoice 2: Scalable stream- ing speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Xian Shi, Keyu An, et al. Cosyvoice 3: Towards in-the-wild speech gen- eration via scaling-up and post-training.arXiv preprint arXiv:2505.17589, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

work page 2024

[26] [26]

Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model

Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, and Lei Yang. Unitalker: Scaling up audio-driven 3d facial anima- tion through a unified model. InECCV, pages 204–221. Springer, 2024

work page 2024

[27] [27]

Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos

Panagiotis P. Filntisis, George Retsinas, Foivos Paraperas- Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos, 2022

work page 2022

[28] [28]

Zeroeggs: Zero-shot example-based gesture generation from speech

Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F Troje, and Marc-Andr ´e Carbonneau. Zeroeggs: Zero-shot example-based gesture generation from speech. InCom- puter Graphics Forum, pages 206–216. Wiley Online Li- brary, 2023

work page 2023

[29] [29]

Duetgen: Music driven two-person dance generation via hierarchical masked modeling

Anindita Ghosh, Bing Zhou, Rishabh Dabral, Jian Wang, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek, and Chuan Guo. Duetgen: Music driven two-person dance generation via hierarchical masked modeling. InProceed- ings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–11, 2025

work page 2025

[30] [30]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [31]

Humans in 4D: Reconstructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023

work page 2023

[32] [32]

Generating diverse and natural 3d human motions from text

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. InCVPR, pages 5152–5161, 2022

work page 2022

[33] [33]

Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts

Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gen- eration of 3d human motions and texts. InECCV, pages 580–597. Springer, 2022

work page 2022

[34] [34]

Momask: Generative masked mod- eling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked mod- eling of 3d human motions. InCVPR, pages 1900–1910, 2024

work page 1900

[35] [35]

Liveportrait: Efficient portrait animation with stitching and retargeting control

Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Live- portrait: Efficient portrait animation with stitching and re- targeting control.arXiv preprint arXiv:2407.03168, 2024

work page arXiv 2024

[36] [36]

Learning speech-driven 3d conversational gestures from video

Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed El- gharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent vir- tual agents, pages 101–108, 2021

work page 2021

[37] [37]

Video-bench: Human-aligned video gen- eration benchmark

Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video gen- eration benchmark. InCVPR, pages 18858–18868, 2025

work page 2025

[38] [38]

Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025

Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, and Shiguang Shan. Motionverse: A unified multimodal framework for motion comprehension, generation and edit- ing.arXiv preprint arXiv:2509.23635, 2025

work page arXiv 2025

[39] [39]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

work page 2021

[40] [40]

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Min- grui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction.arXiv preprint arXiv:2502.11946, 2025

work page internal anchor Pith review arXiv 2025

[41] [41]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, pages 21807–21818, 2024

work page 2024

[42] [42]

Beat-it: Beat- synchronized multi-condition 3d dance generation

Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, and Shengfeng He. Beat-it: Beat- synchronized multi-condition 3d dance generation. InEu- ropean conference on computer vision, pages 273–290. Springer, 2024

work page 2024

[43] [43]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- guage.NeurIPS, 36:20067–20079, 2023

work page 2023

[45] [45]

Loopy: Taming audio- driven portrait avatar with long-term motion dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, and Yanbo Zheng. Loopy: Taming audio- driven portrait avatar with long-term motion dependency. arXiv preprint arXiv:2409.02634, 2024

work page arXiv 2024

[46] [46]

Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters

Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, and Ziwei Liu. Solami: Social vision- language-action modeling for immersive interaction with 3d autonomous characters. InCVPR, 2025

work page 2025

[47] [47]

Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, and Mingyuan Gao. Omnihuman-1.5: Instilling an active mind in avatars via cognitive simulation.arXiv preprint arXiv:2508.19209, 2025

work page arXiv 2025

[48] [48]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- vla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis

Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S Srinivasa, and Yaser Sheikh. Talking with hands 16.2 m: A large-scale dataset of synchronized body- finger motion and audio for conversational motion analy- sis and synthesis. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 763–772, 2019

work page 2019

[50] [50]

Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders

Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He, and Linchao Bao. Audio2gestures: Generating diverse gestures from speech audio with conditional varia- tional autoencoders. InICCV, pages 11293–11302, 2021

work page 2021

[51] [51]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023

work page 2023

[52] [52]

Genmo: A GENer- alist model for human MOtion

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: A GENer- alist model for human MOtion. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025

[53] [53]

Ross, and Angjoo Kanazawa

Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021

work page 2021

[54] [54]

Finedance: A fine-grained choreography dataset for 3d full body dance generation

Ronghui Li, Junfan Zhao, Yachao Zhang, Mingyang Su, Zeping Ren, Han Zhang, Yansong Tang, and Xiu Li. Finedance: A fine-grained choreography dataset for 3d full body dance generation. InICCV, pages 10234–10243, 2023

work page 2023

[55] [55]

Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017

work page 2017

[56] [56]

Infinityhuman: Towards long-term audio-driven hu- man.arXiv preprint arXiv:2508.20210, 2025

Xiaodi Li, Pan Xie, Yi Ren, Qijun Gan, Chen Zhang, Fangyuan Kong, Xiang Yin, Bingyue Peng, and Zehuan Yuan. Infinityhuman: Towards long-term audio-driven hu- man.arXiv preprint arXiv:2508.20210, 2025

work page arXiv 2025

[57] [57]

Llava-slt: Visual language tuning for sign language transla- tion.arXiv preprint arXiv:2412.16524, 2024

Han Liang, Chengyu Huang, Yuecheng Xu, Cheng Tang, Weicai Ye, Juze Zhang, Xin Chen, Jingyi Yu, and Lan Xu. Llava-slt: Visual language tuning for sign language transla- tion.arXiv preprint arXiv:2412.16524, 2024

work page arXiv 2024

[58] [58]

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

work page internal anchor Pith review arXiv 2024

[59] [59]

Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models

Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 13847– 13858, 2025

work page 2025

[60] [60]

arXiv preprint arXiv:2510.26794 , year=

Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qing- ping Sun, et al. The quest for generalizable motion gen- eration: Data, model, and evaluation.arXiv preprint arXiv:2510.26794, 2025

work page arXiv 2025

[61] [61]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [62]

Disco: Dis- entangled implicit content and rhythm learning for di- verse co-speech gestures synthesis

Haiyang Liu, Naoya Iwamoto, Zihao Zhu, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Disco: Dis- entangled implicit content and rhythm learning for di- verse co-speech gestures synthesis. InProceedings of the 30th ACM International Conference on Multimedia, pages 3764–3773, 2022

work page 2022

[63] [63]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, pages 612–630. Springer, 2022

work page 2022

[64] [64]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, 2022

work page 2022

[65] [65]

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J. Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InCVPR, 2024

work page 2024

[66] [66]

Mimicparts: Part-aware style injec- tion for speech-driven 3d motion generation.arXiv preprint arXiv:2510.13208, 2025

Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, and Xiangmin Xu. Mimicparts: Part-aware style injec- tion for speech-driven 3d motion generation.arXiv preprint arXiv:2510.13208, 2025

work page arXiv 2025

[67] [67]

Mosa: Motion generation with scalable autoregressive modeling.arXiv preprint arXiv:2511.01200, 2025

Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui- Bin Bian, and Hong Liu. Mosa: Motion generation with scalable autoregressive modeling.arXiv preprint arXiv:2511.01200, 2025

work page arXiv 2025

[68] [68]

Learning hierarchical cross-modal associa- tion for co-speech gesture generation

Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai, and 11 Bolei Zhou. Learning hierarchical cross-modal associa- tion for co-speech gesture generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10462–10472, 2022

work page 2022

[69] [69]

GCDance: Genre-Controlled Music-Driven 3D Full Body Dance Generation

Xinran Liu, Xu Dong, Diptesh Kanojia, Wenwu Wang, and Zhenhua Feng. Gcdance: Genre-controlled 3d full body dance generation driven by music.arXiv preprint arXiv:2502.18309, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[70] [70]

Dgfm: Full body dance generation driven by mu- sic foundation models.arXiv preprint arXiv:2502.20176, 2025

Xinran Liu, Zhenhua Feng, Diptesh Kanojia, and Wenwu Wang. Dgfm: Full body dance generation driven by mu- sic foundation models.arXiv preprint arXiv:2502.20176, 2025

work page arXiv 2025

[71] [71]

Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015

work page 2015

[72] [72]

Diversemotion: Towards diverse human motion generation via discrete diffusion.arXiv preprint arXiv:2309.01372, 2023

Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, and Yi Yang. Diversemotion: Towards diverse human motion generation via discrete diffusion.arXiv preprint arXiv:2309.01372, 2023

work page arXiv 2023

[73] [73]

Vil- bert: Pretraining task-agnostic visiolinguistic representa- tions for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vil- bert: Pretraining task-agnostic visiolinguistic representa- tions for vision-and-language tasks.Advances in neural information processing systems, 32, 2019

work page 2019

[74] [74]

Build llm-based zero-shot streaming tts system with cosyvoice

Xiang Lyu, Yuxuan Wang, Tianyu Zhao, Hao Wang, Huadai Liu, and Zhihao Du. Build llm-based zero-shot streaming tts system with cosyvoice. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–2. IEEE, 2025

work page 2025

[75] [75]

Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025

Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, et al. Follow-your- emoji-faster: Towards efficient, fine-controllable, and ex- pressive freestyle portrait animation.arXiv preprint arXiv:2509.16630, 2025

work page arXiv 2025

[76] [76]

Troje, Gerard Pons-Moll, and Michael J

Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InInternational Con- ference on Computer Vision, pages 5442–5451, 2019

work page 2019

[77] [77]

arXiv preprint arXiv:2510.16258 , year=

Claire McLean, Makenzie Meendering, Tristan Swartz, Orri Gabbay, Alexandra Olsen, Rachel Jacobs, Nicholas Rosen, Philippe de Bree, Tony Garcia, Gadsden Merrill, et al. Embody 3d: A large-scale multimodal motion and behavior dataset.arXiv preprint arXiv:2510.16258, 2025

work page arXiv 2025

[78] [78]

Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis

Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, and Christian Theobalt. Convofusion: Multi-modal conversational dif- fusion for co-speech gesture synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1388–1398, 2024

work page 2024

[79] [79]

Multimodal con- trastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multimodal con- trastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

work page 2022

[80] [80]

Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Rajmund Nagy, Hendric V oss, Thanh Hoang-Minh, Mi- hail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, et al. Ges- ture generation (still) needs improved human evaluation practices: Insights from a community-driven state-of-the- art benchmark.arXiv preprint arXiv:2511.01233, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025