DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Koki Nagano; Hongyu Liu; Seonwook Park; Tianye Li; Amrita Mazumdar; Christian Jacobsen; Shengze Wang; Michael Stengel; Rajarshi Roy; Ka Chun Cheung; Simon See; Shalini De Mello

arxiv: 2606.03874 · v1 · pith:XDVQVNZ7new · submitted 2026-06-02 · 💻 cs.CV · cs.RO

Pith reviewed 2026-06-28 10:20 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords full-duplex model · dyadic interaction · speech and motion · dual-tower transformer · token interleaving · RoPE positional encoding · human interaction · streaming generation

The pith

DyaPlex adds a streaming motion pathway to a frozen full-duplex speech model via dual-tower Transformers for synchronized dyadic interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DyaPlex is designed as a full-duplex model that can both listen and speak while also generating physical motions in real time during conversations with another person. The core idea is to take an existing speech model that already handles full-duplex conversations and attach a separate but coupled motion generation system without breaking the original model's abilities. This is done through a dual-tower setup where tokens for speech and motion are interleaved and aligned in time using a special positional encoding. The result is a model trained on thousands of hours of data that reportedly outperforms previous approaches on tasks involving single or paired human interactions.
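The interleaving idea can be illustrated with a minimal sketch. The abstract does not spell out the token layout, so the ordering below (partner token first, then agent token, within each frame) and the function name `interleave_dyadic` are assumptions made for illustration, not the authors' scheme.

```python
from typing import List, Tuple


def interleave_dyadic(agent: List[int], partner: List[int]) -> List[Tuple[int, str, int]]:
    """Merge two frame-aligned token streams into one time-ordered sequence.

    Each output item is (frame_index, speaker, token). Within a frame the
    partner token is placed before the agent token, so a causal model sees
    the partner's current frame before predicting the agent's. This ordering
    is an assumption for illustration only.
    """
    if len(agent) != len(partner):
        raise ValueError("streams must cover the same number of frames")
    sequence: List[Tuple[int, str, int]] = []
    for frame, (agent_tok, partner_tok) in enumerate(zip(agent, partner)):
        sequence.append((frame, "partner", partner_tok))
        sequence.append((frame, "agent", agent_tok))
    return sequence
```

Keeping the frame index on every item makes it easy to assign both speakers the same temporal position later, which is the property a time-aligned positional encoding relies on.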

Core claim

Our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features.
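One plausible reading of "time-aligned speech-motion RoPE" is that rotary angles are driven by a token's timestamp in seconds rather than by its index, so speech and motion tokens sampled at different rates share a position when they describe the same instant. The sketch below follows that reading; the frame rates, the base of 10000, and the helper names are assumptions, and the paper may define the alignment differently.

```python
import math
from typing import List


def token_time(index: int, rate_hz: float) -> float:
    """Timestamp in seconds of the token at `index` in a stream sampled at `rate_hz`."""
    return index / rate_hz


def rope_rotate(vec: List[float], t_seconds: float, base: float = 10000.0) -> List[float]:
    """Apply rotary position embedding using a time in seconds as the position.

    Consecutive pairs (vec[i], vec[i+1]) are rotated by t * base**(-i/d),
    the standard RoPE frequency schedule for an even dimension d.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = t_seconds * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical rates: motion at 25 Hz, speech at 12.5 Hz.
motion_t = token_time(50, 25.0)   # motion token 50 and ...
speech_t = token_time(25, 12.5)   # ... speech token 25 share one timestamp
```

With this choice, the attention score between a rotated motion query and a rotated speech key depends only on their time difference, so tokens at the same instant interact as if no positional offset were applied, regardless of the two streams' index scales.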

What carries the argument

dual-tower Transformer architecture with unified dyadic token interleaving and time-aligned speech-motion RoPE added to a frozen base speech model

Load-bearing premise

Adding the dual-tower Transformer and motion pathway to the frozen base speech model does not degrade its zero-shot conversational reasoning abilities.

What would settle it

Measuring the base speech model's performance on zero-shot conversational tasks before and after integrating the motion pathway; a drop would indicate the assumption fails.
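A check of this kind could be as simple as comparing the two systems' speech outputs on a fixed prompt set. The harness below is a hypothetical sketch: `base` and `augmented` stand in for any callables that map a prompt to a decoded response, and exact-match comparison only makes sense under deterministic decoding (greedy, or a fixed seed), which is assumed here.

```python
from typing import Callable, List, Sequence, Tuple


def speech_drift(
    base: Callable[[str], str],
    augmented: Callable[[str], str],
    prompts: Sequence[str],
) -> Tuple[float, List[str]]:
    """Return the fraction of prompts whose output changed, and those prompts.

    A drift of 0.0 means the augmented system reproduced the base model's
    output on every prompt in the set.
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    changed = [p for p in prompts if base(p) != augmented(p)]
    return len(changed) / len(prompts), changed


# Toy stand-ins for the two systems (not real models).
def toy_base(prompt: str) -> str:
    return prompt.upper()


def toy_augmented(prompt: str) -> str:
    return prompt.upper()


rate, changed_prompts = speech_drift(toy_base, toy_augmented, ["hello", "how are you"])
```

For sampled decoding, the exact-match test would need to be replaced by a score-based comparison, such as the coherence and response-quality metrics the review asks for.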

Figures

Figures reproduced from arXiv: 2606.03874 by Amrita Mazumdar, Christian Jacobsen, Hongyu Liu, Ka Chun Cheung, Koki Nagano, Michael Stengel, Rajarshi Roy, Seonwook Park, Shalini De Mello, Shengze Wang, Simon See, Tianye Li.

**Figure 1.** DyaPlex is a causal, full-duplex speech-motion model that simultaneously speaks and listens to a partner while perceiving the partner's motion and generating the agent's motion. The model could be applied to dyadic interactions between a human user and an agent or robot (right), as well as to generating synthetic speech-motion dyadic interaction data.

**Figure 2.** Architecture overview. DyaPlex consists of three components: (a) part-aware RVQ-VAE decoders and (b) a frozen speech tower, and a trainable motion tower. The speech tower (PersonaPlex) takes in dyadic speech, emits agent speech autoregressively, and exposes its per-layer residual-stream hidden states $\{H_\ell\}_{\ell=1}^{32}$. For training, we precompute $\{H_\ell\}$ once (Sec. 4.1) to serve as cross-attention keys and values f…

**Figure 3.** Qualitative comparison on Seamless test clips. The agent (speaker…

**Figure 4.** In this example the pair reads a story in a turn-taking fashion and they use body language…

**Figure 5.** Comparisons without (a) and with (b) time-aligned speech-motion RoPE.

**Figure 6.** Interactive interface for the user study.

Original abstract

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DyaPlex adds a dual-tower motion pathway to a frozen speech model with interleaving and aligned RoPE, but supplies no ablations confirming the base model's zero-shot speech reasoning survives the change.

The main contribution is a dual-tower Transformer that keeps a base full-duplex speech model frozen while adding a streaming motion generator for dyadic human interaction. The design uses unified dyadic token interleaving and time-aligned speech-motion RoPE to couple the modalities without retraining the speech side. Training on the 4,000-hour Seamless Interaction dataset is a respectable scale for this kind of work.

The architecture itself is laid out clearly enough to see how they intend to preserve the base model's priors while generating aligned motion. The goal of handling reciprocal speech and motion in real time is a reasonable target for robotics and virtual agents.

The soft spot is the missing verification for the central assumption. The abstract states that zero-shot conversational reasoning is preserved, yet there are no pre/post speech-only scores, no ablation on coherence or response quality after the motion pathway is added, and no numbers or baselines attached to the SOTA claims on monadic and dyadic benchmarks. Without those checks it is hard to know whether the reported gains come from the new components or from some unintended side effect on the frozen model.

This paper is for groups already working on streaming multimodal agents. A reader who wants concrete ideas for token interleaving across speech and motion could pull something useful from the description, but anyone evaluating whether the approach actually works will need the missing experiments.

I would send it to peer review once the authors add the speech-preservation ablations; the idea is worth testing properly.

Referee Report

1 major / 0 minor

Summary. The paper presents DyaPlex, a streaming full-duplex speech-and-motion model for dyadic interaction. It builds a dual-tower Transformer on a frozen foundational full-duplex speech model, using unified dyadic token interleaving and time-aligned speech-motion RoPE to align autoregressive motions with speech features. Trained on the 4,000-hour Seamless Interaction dataset, the model is claimed to capture cross-speaker dependencies while preserving the base model's zero-shot conversational reasoning and achieving new state-of-the-art results on both monadic and dyadic human interaction benchmarks.

Significance. If the central architectural claim holds, namely that a motion pathway can be grafted onto a frozen speech model without degrading its zero-shot capabilities, the work would advance streaming multi-modal agents for synchronized speech-motion dyadic interaction. The 4,000-hour training scale and dual-tower design could provide a template for extending speech priors to motion, but without supporting metrics the significance cannot currently be assessed.

major comments (1)

[Abstract] The claim that the dual-tower Transformer 'preserves the zero-shot conversational reasoning of a frozen base speech model' is load-bearing for attributing any benchmark gains to the proposed architecture rather than interference or degradation. No pre/post integration results on speech-only tasks (e.g., coherence, response quality, or zero-shot benchmarks), ablation studies removing the motion pathway, or quantitative verification of preserved autoregressive generation are supplied.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of verifying that the frozen base model’s zero-shot capabilities are preserved. We address this point directly below and will strengthen the manuscript with additional evidence.

Referee: The claim that the dual-tower Transformer 'preserves the zero-shot conversational reasoning of a frozen base speech model' is load-bearing for attributing any benchmark gains to the proposed architecture rather than interference or degradation. No pre/post integration results on speech-only tasks (e.g., coherence, response quality, or zero-shot benchmarks), ablation studies removing the motion pathway, or quantitative verification of preserved autoregressive generation are supplied.

Authors: We agree that quantitative verification is necessary to substantiate the claim. Because the speech tower is kept frozen and the motion pathway operates through a separate tower with time-aligned cross-attention, speech autoregression is architecturally isolated from motion parameters. To make this explicit, the revised manuscript will include (1) direct comparisons of the base model versus DyaPlex on held-out speech-only zero-shot benchmarks (coherence, response quality) and (2) an ablation that removes the motion tower while keeping all other components fixed. These results will be reported in a new subsection of the experiments. revision: yes
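The isolation argument rests on information flowing one way: the motion tower reads speech hidden states as keys and values but never writes to them. A minimal single-head sketch of such a read-only, time-causal cross-attention follows. The function name, the timestamp-based mask, and the toy dimensions are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def causal_cross_attention(
    query: Sequence[float],
    q_time: float,
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    kv_times: Sequence[float],
) -> List[float]:
    """Attend from one motion query to speech states at or before its timestamp.

    Keys and values are only read, never modified, so the speech side is
    unaffected by anything computed here.
    """
    visible = [i for i, t in enumerate(kv_times) if t <= q_time]
    if not visible:
        raise ValueError("no speech state is available at or before q_time")
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in visible]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, visible)) for d in range(dim)]
```

Because the mask is defined on timestamps rather than indices, the same function works when the speech and motion streams run at different frame rates, which matches the streaming setting the rebuttal describes.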

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and architecture, not self-referential reduction

The provided abstract and description contain no equations, derivations, or parameter-fitting steps that reduce by construction to the inputs. The central claim—that the dual-tower Transformer with dyadic interleaving and time-aligned RoPE preserves zero-shot reasoning of a frozen base model while adding motion—is presented as an architectural outcome trained on the external 4,000-hour Seamless Interaction dataset and evaluated on monadic/dyadic benchmarks. No self-citation load-bearing premises, uniqueness theorems, ansatzes smuggled via citation, or fitted inputs renamed as predictions appear. The derivation chain is self-contained against external benchmarks, yielding a normal non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information is given on free parameters, background axioms, or new entities introduced by the model.

Reference graph

Works this paper leans on

41 extracted references · 7 canonical work pages · 2 internal anchors

[1]

Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset

Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D’Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, D...

2025
[2]

Ready-to-react: Online reaction policy for two-character interaction generation

Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Ready-to-react: Online reaction policy for two-character interaction generation. InICLR, 2025

2025
[3]

Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation

Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025

2025
[4]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Visual speech-aware perceptual 3d facial expression reconstruction from videos.arXiv preprint arXiv:2207.11094, 2022

Panagiotis P Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos.arXiv preprint arXiv:2207.11094, 2022

work page arXiv 2022
[6]

Remos: 3d motion-conditioned reaction synthesis for two-person interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Remos: 3d motion-conditioned reaction synthesis for two-person interactions. InEuropean Conference on Computer Vision (ECCV), 2024

2024
[7]

Hu- mans in 4d: Reconstructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4d: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14783–14794, October 2023. 10

2023
[8]

Learning speech-driven 3d conversational gestures from video

Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent virtual agents, pages 101–108, 2021

2021
[9]

Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

2023
[10]

Arflow: Human action-reaction flow matching with physical guidance.ArXiv, 2025

Wentao Jiang, Jingya Wang, Haotao Lu, Kaiyang Ji, Baoxiong Jia, Siyuan Huang, and Ye Shi. Arflow: Human action-reaction flow matching with physical guidance.ArXiv, 2025

2025
[11]

Panoptic studio: A massively multiview system for social interaction

Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Scott Godisart, Bart Nabbe, Iain Matthews, et al. Panoptic studio: A massively multiview system for social interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 2017

2017
[12]

Ross, and Angjoo Kanazawa

Ruilong Li, Sha Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021

2021
[13]

Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), 2017

2017
[14]

Intergen: Diffusion-based multi- human motion generation under complex interactions.International Journal of Computer Vision, 132(9): 3463–3483, 2024

Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi- human motion generation under complex interactions.International Journal of Computer Vision, 132(9): 3463–3483, 2024

2024
[15]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, 2022

2022
[16]

Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InCVPR, 2024

2024
[17]

Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling.arXiv preprint arXiv:2501.18898, 2025

Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling.arXiv preprint arXiv:2501.18898, 2025

work page arXiv 2025
[18]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Posegpt: Quantization-based 3d human motion generation and forecasting

Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Grégory Rogez. Posegpt: Quantization-based 3d human motion generation and forecasting. InEuropean Conference on Computer Vision, pages 417–435. Springer, 2022

2022
[20]

Synergy and synchrony in couple dances.arXiv preprint arXiv:2409.04440, 2024

V ongani H Maluleke, Lea MÃ¼ller, Jathushan Rajasegaran, Georgios Pavlakos, Shiry Ginosar, Angjoo Kanazawa, and Jitendra Malik. Synergy and synchrony in couple dances.arXiv preprint arXiv:2409.04440, 2024

work page arXiv 2024
[21]

Retrieving semantics from the deep: an rag solution for gesture synthesis

M Hamza Mughal, Rishabh Dabral, Merel CJ Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an rag solution for gesture synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16578–16588, 2025

2025
[22]

Hamza Mughal, Rishabh Dabral, Vera Demberg, and Christian Theobalt

M. Hamza Mughal, Rishabh Dabral, Vera Demberg, and Christian Theobalt. Miburi: Towards expressive interactive gesture synthesis. InCVPR, 2026

2026
[23]

Learning to listen: Modeling non-deterministic dyadic facial motion.CVPR, 2022

Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion.CVPR, 2022

2022
[24]

From audio to photoreal embodiment: Synthesizing humans in conversations

Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In CVPR, 2024

2024
[25]

Sarah: Spatially aware real-time agentic humans, 2026

Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, and Alexander Richard. Sarah: Spatially aware real-time agentic humans, 2026. URLhttps://arxiv.org/abs/2602.18432

work page arXiv 2026
[26]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019. 11

2019
[27]

Reconstructing hands in 3D with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. InCVPR, 2024

2024
[28]

Dyadit: A multi-modal diffusion transformer for socially favorable dyadic gesture generation

Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, and Kris Kitani. Dyadit: A multi-modal diffusion transformer for socially favorable dyadic gesture generation. InCVPR, 2026

2026
[29]

Dualtalk: Dual-speaker interaction for 3d talking head conversations

Ziqiao Peng, Yanbo Fan, Haoyu Wu, Xuan Wang, Hongyan Liu, Jun He, and Zhaoxin Fan. Dualtalk: Dual-speaker interaction for 3d talking head conversations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21055–21064, 2025

2025
[30]

Ope- nAI announcement, accessed 2026-05-18

Rajarshi Roy, Jonathan Raiman, Sang gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: V oice and role control for full duplex conversational speech models, 2026. URLhttps://arxiv.org/abs/2602.06053

work page arXiv 2026
[31]

Glu variants improve transformer, 2020

Noam Shazeer. Glu variants improve transformer, 2020

2020
[32]

Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 2024

2024
[33]

Motionclip: Exposing human motion generation to clip space

Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. InEuropean Conference on Computer Vision, pages 358–374. Springer, 2022

2022
[34]

Intercontrol: Zero-shot human interaction generation by controlling every joint.Advances in Neural Information Processing Systems, 37:105397– 105424, 2024

Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, and Bo Dai. Intercontrol: Zero-shot human interaction generation by controlling every joint.Advances in Neural Information Processing Systems, 37:105397– 105424, 2024

2024
[35]

Regennet: Towards human action-reaction synthesis

Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, and Wenjun Zeng. Regennet: Towards human action-reaction synthesis. InCVPR, 2024

2024
[36]

Generating holistic 3D human motion from speech

Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. Generating holistic 3D human motion from speech. InCVPR, 2023

2023
[37]

Speech gesture generation from the trimodal context of text, audio, and speaker identity.ACM TOG, page 1–16, 2020

Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Speech gesture generation from the trimodal context of text, audio, and speaker identity.ACM TOG, page 1–16, 2020

2020
[38]

T2m-gpt: Generating human motion from textual descriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[39]

Vibes: A conversational agent with behaviorally-intelligent 3d virtual body

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, and Ehsan Adeli. Vibes: A conversational agent with behaviorally-intelligent 3d virtual body. InCVPR, 2026

2026
[40]

On the continuity of rotation representa- tions in neural networks

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representa- tions in neural networks. InCVPR, 2019

2019
[41]

<system> You enjoy having a good conversation. <system>

Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, and Lequan Yu. Taming diffusion models for audio-driven co-speech gesture generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10544–10553, 2023. 12 Figure 5: Comparisons without (a) and with (b) time-aligned speech-motion RoPE. A Discussion A.1 Limi...

2023

[1] [1]

Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset

Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D’Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, D...

2025

[2] [2]

Ready-to-react: Online reaction policy for two-character interaction generation

Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Ready-to-react: Online reaction policy for two-character interaction generation. InICLR, 2025

2025

[3] [3]

Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation

Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025

2025

[4] [4]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Visual speech-aware perceptual 3d facial expression reconstruction from videos.arXiv preprint arXiv:2207.11094, 2022

Panagiotis P Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos.arXiv preprint arXiv:2207.11094, 2022

work page arXiv 2022

[6] [6]

Remos: 3d motion-conditioned reaction synthesis for two-person interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Remos: 3d motion-conditioned reaction synthesis for two-person interactions. InEuropean Conference on Computer Vision (ECCV), 2024

2024

[7] [7]

Hu- mans in 4d: Reconstructing and tracking humans with transformers

Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4d: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14783–14794, October 2023. 10

2023

[8] [8]

Learning speech-driven 3d conversational gestures from video

Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent virtual agents, pages 101–108, 2021

2021

[9] [9]

Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

2023

[10] [10]

Arflow: Human action-reaction flow matching with physical guidance.ArXiv, 2025

Wentao Jiang, Jingya Wang, Haotao Lu, Kaiyang Ji, Baoxiong Jia, Siyuan Huang, and Ye Shi. Arflow: Human action-reaction flow matching with physical guidance.ArXiv, 2025

2025

[11] [11]

Panoptic studio: A massively multiview system for social interaction

Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Scott Godisart, Bart Nabbe, Iain Matthews, et al. Panoptic studio: A massively multiview system for social interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 2017

2017

[12] [12]

Ross, and Angjoo Kanazawa

Ruilong Li, Sha Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021

2021

[13] [13]

Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), 2017

2017

[14] [14]

Intergen: Diffusion-based multi- human motion generation under complex interactions.International Journal of Computer Vision, 132(9): 3463–3483, 2024

Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi- human motion generation under complex interactions.International Journal of Computer Vision, 132(9): 3463–3483, 2024

2024

[15] [15]

Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, 2022

2022

[16] [16]

Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InCVPR, 2024

2024

[17] [17]

Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling.arXiv preprint arXiv:2501.18898, 2025

Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling.arXiv preprint arXiv:2501.18898, 2025

work page arXiv 2025

[18] [18]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Posegpt: Quantization-based 3d human motion generation and forecasting

Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Grégory Rogez. Posegpt: Quantization-based 3d human motion generation and forecasting. InEuropean Conference on Computer Vision, pages 417–435. Springer, 2022

2022

[20] [20]

Synergy and synchrony in couple dances.arXiv preprint arXiv:2409.04440, 2024

V ongani H Maluleke, Lea MÃ¼ller, Jathushan Rajasegaran, Georgios Pavlakos, Shiry Ginosar, Angjoo Kanazawa, and Jitendra Malik. Synergy and synchrony in couple dances.arXiv preprint arXiv:2409.04440, 2024

work page arXiv 2024

[21] [21]

Retrieving semantics from the deep: an rag solution for gesture synthesis

M Hamza Mughal, Rishabh Dabral, Merel CJ Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an rag solution for gesture synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16578–16588, 2025

2025

[22] [22]

Miburi: Towards expressive interactive gesture synthesis

M. Hamza Mughal, Rishabh Dabral, Vera Demberg, and Christian Theobalt. Miburi: Towards expressive interactive gesture synthesis. In CVPR, 2026

2026

[23] [23]

Learning to listen: Modeling non-deterministic dyadic facial motion

Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion. CVPR, 2022

2022

[24] [24]

From audio to photoreal embodiment: Synthesizing humans in conversations

Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In CVPR, 2024

2024

[25] [25]

Sarah: Spatially aware real-time agentic humans

Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, and Alexander Richard. Sarah: Spatially aware real-time agentic humans, 2026. URL https://arxiv.org/abs/2602.18432

2026

[26] [26]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.

2019

[27] [27]

Reconstructing hands in 3D with transformers

Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024

2024

[28] [28]

Dyadit: A multi-modal diffusion transformer for socially favorable dyadic gesture generation

Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, and Kris Kitani. Dyadit: A multi-modal diffusion transformer for socially favorable dyadic gesture generation. In CVPR, 2026

2026

[29] [29]

Dualtalk: Dual-speaker interaction for 3d talking head conversations

Ziqiao Peng, Yanbo Fan, Haoyu Wu, Xuan Wang, Hongyan Liu, Jun He, and Zhaoxin Fan. Dualtalk: Dual-speaker interaction for 3d talking head conversations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21055–21064, 2025

2025

[30] [30]

Personaplex: Voice and role control for full duplex conversational speech models

Rajarshi Roy, Jonathan Raiman, Sang gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: Voice and role control for full duplex conversational speech models, 2026. URL https://arxiv.org/abs/2602.06053

2026

[31] [31]

Glu variants improve transformer

Noam Shazeer. Glu variants improve transformer, 2020

2020

[32] [32]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 2024

2024

[33] [33]

Motionclip: Exposing human motion generation to clip space

Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In European Conference on Computer Vision, pages 358–374. Springer, 2022

2022

[34] [34]

Intercontrol: Zero-shot human interaction generation by controlling every joint

Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, and Bo Dai. Intercontrol: Zero-shot human interaction generation by controlling every joint. Advances in Neural Information Processing Systems, 37:105397–105424, 2024

2024

[35] [35]

Regennet: Towards human action-reaction synthesis

Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, and Wenjun Zeng. Regennet: Towards human action-reaction synthesis. In CVPR, 2024

2024

[36] [36]

Generating holistic 3D human motion from speech

Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. Generating holistic 3D human motion from speech. In CVPR, 2023

2023

[37] [37]

Speech gesture generation from the trimodal context of text, audio, and speaker identity

Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM TOG, page 1–16, 2020

2020

[38] [38]

T2m-gpt: Generating human motion from textual descriptions with discrete representations

Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[39] [39]

Vibes: A conversational agent with behaviorally-intelligent 3d virtual body

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, and Ehsan Adeli. Vibes: A conversational agent with behaviorally-intelligent 3d virtual body. In CVPR, 2026

2026

[40] [40]

On the continuity of rotation representations in neural networks

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In CVPR, 2019

2019

[41] [41]

Taming diffusion models for audio-driven co-speech gesture generation

Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, and Lequan Yu. Taming diffusion models for audio-driven co-speech gesture generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10544–10553, 2023.

2023

Figure 5: Comparisons without (a) and with (b) time-aligned speech-motion RoPE.

A Discussion

A.1 Limi...