pith. sign in

arxiv: 2606.03874 · v1 · pith:XDVQVNZ7new · submitted 2026-06-02 · 💻 cs.CV · cs.RO

DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Pith reviewed 2026-06-28 10:20 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords full-duplex modeldyadic interactionspeech and motiondual-tower transformertoken interleavingRoPE positional encodinghuman interactionstreaming generation
0
0 comments X

The pith

DyaPlex adds a streaming motion pathway to a frozen full-duplex speech model via dual-tower Transformers for synchronized dyadic interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DyaPlex is designed as a full-duplex model that can both listen and speak while also generating physical motions in real time during conversations with another person. The core idea is to take an existing speech model that already handles full-duplex conversations and attach a separate but coupled motion generation system without breaking the original model's abilities. This is done through a dual-tower setup where tokens for speech and motion are interleaved and aligned in time using a special positional encoding. The result is a model trained on thousands of hours of data that reportedly outperforms previous approaches on tasks involving single or paired human interactions.

Core claim

Our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features.

What carries the argument

dual-tower Transformer architecture with unified dyadic token interleaving and time-aligned speech-motion RoPE added to a frozen base speech model

Load-bearing premise

Adding the dual-tower Transformer and motion pathway to the frozen base speech model does not degrade its zero-shot conversational reasoning abilities.

What would settle it

Measuring the base speech model's performance on zero-shot conversational tasks before and after integrating the motion pathway; a drop would indicate the assumption fails.

Figures

Figures reproduced from arXiv: 2606.03874 by Amrita Mazumdar, Christian Jacobsen, Hongyu Liu, Ka Chun Cheung, Koki Nagano, Michael Stengel, Rajarshi Roy, Seonwook Park, Shalini De Mello, Shengze Wang, Simon See, Tianye Li.

Figure 1
Figure 1. Figure 1: DyaPlex is a causal, full-duplex speech-motion model that simultaneously speaks and listens to a partner while perceiving a partner motion and generating agent’s motion. Our model could be applied to applications, such as, dyadic interactions with a human user and agent/robot (right) as well as generating synthetic speech-motion dyadic interaction data. Abstract We present DyaPlex, a streaming, full-duplex… view at source ↗
Figure 2
Figure 2. Figure 2: Architecture overview. DyaPlex consists of three components: (a) part-aware RVQ-VAE decoders and (b) a frozen speech tower, and a trainable motion tower. The speech tower (PersonaPlex) takes in dyadic speech, emits agent speech autoregressively, and exposes its per-layer residual-stream hidden states {Hℓ} 32 ℓ=1. For training, we precompute {Hℓ} once (Sec. 4.1) to serve as cross-attention keys and values f… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison on Seamless test clips. The agent (speaker [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: In this example the pair reads a story in a turn taking fashion and they use body languages [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparisons without (a) and with (b) time-aligned speech-motion RoPE. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Interactive interface for the user study. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents DyaPlex, a streaming full-duplex speech-and-motion model for dyadic interaction. It builds a dual-tower Transformer on a frozen foundational full-duplex speech model, using unified dyadic token interleaving and time-aligned speech-motion RoPE to align autoregressive motions with speech features. Trained on the 4,000-hour Seamless Interaction dataset, the model is claimed to capture cross-speaker dependencies while preserving the base model's zero-shot conversational reasoning and achieving new state-of-the-art results on both monadic and dyadic human interaction benchmarks.

Significance. If the central architectural claim holds—that a motion pathway can be grafted onto a frozen speech model without degrading its zero-shot capabilities—the work would advance streaming multi-modal agents for synchronized speech-motion dyadic interaction. The 4000-hour training scale and dual-tower design could provide a template for extending speech priors to motion, but the absence of supporting metrics makes the significance currently unassessable.

major comments (1)
  1. [Abstract] Abstract: The claim that the dual-tower Transformer 'preserves the zero-shot conversational reasoning of a frozen base speech model' is load-bearing for attributing any benchmark gains to the proposed architecture rather than interference or degradation. No pre/post integration results on speech-only tasks (e.g., coherence, response quality, or zero-shot benchmarks), ablation studies removing the motion pathway, or quantitative verification of preserved autoregressive generation are supplied.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of verifying that the frozen base model’s zero-shot capabilities are preserved. We address this point directly below and will strengthen the manuscript with additional evidence.

read point-by-point responses
  1. Referee: The claim that the dual-tower Transformer 'preserves the zero-shot conversational reasoning of a frozen base speech model' is load-bearing for attributing any benchmark gains to the proposed architecture rather than interference or degradation. No pre/post integration results on speech-only tasks (e.g., coherence, response quality, or zero-shot benchmarks), ablation studies removing the motion pathway, or quantitative verification of preserved autoregressive generation are supplied.

    Authors: We agree that quantitative verification is necessary to substantiate the claim. Because the speech tower is kept frozen and the motion pathway operates through a separate tower with time-aligned cross-attention, speech autoregression is architecturally isolated from motion parameters. To make this explicit, the revised manuscript will include (1) direct comparisons of the base model versus DyaPlex on held-out speech-only zero-shot benchmarks (coherence, response quality) and (2) an ablation that removes the motion tower while keeping all other components fixed. These results will be reported in a new subsection of the experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and architecture, not self-referential reduction

full rationale

The provided abstract and description contain no equations, derivations, or parameter-fitting steps that reduce by construction to the inputs. The central claim—that the dual-tower Transformer with dyadic interleaving and time-aligned RoPE preserves zero-shot reasoning of a frozen base model while adding motion—is presented as an architectural outcome trained on the external 4,000-hour Seamless Interaction dataset and evaluated on monadic/dyadic benchmarks. No self-citation load-bearing premises, uniqueness theorems, ansatzes smuggled via citation, or fitted inputs renamed as predictions appear. The derivation chain is self-contained against external benchmarks, yielding a normal non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information is given on free parameters, background axioms, or new entities introduced by the model.

pith-pipeline@v0.9.1-grok · 5748 in / 1115 out tokens · 35170 ms · 2026-06-28T10:20:34.389792+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset

    Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D’Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, D...

  2. [2]

    Ready-to-react: Online reaction policy for two-character interaction generation

    Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, and Ruizhen Hu. Ready-to-react: Online reaction policy for two-character interaction generation. InICLR, 2025

  3. [3]

    Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation

    Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025

  4. [4]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

  5. [5]

    Visual speech-aware perceptual 3d facial expression reconstruction from videos.arXiv preprint arXiv:2207.11094, 2022

    Panagiotis P Filntisis, George Retsinas, Foivos Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, and Petros Maragos. Visual speech-aware perceptual 3d facial expression reconstruction from videos.arXiv preprint arXiv:2207.11094, 2022

  6. [6]

    Remos: 3d motion-conditioned reaction synthesis for two-person interactions

    Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Remos: 3d motion-conditioned reaction synthesis for two-person interactions. InEuropean Conference on Computer Vision (ECCV), 2024

  7. [7]

    Hu- mans in 4d: Reconstructing and tracking humans with transformers

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4d: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14783–14794, October 2023. 10

  8. [8]

    Learning speech-driven 3d conversational gestures from video

    Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, and Christian Theobalt. Learning speech-driven 3d conversational gestures from video. InProceedings of the 21st ACM international conference on intelligent virtual agents, pages 101–108, 2021

  9. [9]

    Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language.Advances in Neural Information Processing Systems, 36:20067–20079, 2023

  10. [10]

    Arflow: Human action-reaction flow matching with physical guidance.ArXiv, 2025

    Wentao Jiang, Jingya Wang, Haotao Lu, Kaiyang Ji, Baoxiong Jia, Siyuan Huang, and Ye Shi. Arflow: Human action-reaction flow matching with physical guidance.ArXiv, 2025

  11. [11]

    Panoptic studio: A massively multiview system for social interaction

    Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Scott Godisart, Bart Nabbe, Iain Matthews, et al. Panoptic studio: A massively multiview system for social interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 2017

  12. [12]

    Ross, and Angjoo Kanazawa

    Ruilong Li, Sha Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. InICCV, 2021

  13. [13]

    Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), 2017

  14. [14]

    Intergen: Diffusion-based multi- human motion generation under complex interactions.International Journal of Computer Vision, 132(9): 3463–3483, 2024

    Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi- human motion generation under complex interactions.International Journal of Computer Vision, 132(9): 3463–3483, 2024

  15. [15]

    Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InECCV, 2022

  16. [16]

    Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling

    Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InCVPR, 2024

  17. [17]

    Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling.arXiv preprint arXiv:2501.18898, 2025

    Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, and Chenliang Xu. Gesturelsm: Latent shortcut based co-speech gesture generation with spatial-temporal modeling.arXiv preprint arXiv:2501.18898, 2025

  18. [18]

    Decoupled Weight Decay Regularization

    I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  19. [19]

    Posegpt: Quantization-based 3d human motion generation and forecasting

    Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Grégory Rogez. Posegpt: Quantization-based 3d human motion generation and forecasting. InEuropean Conference on Computer Vision, pages 417–435. Springer, 2022

  20. [20]

    Synergy and synchrony in couple dances.arXiv preprint arXiv:2409.04440, 2024

    V ongani H Maluleke, Lea Müller, Jathushan Rajasegaran, Georgios Pavlakos, Shiry Ginosar, Angjoo Kanazawa, and Jitendra Malik. Synergy and synchrony in couple dances.arXiv preprint arXiv:2409.04440, 2024

  21. [21]

    Retrieving semantics from the deep: an rag solution for gesture synthesis

    M Hamza Mughal, Rishabh Dabral, Merel CJ Scholman, Vera Demberg, and Christian Theobalt. Retrieving semantics from the deep: an rag solution for gesture synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16578–16588, 2025

  22. [22]

    Hamza Mughal, Rishabh Dabral, Vera Demberg, and Christian Theobalt

    M. Hamza Mughal, Rishabh Dabral, Vera Demberg, and Christian Theobalt. Miburi: Towards expressive interactive gesture synthesis. InCVPR, 2026

  23. [23]

    Learning to listen: Modeling non-deterministic dyadic facial motion.CVPR, 2022

    Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion.CVPR, 2022

  24. [24]

    From audio to photoreal embodiment: Synthesizing humans in conversations

    Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, and Alexander Richard. From audio to photoreal embodiment: Synthesizing humans in conversations. In CVPR, 2024

  25. [25]

    Sarah: Spatially aware real-time agentic humans, 2026

    Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, and Alexander Richard. Sarah: Spatially aware real-time agentic humans, 2026. URLhttps://arxiv.org/abs/2602.18432

  26. [26]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019. 11

  27. [27]

    Reconstructing hands in 3D with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. InCVPR, 2024

  28. [28]

    Dyadit: A multi-modal diffusion transformer for socially favorable dyadic gesture generation

    Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, and Kris Kitani. Dyadit: A multi-modal diffusion transformer for socially favorable dyadic gesture generation. InCVPR, 2026

  29. [29]

    Dualtalk: Dual-speaker interaction for 3d talking head conversations

    Ziqiao Peng, Yanbo Fan, Haoyu Wu, Xuan Wang, Hongyan Liu, Jun He, and Zhaoxin Fan. Dualtalk: Dual-speaker interaction for 3d talking head conversations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21055–21064, 2025

  30. [30]

    Ope- nAI announcement, accessed 2026-05-18

    Rajarshi Roy, Jonathan Raiman, Sang gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: V oice and role control for full duplex conversational speech models, 2026. URLhttps://arxiv.org/abs/2602.06053

  31. [31]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020

  32. [32]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomput., 2024

  33. [33]

    Motionclip: Exposing human motion generation to clip space

    Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. InEuropean Conference on Computer Vision, pages 358–374. Springer, 2022

  34. [34]

    Intercontrol: Zero-shot human interaction generation by controlling every joint.Advances in Neural Information Processing Systems, 37:105397– 105424, 2024

    Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, and Bo Dai. Intercontrol: Zero-shot human interaction generation by controlling every joint.Advances in Neural Information Processing Systems, 37:105397– 105424, 2024

  35. [35]

    Regennet: Towards human action-reaction synthesis

    Liang Xu, Yizhou Zhou, Yichao Yan, Xin Jin, Wenhan Zhu, Fengyun Rao, Xiaokang Yang, and Wenjun Zeng. Regennet: Towards human action-reaction synthesis. InCVPR, 2024

  36. [36]

    Generating holistic 3D human motion from speech

    Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, and Michael J Black. Generating holistic 3D human motion from speech. InCVPR, 2023

  37. [37]

    Speech gesture generation from the trimodal context of text, audio, and speaker identity.ACM TOG, page 1–16, 2020

    Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Speech gesture generation from the trimodal context of text, audio, and speaker identity.ACM TOG, page 1–16, 2020

  38. [38]

    T2m-gpt: Generating human motion from textual descriptions with discrete representations

    Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2m-gpt: Generating human motion from textual descriptions with discrete representations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  39. [39]

    Vibes: A conversational agent with behaviorally-intelligent 3d virtual body

    Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, and Ehsan Adeli. Vibes: A conversational agent with behaviorally-intelligent 3d virtual body. InCVPR, 2026

  40. [40]

    On the continuity of rotation representa- tions in neural networks

    Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representa- tions in neural networks. InCVPR, 2019

  41. [41]

    <system> You enjoy having a good conversation. <system>

    Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, and Lequan Yu. Taming diffusion models for audio-driven co-speech gesture generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10544–10553, 2023. 12 Figure 5: Comparisons without (a) and with (b) time-aligned speech-motion RoPE. A Discussion A.1 Limi...