arxiv: 2606.11167 · v1 · pith:ZVWGBEFFnew · submitted 2026-06-09 · 💻 cs.CL · eess.AS

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Pith reviewed 2026-06-27 12:51 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords full-duplex speech · spoken dialogue models · reinforcement learning alignment · turn-taking · backchanneling · pause handling · user interruption · post-training alignment

The pith

Post-training RL with axis-specific rewards from human audio segments aligns full-duplex speech models on pause handling, turn-taking, backchanneling, and user interruption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that supervised training alone leaves full-duplex spoken dialogue models with poor timing behaviors such as excessive silence and mistimed responses. It shows that a follow-up reinforcement learning stage can target four specific interactivity axes by pulling short clips from real human conversations and scoring the model against axis-tailored reward functions. An added LLM-based quality reward keeps the semantic content of responses from degrading during the alignment. The method is demonstrated on two existing open models and measured both on pre-recorded audio and in live multi-turn conversations.

Core claim

We propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

What carries the argument

Axis-specific RL reward functions derived from short human-conversation audio segments, combined with an auxiliary LLM quality reward, applied as post-training alignment.
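As a rough illustration of how an axis reward and a quality reward might be combined, here is a minimal Python sketch. The functional forms are not given in the text above, so everything here is an assumption: `pause_handling_reward` is a toy stand-in for one axis (share of frames in which the agent stays silent during a user pause), and `quality_weight` is a hypothetical hyperparameter, not a value from the paper.

```python
from typing import Sequence

# The four interactivity axes named in the paper.
AXES = ("pause_handling", "turn_taking", "backchanneling", "user_interruption")


def pause_handling_reward(agent_speaking: Sequence[bool]) -> float:
    """Toy axis reward (hypothetical): share of frames in which the agent
    stays silent while the user is pausing mid-turn."""
    if not agent_speaking:
        return 0.0
    silent_frames = sum(1 for speaking in agent_speaking if not speaking)
    return silent_frames / len(agent_speaking)


def combined_reward(axis: str, axis_reward: float, quality_reward: float,
                    quality_weight: float = 0.5) -> float:
    """Add a weighted quality term to the axis reward.

    quality_weight is an assumed hyperparameter, not a value from the paper.
    """
    if axis not in AXES:
        raise ValueError(f"unknown axis: {axis}")
    return axis_reward + quality_weight * quality_reward


if __name__ == "__main__":
    # Per-frame agent voice activity during a user pause (made-up data).
    frames = [False, False, True, False]
    r_axis = pause_handling_reward(frames)
    print(combined_reward("pause_handling", r_axis, quality_reward=0.5))
```

A weighted sum is only one plausible way to mix the two signals; the paper may normalize or gate them differently.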

If this is right

  • Both Moshi and PersonaPlex exhibit consistent gains across the four axes under offline pre-recorded audio evaluation.
  • The same gains appear in live multi-turn dialogue sessions.
  • The LLM quality reward keeps response semantics from degrading during alignment.
  • The approach works as a post-training step rather than requiring full retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same segment-extraction and reward design could be reused on any future full-duplex model without redesigning the underlying architecture.
  • Extending the four axes to include additional behaviors such as laughter timing or topic-shift handling would require only new segment sets and reward definitions.
  • If the method scales, voice interfaces could move from turn-based to truly overlapping conversation without separate timing modules.

Load-bearing premise

Short audio segments extracted from human corpora together with their axis-specific reward functions will produce measurable interactivity gains in live multi-turn settings without introducing new failure modes that the added LLM quality reward cannot fully mitigate.

What would settle it

A controlled real-time multi-turn dialogue test in which the aligned model shows no statistically significant improvement, or a decline, on at least one of the four interactivity axes relative to the unaligned base model.
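One way such a test could be scored is a one-sided paired permutation (sign-flip) test on per-dialogue scores for a single axis. The sketch below is illustrative only: the function name, the choice of test, and the idea of pairing base and aligned scores by dialogue are assumptions, not the paper's evaluation protocol.

```python
import random
from statistics import mean
from typing import Sequence


def paired_permutation_pvalue(base: Sequence[float], aligned: Sequence[float],
                              n_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for 'aligned scores exceed base scores'.

    Under the null hypothesis the sign of each paired difference is
    exchangeable, so we flip signs at random and count how often the
    resampled mean difference reaches the observed one.
    """
    if len(base) != len(aligned) or not base:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(aligned, base)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if resampled >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

Running this once per axis would show whether any axis fails to improve; with four axes, a multiple-comparison correction would also be reasonable.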

Figures

Figures reproduced from arXiv: 2606.11167 by Alexandre Défossez, Atsumoto Ohashi, Eugene Kharitonov, Neil Zeghidour.

Figure 1: Overview of the proposed method. We first extract segments related to each interactivity axis…

Figure 2: Example of a conversation between GPT-Realtime (Examiner) and Moshi + RL in the Daily task of Full…

Figure 3: System prompt used to compute the LLM Judge reward.
Adjacent appendix text (A, Training Details): During generation, we use sampling temperatures of 0.7 and 0.8 for text and audio tokens, respectively, with top-k = 250 for audio tokens. At each of the 100 training epochs, we sample 32 segments (i.e., groups), each of which yields G = 16 completions. Training is distributed across 32 NVIDIA H100 GPUs using Fully Sharded Data Paralle…

Figure 4: Example of a conversation between GPT-Realtime (Examiner) and Moshi in the Daily task of Full-Duplex-Bench v2 (Turn-taking fluency score: 2.75, Instruction-following score: 2.50).
Adjacent text: all models using UTMOSv2 (Baba et al., 2024), a neural mean opinion score (MOS) predictor. We compute the scores on the agent's speech segments extracted from the Turn-Taking and User Interruption scenarios of Full-Duplex-Bench v…

Figure 5: Dialogue examples from the Safety task of Full-Duplex-Bench v2 between GPT-Realtime and (a) the base…
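The training details quoted under Figure 3 mention 32 segments ("groups") per epoch, each yielding G = 16 completions. That layout suggests a group-relative policy-gradient scheme, although the excerpt does not name the algorithm. The sketch below assumes GRPO-style normalization, where each completion's reward is standardized within its own group; the function name and the `eps` constant are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence

GROUPS_PER_EPOCH = 32  # segments sampled per epoch (from the quoted text)
G = 16                 # completions per segment (from the quoted text)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of completions.

    Assumes GRPO-style normalization: (r - group mean) / (group std + eps).
    eps is an illustrative constant that avoids division by zero when all
    completions in a group receive the same reward.
    """
    if not rewards:
        raise ValueError("group must contain at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up rewards for one group of G completions.
    group = [0.1 * i for i in range(G)]
    advantages = group_relative_advantages(group)
    print(len(advantages), round(sum(advantages), 6))
```

Under this assumption, one epoch would score 32 × 16 = 512 completions, and a completion is reinforced only if it beats the other completions for the same segment.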
Original abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a post-training RL alignment procedure for full-duplex spoken dialogue models that targets four interactivity axes (pause handling, turn-taking, backchanneling, user interruption) by extracting short audio segments from human corpora, applying axis-specific reward functions, and adding an LLM-based response-quality reward to avoid semantic degradation. The method is applied to Moshi and PersonaPlex and evaluated on both offline pre-recorded audio and real-time multi-turn dialogue, with claims of consistent interactivity improvements.

Significance. If the empirical gains are shown to be robust and non-circular, the work would provide a concrete, multi-axis RL recipe that moves beyond token-level supervised training for full-duplex models, offering a practical template for improving natural turn-taking behaviors in open-source systems.

major comments (3)
  1. [Abstract and Evaluation] The central empirical claim rests on short-segment axis rewards generalizing to live multi-turn full-duplex dynamics, yet the manuscript supplies no analysis, ablation, or failure-mode study of cross-axis timing interactions (e.g., whether improved backchanneling disrupts turn-taking). This is load-bearing for the generalization asserted in the abstract and evaluation sections.
  2. [Method] No explicit definitions, functional forms, or hyperparameter values for the four axis-specific reward functions or the LLM quality reward appear in the provided abstract or summary; without these, it is impossible to verify that the reported gains are not artifacts of the reward construction itself.
  3. [Evaluation] The offline and real-time evaluations are described only qualitatively in the abstract; the absence of quantitative metrics, baselines, or statistical significance tests undermines the claim of “consistent improvements” on both Moshi and PersonaPlex.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., a delta on an interactivity metric) to allow readers to gauge effect size without reading the full text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the manuscript's presentation can be strengthened to better support the claims. We address each point below and will revise accordingly.

  1. Referee: [Abstract and Evaluation] The central empirical claim rests on short-segment axis rewards generalizing to live multi-turn full-duplex dynamics, yet the manuscript supplies no analysis, ablation, or failure-mode study of cross-axis timing interactions (e.g., whether improved backchanneling disrupts turn-taking). This is load-bearing for the generalization asserted in the abstract and evaluation sections.

    Authors: We agree that an explicit analysis of cross-axis timing interactions would strengthen the generalization argument. While the real-time multi-turn evaluations require the model to handle all axes simultaneously and show no obvious degradation, we did not include a dedicated ablation or failure-mode study. We will add this analysis in the revised manuscript. revision: yes

  2. Referee: [Method] No explicit definitions, functional forms, or hyperparameter values for the four axis-specific reward functions or the LLM quality reward appear in the provided abstract or summary; without these, it is impossible to verify that the reported gains are not artifacts of the reward construction itself.

    Authors: The full manuscript defines the axis-specific reward functions and the LLM quality reward (including functional forms and hyperparameters) in Section 3 and the appendix. The referee summary references only the abstract, which is necessarily high-level. We will add a brief summary of the reward definitions to the abstract in revision. revision: yes

  3. Referee: [Evaluation] The offline and real-time evaluations are described only qualitatively in the abstract; the absence of quantitative metrics, baselines, or statistical significance tests undermines the claim of “consistent improvements” on both Moshi and PersonaPlex.

    Authors: The abstract is qualitative due to length limits, but the full manuscript reports quantitative metrics, baselines, and significance tests in the Evaluation section for both models. We will incorporate key quantitative highlights and significance statements into the abstract during revision. revision: yes

Circularity Check

0 steps flagged

Empirical RL alignment uses external rewards; no circularity

The paper presents an empirical post-training RL procedure that extracts short audio segments from human conversation corpora to define axis-specific rewards (pause handling, turn-taking, backchanneling, user interruption) plus an LLM-based quality reward. These inputs are independent of the trained model, and improvements are measured on separate offline and real-time evaluations. No equations, derivations, or self-citations reduce the claimed gains to the training objective by construction; the central claim remains an externally falsifiable empirical result rather than a self-definitional or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete information on fitted parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5725 in / 1105 out tokens · 23336 ms · 2026-06-27T12:51:03.979438+00:00 · methodology

Reference graph

Works this paper leans on

57 extracted references · 40 canonical work pages · 16 internal anchors

  1. [1]

    Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D'Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, and 65 others. 2025. https://doi.org/10.48550/arXiv.2506.22554 Seamless I...

  2. [2]

    Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, and Shinji Watanabe. 2026a. https://doi.org/10.48550/arXiv.2504.08528 On The Landscape of Spoken Language Models: A Comprehensive Survey. arXiv preprint arXiv:2504.08528

  3. [3]

    Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, and Shinji Watanabe. 2024. https://openreview.net/forum?id=2e4ECh0ikn Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics. In Proceedings of the Thirteenth International Conference on Learning Representations

  4. [4]

    Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, and Shinji Watanabe. 2026b. https://doi.org/10.48550/arXiv.2601.19063 Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback. arXiv preprint arXiv:2601.19063

  5. [5]

    Kaito Baba, Wataru Nakata, Yuki Saito, and Hiroshi Saruwatari. 2024. https://doi.org/10.1109/SLT61566.2024.10832315 The T05 System for the VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech. In Proceedings of the 2024 IEEE Spoken Language Technology Workshop, pages 818–824

  6. [6]

    Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, and James Glass. 2026. https://ieeexplore.ieee.org/abstract/document/11464000/ Game-time: Evaluating temporal dynamics in spoken language models. In Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Pro...

  7. [7]

    Chen Chen, Ke Hu, Chao-Han Huck Yang, Ankita Pasad, Edresson Casanova, Weiqing Wang, Szu-Wei Fu, Jason Li, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg. 2025a. https://openreview.net/forum?id=QbLbXz8Idp#discussion Reinforcement Learning Enhanced Full-Duplex Spoken Dialogue Language Models for Conversational Interactions. In Proceedings of the Seco...

  8. [8]

    Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, and 17 others. 2025b. https://doi.org/10.48550/arXiv.2501.06282 MinMo: A Multimodal Large Language Model for Seamless Voice Intera...

  9. [9]

    Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, and Zhou Zhao. 2026a. https://doi.org/10.48550/arXiv.2604.14932 WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training. arXiv preprint arXiv:2604.14932

  10. [10]

    Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, and Zhou Zhao. 2026b. https://doi.org/10.48550/arXiv.2604.14920 Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models. arXiv preprint arXiv:2604.14920

  11. [11]

    Christopher Cieri, David Miller, and Kevin Walker. 2004. https://aclanthology.org/L04-1500/ The Fisher Corpus: A Resource for the Next Generations of Speech-to-Text. In Proceedings of the Fourth International Conference on Language Resources and Evaluation

  12. [12]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, and 3416 others. 2025. https://doi.org/10.48550/arXiv.2507.06261 Gemini...

  13. [13]

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. https://doi.org/10.48550/arXiv.2410.00037 Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

  14. [14]

    Erik Ekstedt and Gabriel Skantze. 2022. https://doi.org/10.21437/Interspeech.2022-10955 Voice Activity Projection: Self-supervised Learning of Turn-taking Events. In Proceedings of Interspeech 2022, pages 5190–5194

  15. [15]

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2024. https://openreview.net/forum?id=PYmrUQmMEw LLaMA-Omni: Seamless Speech Interaction with Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations

  16. [16]

    Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, Haoyu Cao, Di Yin, Long Ma, Xiawu Zheng, Rongrong Ji, Yunsheng Wu, Ran He, Caifeng Shan, and Xing Sun. 2025. https://doi.org/10.48550/arXiv.2408.05211 VITA: Towards Open-Source Interactive Omni Multimodal LLM. arXiv preprint arXiv:2...

  17. [17]

    Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, and Jingbo Zhu. 2025. https://doi.org/10.48550/arXiv.2509.22243 FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction. arXiv preprint arXiv:2509.22243

  18. [18]

    Mattias Heldner and Jens Edlund. 2010. https://doi.org/10.1016/j.wocn.2010.08.002 Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38(4):555–568

  19. [19]

    Chi-Yuan Hsiao, Ke-Han Lu, Yu-Kuan Fu, Guan-Ting Lin, Hsiao-Tsung Hung, and Hung-yi Lee. 2026. https://doi.org/10.48550/arXiv.2604.10065 ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models. arXiv preprint arXiv:2604.10065

  20. [20]

    Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, and Boris Ginsburg. 2025. https://doi.org/10.21437/Interspeech.2025-874 Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model. In Proceedings of Interspeech 2025, pages 2715–2719

  21. [21]

    Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, and Zhou Zhao. 2024. https://doi.org/10.48550/arXiv.2411.13577 WavChat: A Survey of Spoken Dialogue Models. arXiv preprint arXiv:2411.13577

  22. [22]

    Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, and Zhou Zhao. 2025. https://doi.org/10.48550/arXiv.2505.09558 WavReward: Spoken Dialogue Models With Generalist Reward Evaluators. arXiv preprint arXiv:2505.09558

  23. [23]

    Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, and Di Zhang. 2025. https://doi.org/10.48550/arXiv.2502.13472 FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems. arXiv preprint arXiv:2502.13472

  24. [24]

    Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung-yi Lee. 2026. https://doi.org/10.48550/arXiv.2510.07838 Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner. arXiv preprint arXiv:2510.07838

  25. [25]

    Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung-yi Lee. 2025a. https://doi.org/10.48550/arXiv.2503.04721 Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities. arXiv preprint arXiv:2503.04721

  26. [26]

    Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, and Ivan Bulyko. 2025b. https://doi.org/10.18653/v1/2025.acl-long.997 Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pag...

  27. [27]

    Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. 2026. https://doi.org/10.48550/arXiv.2601.05242 GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization. arXiv preprint arXiv:2...

  28. [28]

    Ilya Loshchilov and Frank Hutter. 2018. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled Weight Decay Regularization. In Proceedings of the Sixth International Conference on Learning Representations

  29. [29]

    Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed, and Emmanuel Dupoux. 2023. https://doi.org/10.1162/tacl_a_00545 Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics, 11:250–266

  30. [30]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and 262 others. 2024. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 Te...

  31. [31]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. https://openreview.net/forum?id=TG8KACxEON Training language ...

  32. [32]

    Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, and Eng Siong Chng. 2025. https://doi.org/10.21437/Interspeech.2025-739 FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems. In Interspeech 2025, pages 176–180

  33. [33]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. https://openreview.net/forum?id=HPuSIXJaa9 Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems

  34. [34]

    Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. 2026. https://doi.org/10.48550/arXiv.2602.06053 PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models. arXiv preprint arXiv:2602.06053

  35. [35]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://doi.org/10.48550/arXiv.1707.06347 Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347

  36. [36]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://doi.org/10.48550/arXiv.2402.03300 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300

  37. [37]

    Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, and Zhiting Hu. 2025. https://doi.org/10.48550/arXiv.2505.02707 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play. arXiv preprint arXiv:2505.02707

  38. [38]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html Learning to summarize with human feedback. In Proceedings of the Thirty-fourth Advances in Neural Information Processing Sy...

  39. [39]

    Silero Team. 2024. https://github.com/snakers4/silero-vad Silero VAD: Pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier

  40. [40]

    Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, and Shyamnath Gollakota. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1192 Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21390–21402

  41. [41]

    Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, and Yuanjun Xiong. 2024. https://openreview.net/forum?id=YawXY6mWiK&referrer= In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems

  42. [42]

    Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, and Peilin Zhao. 2025a. https://openreview.net/forum?id=5Gke1dfRVA NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction. In Proceedings of the Forty-second International Conference on Machine Learning

  43. [43]

    Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. 2025b. https://openreview.net/forum?id=s1EImzs5Id Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM. In Proceedings of the Forty-second International Conference on Machine Learning

  44. [44]

    Ronald J. Williams. 1992. https://doi.org/10.1007/BF00992696 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256

  45. [45]

    Anne Wu, Laurent Mazaré, Neil Zeghidour, and Alexandre Défossez. 2025a. https://openreview.net/forum?id=kxFu9rQ0Mu Aligning Spoken Dialogue Models from User Interactions. In Proceedings of the Forty-second International Conference on Machine Learning

  46. [46]

    Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, and 90 others. 2025b. https://doi.org/10.48550/arXiv.2507.16632 Step-Audio 2 Technical Report. arXiv preprint arXiv:2507.16632

  47. [47]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. https://doi.org/10.48550/arXiv.2503.20215 Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215

  48. [48]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025a. https://doi.org/10.48550/arXiv.2505.09388 Qwen3 Technical Report. arXiv preprint arXiv:2505.09388

  49. [49]

    Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, and Yonghui Wu. 2025b. https://openreview.net/forum?id=CcmDDh070o ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction. In Proceedings of the Fourteenth International Conference on Learning Representations

  50. [50]

    Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, and Yequan Wang. 2026. https://doi.org/10.48550/arXiv.2509.02521 FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training. arXiv preprint arXiv:2509.02521

  51. [51]

    Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, and Chao Zhang. 2025. https://openreview.net/forum?id=AsRB5nmlOD SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems

  52. [52]

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.1055 SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773

  53. [53]

    Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2024a. https://openreview.net/forum?id=SKCbZR8Pyd SpeechAlign: Aligning Speech Generation to Human Preferences. In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems

  54. [54]

    He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, and Irwin King. 2026. https://doi.org/10.48550/arXiv.2511.10262 MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models. arXiv preprint arXiv:2511.10262

  55. [55]

    Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao Du, and ShiLiang Zhang. 2025. https://doi.org/10.18653/v1/2025.acl-long.709 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, ...

  56. [56]

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Xu Han, Zihang Xu, Yuanwei Xu, Weilin Zhao, Maosong Sun, and Zhiyuan Liu. 2024b. https://doi.org/10.18653/v1/2024.emnlp-main.644 Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 115...

  57. [57]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. https://doi.org/10.48550/arXiv.2304.11277 PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parall...