Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Alexandre Défossez; Atsumoto Ohashi; Eugene Kharitonov; Neil Zeghidour

arxiv: 2606.11167 · v1 · pith:ZVWGBEFFnew · submitted 2026-06-09 · 💻 cs.CL · eess.AS

Pith reviewed 2026-06-27 12:51 UTC · model grok-4.3

keywords: full-duplex speech · spoken dialogue models · reinforcement learning alignment · turn-taking · backchanneling · pause handling · user interruption · post-training alignment

The pith

Post-training RL with axis-specific rewards from human audio segments aligns full-duplex speech models on pause handling, turn-taking, backchanneling, and user interruption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that supervised training alone leaves full-duplex spoken dialogue models with poor timing behaviors such as excessive silence and mistimed responses. It shows that a follow-up reinforcement learning stage can target four specific interactivity axes by pulling short clips from real human conversations and scoring the model against axis-tailored reward functions. An added LLM-based quality reward keeps the semantic content of responses from degrading during the alignment. The method is demonstrated on two existing open models and measured both on pre-recorded audio and in live multi-turn conversations.

Core claim

We propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

What carries the argument

Axis-specific RL reward functions derived from short human-conversation audio segments, combined with an auxiliary LLM quality reward, applied as post-training alignment.
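As a rough illustration of how one axis reward and the quality reward might be combined, the sketch below uses a weighted sum for the pause-handling axis. Everything in it is an assumption made for illustration: the `Completion` fields, the linear pause penalty, and the `quality_weight` value are hypothetical, since this summary does not give the paper's functional forms.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    # Seconds of model speech overlapping the user's pause (hypothetical feature).
    speech_during_pause_s: float
    # LLM judge score for response quality, assumed normalized to [0, 1].
    judge_score: float


def pause_handling_reward(c: Completion, pause_len_s: float) -> float:
    """Hypothetical axis reward: 1.0 if the model stays silent during the pause,
    falling linearly to 0.0 if it talks over the whole pause."""
    if pause_len_s <= 0:
        raise ValueError("pause_len_s must be positive")
    # Clamp the overlap to the valid range [0, pause_len_s].
    overlap = min(max(c.speech_during_pause_s, 0.0), pause_len_s)
    return 1.0 - overlap / pause_len_s


def total_reward(c: Completion, pause_len_s: float, quality_weight: float = 0.5) -> float:
    """Weighted sum of the axis reward and the quality reward (weight is an assumption)."""
    return pause_handling_reward(c, pause_len_s) + quality_weight * c.judge_score
```

The additive form is only one option; the point is that the quality term gives the optimizer a reason not to trade away response content for better timing.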

If this is right

Both Moshi and PersonaPlex exhibit consistent gains across the four axes under offline pre-recorded audio evaluation.
The same gains appear in live multi-turn dialogue sessions.
The LLM quality reward keeps response semantics from degrading during alignment.
The approach works as a post-training step rather than requiring full retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same segment-extraction and reward design could be reused on any future full-duplex model without redesigning the underlying architecture.
Extending the four axes to include additional behaviors such as laughter timing or topic-shift handling would require only new segment sets and reward definitions.
If the method scales, voice interfaces could move from turn-based to truly overlapping conversation without separate timing modules.

Load-bearing premise

Short audio segments extracted from human corpora together with their axis-specific reward functions will produce measurable interactivity gains in live multi-turn settings without introducing new failure modes that the added LLM quality reward cannot fully mitigate.

What would settle it

A controlled real-time multi-turn dialogue test in which the aligned model shows no statistically significant improvement, or a decline, on at least one of the four interactivity axes relative to the unaligned base model.
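One way such a test could be run is a paired sign-flip permutation test on per-dialogue scores for a single axis, comparing the base and aligned models on the same dialogues. This is a minimal sketch under stated assumptions: the function name, the choice of test, and the resample count are illustrative and are not taken from the paper.

```python
import random
from statistics import mean


def paired_permutation_pvalue(base, aligned, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-dialogue scores."""
    if len(base) != len(aligned) or not base:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [a - b for a, b in zip(aligned, base)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Applied once per axis, a large p-value or a negative mean difference on any of the four axes would be the kind of evidence described above. The score lists would have to come from actual benchmark runs; none are supplied here.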

Figures

Figures reproduced from arXiv: 2606.11167 by Alexandre Défossez, Atsumoto Ohashi, Eugene Kharitonov, Neil Zeghidour.

**Figure 2.** Example of a conversation between GPT-Realtime (Examiner) and Moshi + RL in the Daily task of Full…

**Figure 3.** System prompt used to compute the LLM Judge reward.

Appendix A, Training Details (excerpt): During generation, we use sampling temperatures of 0.7 and 0.8 for text and audio tokens, respectively, with top-k = 250 for audio tokens. At each of the 100 training epochs, we sample 32 segments (i.e., groups), each of which yields G = 16 completions. Training is distributed across 32 NVIDIA H100 GPUs using Fully Sharded Data Paralle…
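The training details quoted above describe groups of G = 16 completions per segment. A common way to use such groups, as in group-relative policy optimization, is to score each completion relative to the others in its group. Whether the paper uses exactly this normalization is an assumption; the function name and `eps` value below are hypothetical.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Center and scale the rewards of one group of completions.

    Completions that beat their group's mean get a positive advantage;
    eps avoids division by zero when all rewards in the group are equal.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this scheme a group whose completions all receive the same reward contributes zero advantage, so only segments where behavior varies across samples drive the update.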

**Figure 4.** Example of a conversation between GPT-Realtime (Examiner) and Moshi in the Daily task of Full-Duplex-Bench v2 (Turn-taking fluency score: 2.75, Instruction-following score: 2.50).

Adjacent text (excerpt): …all models using UTMOSv2 (Baba et al., 2024), a neural mean opinion score (MOS) predictor. We compute the scores on the agent’s speech segments extracted from the Turn-Taking and User Interruption scenarios of Full-Duplex-Bench v…

**Figure 5.** Dialogue examples from the Safety task of Full-Duplex-Bench v2 between GPT-Realtime and (a) the base…

Original abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable RL post-training recipe for four-axis interactivity in full-duplex models and backs it with real-time tests on two open models.

The colleague should know two things: the method extends prior RL interactivity work by hitting pause handling, turn-taking, backchanneling, and interruption handling at once, and the authors report consistent gains on both offline audio and live multi-turn dialogue for Moshi and PersonaPlex.

What is actually new is the combination of axis-specific rewards drawn from short human-corpus segments plus an LLM quality term to keep semantics intact. The paper does well by moving beyond single-axis or offline-only setups and by testing the aligned models in actual real-time conversation. That dual evaluation setup is a clear step up from abstract claims alone.

The soft spot is the leap from short-segment rewards to continuous live dynamics. Segment-level optimization does not directly encode cross-axis timing or longer context, so it is possible to improve per-axis scores while creating new coordination problems that the LLM reward, being text-based, might miss. The abstract states the real-time results held up, which reduces the concern, but the lack of visible ablations or failure-mode analysis in the provided summary leaves the robustness open to question.

This paper is for researchers working on deployed full-duplex voice systems who need concrete post-training steps. Readers who want a practical alignment recipe with results on open models will find it useful. It deserves a serious referee because the empirical application and live evaluation give referees something concrete to check even if the generalization argument needs tightening.

Referee Report

3 major / 1 minor

Summary. The paper proposes a post-training RL alignment procedure for full-duplex spoken dialogue models that targets four interactivity axes (pause handling, turn-taking, backchanneling, user interruption) by extracting short audio segments from human corpora, applying axis-specific reward functions, and adding an LLM-based response-quality reward to avoid semantic degradation. The method is applied to Moshi and PersonaPlex and evaluated on both offline pre-recorded audio and real-time multi-turn dialogue, with claims of consistent interactivity improvements.

Significance. If the empirical gains are shown to be robust and non-circular, the work would provide a concrete, multi-axis RL recipe that moves beyond token-level supervised training for full-duplex models, offering a practical template for improving natural turn-taking behaviors in open-source systems.

major comments (3)

[Abstract and Evaluation] The central empirical claim rests on short-segment axis rewards generalizing to live multi-turn full-duplex dynamics, yet the manuscript supplies no analysis, ablation, or failure-mode study of cross-axis timing interactions (e.g., whether improved backchanneling disrupts turn-taking). This is load-bearing for the generalization asserted in the abstract and evaluation sections.
[Method] No explicit definitions, functional forms, or hyperparameter values for the four axis-specific reward functions or the LLM quality reward appear in the provided abstract or summary; without these, it is impossible to verify that the reported gains are not artifacts of the reward construction itself.
[Evaluation] The offline and real-time evaluations are described only qualitatively in the abstract; the absence of quantitative metrics, baselines, or statistical significance tests undermines the claim of “consistent improvements” on both Moshi and PersonaPlex.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., a delta on an interactivity metric) to allow readers to gauge effect size without reading the full text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the manuscript's presentation can be strengthened to better support the claims. We address each point below and will revise accordingly.

Referee: [Abstract and Evaluation] The central empirical claim rests on short-segment axis rewards generalizing to live multi-turn full-duplex dynamics, yet the manuscript supplies no analysis, ablation, or failure-mode study of cross-axis timing interactions (e.g., whether improved backchanneling disrupts turn-taking). This is load-bearing for the generalization asserted in the abstract and evaluation sections.

Authors: We agree that an explicit analysis of cross-axis timing interactions would strengthen the generalization argument. While the real-time multi-turn evaluations require the model to handle all axes simultaneously and show no obvious degradation, we did not include a dedicated ablation or failure-mode study. We will add this analysis in the revised manuscript. revision: yes
Referee: [Method] No explicit definitions, functional forms, or hyperparameter values for the four axis-specific reward functions or the LLM quality reward appear in the provided abstract or summary; without these, it is impossible to verify that the reported gains are not artifacts of the reward construction itself.

Authors: The full manuscript defines the axis-specific reward functions and the LLM quality reward (including functional forms and hyperparameters) in Section 3 and the appendix. The referee summary references only the abstract, which is necessarily high-level. We will add a brief summary of the reward definitions to the abstract in revision. revision: yes
Referee: [Evaluation] The offline and real-time evaluations are described only qualitatively in the abstract; the absence of quantitative metrics, baselines, or statistical significance tests undermines the claim of “consistent improvements” on both Moshi and PersonaPlex.

Authors: The abstract is qualitative due to length limits, but the full manuscript reports quantitative metrics, baselines, and significance tests in the Evaluation section for both models. We will incorporate key quantitative highlights and significance statements into the abstract during revision. revision: yes

Circularity Check

0 steps flagged

Empirical RL alignment uses external rewards; no circularity

The paper presents an empirical post-training RL procedure that extracts short audio segments from human conversation corpora to define axis-specific rewards (pause handling, turn-taking, backchanneling, user interruption) plus an LLM-based quality reward. These inputs are independent of the trained model, and improvements are measured on separate offline and real-time evaluations. No equations, derivations, or self-citations reduce the claimed gains to the training objective by construction; the central claim remains an externally falsifiable empirical result rather than a self-definitional or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete information on fitted parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5725 in / 1105 out tokens · 23336 ms · 2026-06-27T12:51:03.979438+00:00 · methodology

Reference graph

Works this paper leans on

57 extracted references · 40 canonical work pages · 16 internal anchors

[1]

Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D'Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, and 65 others. 2025. https://doi.org/10.48550/arXiv.2506.22554 Seamless I...

work page doi:10.48550/arxiv.2506.22554 2025
[2]

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, and Shinji Watanabe. 2026 a . https://doi.org/10.48550/arXiv.2504.08528 On The Landscape of Spoken Language Models : A Comprehensive Survey . arXiv preprint arXiv:2504.08528

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.08528 2026
[3]

Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, and Shinji Watanabe. 2024. https://openreview.net/forum?id=2e4ECh0ikn Talking Turns : Benchmarking Audio Foundation Models on Turn-Taking Dynamics . In Proceedings of the Thirteenth International Conference on Learning Representations

2024
[4]

Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, and Shinji Watanabe. 2026 b . https://doi.org/10.48550/arXiv.2601.19063 Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback . arXiv preprint arXiv:2601.19063

work page doi:10.48550/arxiv.2601.19063 2026
[5]

Kaito Baba, Wataru Nakata, Yuki Saito, and Hiroshi Saruwatari. 2024. https://doi.org/10.1109/SLT61566.2024.10832315 The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech. In Proceedings of the 2024 IEEE Spoken Language Technology Workshop, pages 818--824

work page doi:10.1109/slt61566.2024.10832315 2024
[6]

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, and James Glass. 2026. https://ieeexplore.ieee.org/abstract/document/11464000/ Game-time: Evaluating temporal dynamics in spoken language models . In Proceedings of the 2026 IEEE International Conference on Acoustics , Speech and Signal Pro...

arXiv 2026
[7]

Chen Chen, Ke Hu, Chao-Han Huck Yang, Ankita Pasad, Edresson Casanova, Weiqing Wang, Szu-Wei Fu, Jason Li, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg. 2025 a . https://openreview.net/forum?id=QbLbXz8Idp#discussion Reinforcement Learning Enhanced Full-Duplex Spoken Dialogue Language Models for Conversational Interactions . In Proceedings of the Seco...

2025
[8]

Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, and 17 others. 2025 b . https://doi.org/10.48550/arXiv.2501.06282 MinMo : A Multimodal Large Language Model for Seamless Voice Intera...

work page doi:10.48550/arxiv.2501.06282 2025
[9]

Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, and Zhou Zhao. 2026 a . https://doi.org/10.48550/arXiv.2604.14932 WavAlign : Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training . arXiv preprint arXiv:2604.14932

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.14932 2026
[10]

Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, and Zhou Zhao. 2026b. https://doi.org/10.48550/arXiv.2604.14920 Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models. arXiv preprint arXiv:2604.14920

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.14920 2026
[11]

Christopher Cieri, David Miller, and Kevin Walker. 2004. https://aclanthology.org/L04-1500/ The Fisher Corpus : A Resource for the Next Generations of Speech-to-Text . In Proceedings of the Fourth International Conference on Language Resources and Evaluation

2004
[12]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, and 3416 others. 2025. https://doi.org/10.48550/arXiv.2507.06261 Gemini...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261 2025
[13]

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. https://doi.org/10.48550/arXiv.2410.00037 Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.00037 2024
[14]

Erik Ekstedt and Gabriel Skantze. 2022. https://doi.org/10.21437/Interspeech.2022-10955 Voice Activity Projection: Self-supervised Learning of Turn-taking Events. In Proceedings of Interspeech 2022, pages 5190--5194

work page doi:10.21437/interspeech.2022-10955 2022
[15]

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2024. https://openreview.net/forum?id=PYmrUQmMEw LLaMA-Omni : Seamless Speech Interaction with Large Language Models . In Proceedings of the Thirteenth International Conference on Learning Representations

2024
[16]

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, Haoyu Cao, Di Yin, Long Ma, Xiawu Zheng, Rongrong Ji, Yunsheng Wu, Ran He, Caifeng Shan, and Xing Sun. 2025. https://doi.org/10.48550/arXiv.2408.05211 VITA : Towards Open-Source Interactive Omni Multimodal LLM . arXiv preprint arXiv:2...

work page doi:10.48550/arxiv.2408.05211 2025
[17]

Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, and Jingbo Zhu. 2025. https://doi.org/10.48550/arXiv.2509.22243 FLEXI : Benchmarking Full-duplex Human-LLM Speech Interaction . arXiv preprint arXiv:2509.22243

work page doi:10.48550/arxiv.2509.22243 2025
[18]

Mattias Heldner and Jens Edlund. 2010. https://doi.org/10.1016/j.wocn.2010.08.002 Pauses, gaps and overlaps in conversations . Journal of Phonetics, 38(4):555--568

work page doi:10.1016/j.wocn.2010.08.002 2010
[19]

Chi-Yuan Hsiao, Ke-Han Lu, Yu-Kuan Fu, Guan-Ting Lin, Hsiao-Tsung Hung, and Hung-yi Lee. 2026. https://doi.org/10.48550/arXiv.2604.10065 ASPIRin : Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models . arXiv preprint arXiv:2604.10065

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10065 2026
[20]

Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, and Boris Ginsburg. 2025. https://doi.org/10.21437/Interspeech.2025-874 Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model. In Proceedings of Interspeech 2025, pages 2715--2719

work page doi:10.21437/interspeech.2025-874 2025
[21]

Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, and Zhou Zhao. 2024. https://doi.org/10.48550/arXiv.2411.13577 WavChat : A Survey of Spoken Dialogue Models . arXiv preprint arXiv:2411.13577

work page doi:10.48550/arxiv.2411.13577 2024
[22]

Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, and Zhou Zhao. 2025. https://doi.org/10.48550/arXiv.2505.09558 WavReward : Spoken Dialogue Models With Generalist Reward Evaluators . arXiv preprint arXiv:2505.09558

work page doi:10.48550/arxiv.2505.09558 2025
[23]

Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, and Di Zhang. 2025. https://doi.org/10.48550/arXiv.2502.13472 FlexDuo : A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems . arXiv preprint arXiv:2502.13472

work page doi:10.48550/arxiv.2502.13472 2025
[24]

Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung-yi Lee. 2026. https://doi.org/10.48550/arXiv.2510.07838 Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner. arXiv preprint arXiv:2510.07838

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.07838 2026
[25]

Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung-yi Lee. 2025a. https://doi.org/10.48550/arXiv.2503.04721 Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities. arXiv preprint arXiv:2503.04721

work page doi:10.48550/arxiv.2503.04721 2025
[26]

Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, and Ivan Bulyko. 2025b. https://doi.org/10.18653/v1/2025.acl-long.997 Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pag...

work page doi:10.18653/v1/2025.acl-long.997 2025
[27]

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. 2026. https://doi.org/10.48550/arXiv.2601.05242 GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization. arXiv preprint arXiv:2...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.05242 2026
[28]

Ilya Loshchilov and Frank Hutter. 2018. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled Weight Decay Regularization . In Proceedings of the Sixth International Conference on Learning Representations

2018
[29]

Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed, and Emmanuel Dupoux. 2023. https://doi.org/10.1162/tacl_a_00545 Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics, 11:250--266

work page doi:10.1162/tacl_a_00545 2023
[30]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, and 262 others. 2024. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 Te...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024
[31]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. https://openreview.net/forum?id=TG8KACxEON Training language ...

2022
[32]

Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, and Eng Siong Chng. 2025. https://doi.org/10.21437/Interspeech.2025-739 FD-Bench : A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems . In Interspeech 2025, pages 176--180

work page doi:10.21437/interspeech.2025-739 2025
[33]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. https://openreview.net/forum?id=HPuSIXJaa9 Direct Preference Optimization : Your Language Model is Secretly a Reward Model . In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems

2023
[34]

Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. 2026. https://doi.org/10.48550/arXiv.2602.06053 PersonaPlex : Voice and Role Control for Full Duplex Conversational Speech Models . arXiv preprint arXiv:2602.06053

work page doi:10.48550/arxiv.2602.06053 2026
[35]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://doi.org/10.48550/arXiv.1707.06347 Proximal Policy Optimization Algorithms . arXiv preprint arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1707.06347 2017
[36]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://doi.org/10.48550/arXiv.2402.03300 DeepSeekMath : Pushing the Limits of Mathematical Reasoning in Open Language Models . arXiv preprint arXiv:2402.03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300 2024
[37]

Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, and Zhiting Hu. 2025. https://doi.org/10.48550/arXiv.2505.02707 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play . arXiv preprint arXiv:2505.02707

work page doi:10.48550/arxiv.2505.02707 2025
[38]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html Learning to summarize with human feedback . In Proceedings of the Thirty-fourth Advances in Neural Information Processing Sy...

2020
[39]

Silero Team. 2024. https://github.com/snakers4/silero-vad Silero VAD : Pre-trained enterprise-grade voice activity detector ( VAD ), number detector and language classifier

2024
[40]

Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, and Shyamnath Gollakota. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1192 Beyond Turn-Based Interfaces : Synchronous LLMs as Full-Duplex Dialogue Agents . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 21390--21402

work page doi:10.18653/v1/2024.emnlp-main.1192 2024
[41]

Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, and Yuanjun Xiong. 2024. https://openreview.net/forum?id=YawXY6mWiK&referrer= In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems

2024
[42]

Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, and Peilin Zhao. 2025 a . https://openreview.net/forum?id=5Gke1dfRVA NTPP : Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction . In Proceedings of the Forty-second International Conference on Machine Learning

2025
[43]

Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. 2025b. https://openreview.net/forum?id=s1EImzs5Id Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM. In Proceedings of the Forty-second International Conference on Machine Learning

2025
[44]

Ronald J. Williams. 1992. https://doi.org/10.1007/BF00992696 Simple statistical gradient-following algorithms for connectionist reinforcement learning . Machine Learning, 8(3):229--256

work page doi:10.1007/bf00992696 1992
[45]

Anne Wu, Laurent Mazaré, Neil Zeghidour, and Alexandre Défossez. 2025a. https://openreview.net/forum?id=kxFu9rQ0Mu Aligning Spoken Dialogue Models from User Interactions. In Proceedings of the Forty-second International Conference on Machine Learning

2025
[46]

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, and 90 others. 2025b. https://doi.org/10.48550/arXiv.2507.16632 Step-Audio 2 Technical Report. arXiv preprint arXiv:2507.16632

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.16632 2025
[47]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. https://doi.org/10.48550/arXiv.2503.20215 Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20215 2025
[48]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025 a . https://doi.org/10.48550/arXiv.2505.09388 Qwen3 Technical Report . arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
[49]

Shu-wen Yang, Ming Tu, Andy T. Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, and Yonghui Wu. 2025 b . https://openreview.net/forum?id=CcmDDh070o ParaS2S : Benchmarking and Aligning Spoken Language Models for Paralinguistic-aware Speech-to-Speech Interaction . In Proceedings of the Fourteenth International Conference on Learning Representations

2025
[50]

Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, and Yequan Wang. 2026. FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training. arXiv preprint arXiv:2509.02521. https://doi.org/10.48550/arXiv.2509.02521
[51]

Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, and Chao Zhang. 2025. SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=AsRB5nmlOD
[52]

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757-15773. https://doi.org/10.18653/v1/2023.findings-emnlp.1055
[53]

Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2024a. SpeechAlign: Aligning Speech Generation to Human Preferences. In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=SKCbZR8Pyd
[54]

He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, and Irwin King. 2026. MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models. arXiv preprint arXiv:2511.10262. https://doi.org/10.48550/arXiv.2511.10262
[55]

Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao Du, and ShiLiang Zhang. 2025. OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, ... https://doi.org/10.18653/v1/2025.acl-long.709
[56]

Xinrong Zhang, Yingfa Chen, Shengding Hu, Xu Han, Zihang Xu, Yuanwei Xu, Weilin Zhao, Maosong Sun, and Zhiyuan Liu. 2024b. Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 115... https://doi.org/10.18653/v1/2024.emnlp-main.644
[57]

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parall... https://doi.org/10.48550/arXiv.2304.11277
