Pith. Machine review for the scientific record.

arXiv: 2604.10065 · v1 · submitted 2026-04-11 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

Recognition: unknown

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models


Pith reviewed 2026-05-10 16:59 UTC · model grok-4.3

keywords: full-duplex speech language models · reinforcement learning · action space projection · turn-taking · semantic coherence · degenerative repetition · GRPO · interactivity optimization

The pith

Decoupling speech timing from content selection prevents repetition while improving turn-taking in full-duplex models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that projects the full vocabulary into a simple binary choice between speaking and staying silent. This projection lets reinforcement learning focus on timing decisions such as when to interrupt or pause, without mixing those decisions into every word choice. Standard approaches that optimize everything together quickly produce repetitive output and lose coherence. Readers care because the separation keeps responses meaningful while making real-time voice interactions feel more natural and less robotic.

Core claim

ASPIRin maps the text vocabulary into a coarse-grained binary state of active speech versus inactive silence. Applying Group Relative Policy Optimization with rule-based rewards on this reduced space balances user interruption and response latency. Isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50 percent compared to standard GRPO, effectively eliminating degenerative repetition.

What carries the argument

Action Space Projection, which collapses the text vocabulary to a binary active-speech versus inactive-silence state so that timing can be optimized separately from content.
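As a rough illustration of the projection (not the authors' implementation), the sketch below reads the Figure 1 description, "grouping and summing non-padding and padding logits", as summing softmax probability mass over each group. The function name project_to_binary_state, the pad_ids argument, and the toy logits are assumptions made for this example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def project_to_binary_state(logits, pad_ids):
    """Collapse per-token logits into (p_active, p_silent).

    logits:  list of floats, one per vocabulary entry.
    pad_ids: set of vocabulary indices treated as padding (silence) tokens.
             Which tokens count as padding is an assumption of this sketch.
    """
    pad = [x for i, x in enumerate(logits) if i in pad_ids]
    non_pad = [x for i, x in enumerate(logits) if i not in pad_ids]
    if not pad or not non_pad:
        raise ValueError("both the padding and non-padding groups must be non-empty")
    log_z = logsumexp(logits)  # log of the softmax normalizer
    p_silent = math.exp(logsumexp(pad) - log_z)      # mass on padding tokens
    p_active = math.exp(logsumexp(non_pad) - log_z)  # mass on all other tokens
    return p_active, p_silent


# Toy example: a four-token vocabulary where index 3 is the padding token.
p_active, p_silent = project_to_binary_state([2.0, 0.5, -1.0, 0.0], pad_ids={3})
print(p_active, p_silent)
```

The two returned probabilities always sum to one, so the projected policy is a valid distribution over the binary state. How these state probabilities enter the GRPO objective is not specified on this page.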

If this is right

  • Optimizes turn-taking, backchanneling, and pause handling in full-duplex settings.
  • Preserves semantic coherence while training for interactivity.
  • Reduces duplicate n-grams by more than 50 percent relative to standard GRPO.
  • Balances user interruption against response latency with rule-based rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same binary projection idea could be tested on other sequential tasks where timing and content decisions compete during training.
  • Smaller action spaces from this projection may allow faster policy updates or lower memory use in reinforcement learning for dialogue systems.
  • Extending the binary state to a few more coarse levels such as low-volume backchannels might add finer interactivity control without reintroducing repetition.

Load-bearing premise

Mapping the text vocabulary into a coarse-grained binary state of active speech versus inactive silence is sufficient to optimize timing without losing information critical to semantic quality.

What would settle it

A side-by-side run of ASPIRin against standard GRPO on a full-duplex conversation benchmark that measures both duplicate n-gram rate and semantic coherence scores; if the duplicate reduction drops below 50 percent or coherence falls, the central claim fails.
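A minimal sketch of how such a check could be scored, assuming the duplicate n-gram portion is defined as one minus the ratio of unique n-grams to total n-grams. That definition, the choice of n, and the helper names duplicate_ngram_portion and relative_reduction are assumptions; this page does not state the paper's exact metric.

```python
def duplicate_ngram_portion(tokens, n=4):
    """Fraction of n-grams in `tokens` that repeat an n-gram already seen.

    Returns 0.0 when the sequence is too short to contain any n-gram.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def relative_reduction(baseline, candidate):
    """Relative drop of `candidate` with respect to `baseline` (0.5 means 50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


# Toy example: a heavily repetitive sequence versus a non-repetitive one.
repetitive = "a b a b a b".split()
varied = "a b c d e f".split()
base = duplicate_ngram_portion(repetitive, n=2)
ours = duplicate_ngram_portion(varied, n=2)
print(base, ours, relative_reduction(base, ours))
```

Under this reading, the central claim would require relative_reduction to exceed 0.5 when the baseline is standard GRPO output and the candidate is ASPIRin output on the same prompts.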

Figures

Figures reproduced from arXiv: 2604.10065 by Chi-Yuan Hsiao, Guan-Ting Lin, Hsiao-Tsung Hung, Hung-yi Lee, Ke-Han Lu, Yu-Kuan Fu.

Figure 1: Overview of the ASPIRin framework. (a) Action Space Projection & State Policy Optimization: The fine-grained text vocabulary is decoupled into a coarse-grained binary state (Active Speech vs. Inactive Silence) by grouping and summing non-padding and padding logits. This projected state policy is then explicitly optimized. (b) Rule-Based Rewards: The state policy is guided by continuous temporal constraints…
Figure 2: Comparison of training reward dynamics between standard GRPO and ASPIRin.
Accompanying text from Section 4.2 (Analysis of Reward Dynamics): Standard GRPO and ASPIRin both display an upward trend in total reward throughout training, yet their Interruption Score dynamics differ dramatically. As shown in Figures 2a and 2c, standard GRPO exhibits severe instability, featuring rapid oscillations and a consistent downward trend that signals cl…
Original abstract

End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ASPIRin, a reinforcement learning framework for full-duplex Speech Language Models that decouples timing (when to speak) from content (what to say) via Action Space Projection, which maps the text vocabulary to a coarse binary state of active speech versus inactive silence. GRPO is then applied with rule-based rewards to optimize interactivity metrics such as turn-taking, backchanneling, and pause handling. The central empirical claim is that this isolation preserves semantic coherence while reducing the portion of duplicate n-grams by over 50% relative to standard GRPO, thereby mitigating degenerative repetition.

Significance. If the decoupling via binary projection can be shown to retain critical semantic and timing cues without collapse, the approach would address a key failure mode in RL for conversational SLMs and enable more natural full-duplex interaction. The explicit separation of action spaces is a targeted intervention that could generalize to other latency-sensitive generation tasks, but its value hinges on rigorous validation of the preservation claim.

major comments (3)
  1. Section 3.2: The Action Space Projection is defined as a simple binary mapping of the full vocabulary to {speech, silence} states prior to GRPO. No ablation on projection granularity (e.g., finer token-level or lexical-category states) is reported, leaving open whether the coarse mapping erases distinctions needed for backchannels, pauses, or semantic coherence as hypothesized in the skeptic analysis.
  2. Results section (referenced in abstract): The claim of >50% reduction in duplicate n-grams and elimination of degenerative repetition is presented without experimental setup details, baseline comparisons beyond standard GRPO, statistical significance tests, dataset descriptions, or ablation studies on the projection step, rendering the quantitative result unverifiable and the central claim unsupported.
  3. Reward formulation (Section 3): The rule-based rewards used to balance user interruption and response latency are not explicitly defined or derived; without their precise functional form or sensitivity analysis, it is unclear whether the reported interactivity gains are robust or merely artifacts of the reduced action space.
minor comments (2)
  1. The abstract and introduction would benefit from a brief statement of the underlying SLM architecture and training corpus to contextualize the GRPO application.
  2. Notation for the projected states (active/inactive) should be formalized with an equation or pseudocode for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the manuscript lacks sufficient clarity or supporting analyses, we will revise accordingly to strengthen the presentation and verifiability of our claims.

Point-by-point responses
  1. Referee: Section 3.2: The Action Space Projection is defined as a simple binary mapping of the full vocabulary to {speech, silence} states prior to GRPO. No ablation on projection granularity (e.g., finer token-level or lexical-category states) is reported, leaving open whether the coarse mapping erases distinctions needed for backchannels, pauses, or semantic coherence as hypothesized in the skeptic analysis.

    Authors: The binary projection is a core design decision to enforce explicit decoupling of timing from content, directly targeting the entanglement that produces repetition under standard GRPO. Our results show that semantic coherence and backchanneling/pause handling are preserved, consistent with the hypothesized benefits. We agree that granularity ablations would provide additional support. In the revision we will expand Section 3.2 with theoretical justification for the binary choice and include a new ablation comparing binary versus lexical-category projections. revision: partial

  2. Referee: Results section (referenced in abstract): The claim of >50% reduction in duplicate n-grams and elimination of degenerative repetition is presented without experimental setup details, baseline comparisons beyond standard GRPO, statistical significance tests, dataset descriptions, or ablation studies on the projection step, rendering the quantitative result unverifiable and the central claim unsupported.

    Authors: We regret that the experimental details were not presented with sufficient prominence. Section 4 of the manuscript describes the datasets, the duplicate n-gram metric computation, and comparisons against standard GRPO; statistical significance was assessed via paired tests. We will revise the Results section to explicitly restate all setup elements, expand baseline comparisons, report exact p-values, and add a dedicated ablation isolating the projection step so that the >50% reduction claim is fully verifiable. revision: yes

  3. Referee: Reward formulation (Section 3): The rule-based rewards used to balance user interruption and response latency are not explicitly defined or derived; without their precise functional form or sensitivity analysis, it is unclear whether the reported interactivity gains are robust or merely artifacts of the reduced action space.

    Authors: The rewards are defined in Section 3 as a linear combination of an interruption penalty term and a latency reward term. We will make the exact functional forms and weighting coefficients explicit, include their derivation from the interactivity objectives, and add a sensitivity analysis over the weighting hyperparameters in the revised manuscript to demonstrate that the gains are robust. revision: yes
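To make the structure described in this response concrete, the sketch below combines a hypothetical interruption penalty and latency reward linearly, then applies the group-relative normalization that GRPO is known for (each reward minus the group mean, divided by the group standard deviation). The weights w_int and w_lat, the toy rollout values, and the use of the population standard deviation are assumptions; the paper's actual functional forms are not given on this page.

```python
from statistics import mean, pstdev


def combined_reward(interruption_penalty, latency_reward, w_int=1.0, w_lat=1.0):
    """Linear combination of the two terms; the weights are hypothetical."""
    return w_lat * latency_reward - w_int * interruption_penalty


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four rollouts: (interruption_penalty, latency_reward) pairs.
rollouts = [(0.0, 0.8), (1.0, 0.9), (0.0, 0.2), (0.5, 0.5)]
rewards = [combined_reward(p, l) for p, l in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group, only the ranking and spread of rewards within a group matter, which is one reason a sensitivity analysis over the weights would be informative.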

Circularity Check

0 steps flagged

No significant circularity in ASPIRin derivation chain

Full rationale

The paper introduces Action Space Projection as an explicit, rule-based design choice that maps the full text vocabulary to a coarse binary state space ({active speech, inactive silence}) before applying GRPO with separate rule-based rewards. This mapping is defined independently of the parameters being optimized and does not reduce any claimed prediction or result to its own inputs by construction. No self-citations, uniqueness theorems, or fitted inputs are invoked as load-bearing justifications for the central decoupling or the reported >50% reduction in duplicate n-grams; those outcomes are presented as empirical consequences of the method rather than tautological consequences of the projection definition. The derivation remains self-contained with externally verifiable components (standard GRPO plus hand-specified rewards).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that a binary projection of the vocabulary suffices for timing control and on standard RL assumptions that are not re-derived here. No free parameters or additional invented entities beyond the projection itself are quantified in the abstract.

axioms (1)
  • domain assumption: Reinforcement learning can be applied to a projected binary action space while the original token-level policy remains intact.
    Implicit in the decoupling claim; the abstract does not derive or justify this separation.
invented entities (1)
  • Action Space Projection (no independent evidence)
    purpose: Maps full text vocabulary to binary active-speech vs. inactive-silence states to isolate timing decisions.
    Core technical contribution introduced to enable the GRPO training without semantic degradation.

pith-pipeline@v0.9.0 · 5486 in / 1271 out tokens · 66133 ms · 2026-05-10T16:59:31.921699+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 40 canonical work pages · 19 internal anchors

  1. [1]

    ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    Introduction. Traditional spoken dialogue systems have long relied on a cascaded architecture, pipelining audio through independent Automatic Speech Recognition (ASR) [1–9], Large Language Models (LLMs) [10–16], and Text-to-Speech (TTS) [17–25] modules. While effective for basic information retrieval, this disjointed pipeline introduces compoundi...

  2. [2]

    Methodology. As illustrated in Figure 1, we propose ASPIRin, an alignment framework designed to optimize the temporal dynamics of full-duplex speech models parameterized by θ. Unlike standard approaches that treat audio generation as a unified sequence task, ASPIRin decouples when to speak from what to say by replacing fine-grained token optimization with ...

  3. [3]

    Experimental Setup

    Experiments. 3.1. Experimental Setup. Training Data. We utilize a 43-hour in-house dataset of natural conversational speech (approx. 1,300 two-minute, dual-channel clips). This dataset was collected with explicit speaker consent and rigorously anonymized to ensure privacy compliance. We process the audio using the nvidia/parakeet-tdt-0.6b-v3 ASR model [9]...

  4. [4]

    speak or not

    Results and Analysis. 4.1. Main Results. Establishing a Strong Baseline. We establish a strong heuristic baseline by introducing a 3-second prompt delay to the base Moshi model in Table 1. This simple modification yields substantial improvements: Takeover Rate (TOR) drops by 49%–57% in pause handling and backchanneling scenarios, while the GPT-4o seman...

  5. [5]

    speak or not

    Conclusion. We introduced ASPIRin, an interactivity-optimized reinforcement learning framework resolving the tension between temporal dynamics and semantic coherence in full-duplex SLMs. While standard GRPO burdens fine-grained token policies and suffers from aggressive, repetitive generation, ASPIRin utilizes Action Space Projection to map vocabulary ...

  6. [6]

    Generative AI was not used to produce any significant portion of the manuscript’s original content, ideas, or research findings

    Generative AI Use Disclosure. During the preparation of this work, the authors used Generative AI tools exclusively for editing and polishing the manuscript to improve overall readability. Generative AI was not used to produce any significant portion of the manuscript’s original content, ideas, or research findings. All co-authors consent to this submi...

  7. [7]

    We are also grateful to Steve Chung-Cheng Chen, Tsung-Ying Yang, Jen-Hao Cheng, and Dau-Cheng Lyu for their insightful discussions and feedback

    Acknowledgements. We thank the ASUS Open Cloud Infrastructure Software Center for providing the essential resources that supported this work. We are also grateful to Steve Chung-Cheng Chen, Tsung-Ying Yang, Jen-Hao Cheng, and Dau-Cheng Lyu for their insightful discussions and feedback. Additionally, this research was supported by the National Center for ...

  8. [8]

    Robust Speech Recognition via Large-Scale Weak Supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2212.04356

  9. [9]

    Qwen3-ASR Technical Report

    X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-asr technical report,” arXiv preprint arXiv:2601.21337, 2026

  10. [10]

    Mandarin-english code-switching speech recognition with self-supervised speech representation models

    L.-H. Tseng, Y.-K. Fu, H.-J. Chang, and H.-y. Lee, “Mandarin-english code-switching speech recognition with self-supervised speech representation models,” arXiv preprint arXiv:2110.03504, 2021

  11. [11]

    Reborn: Reinforcement-learned boundary segmentation with iterative training for unsupervised asr,

    L.-H. Tseng, E.-P. Hu, C.-H. Chiang, Y. Tseng, H.-y. Lee, L.-s. Lee, and S.-H. Sun, “Reborn: Reinforcement-learned boundary segmentation with iterative training for unsupervised asr,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, ...

  12. [12]

    Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision

    C.-K. Yang, K.-P. Huang, K.-H. Lu, C.-Y. Kuan, C.-Y. Hsiao, and H.-Y. Lee, “Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision,” in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (IC...

  13. [13]

    Do prompts really prompt? exploring the prompt understanding capability of whisper

    C.-K. Yang, K.-P. Huang, and H.-Y. Lee, “Do prompts really prompt? exploring the prompt understanding capability of whisper,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1–8

  14. [14]

    Enhancing multilingual asr for unseen languages via language embedding modeling

    S.-S. Huang, K.-P. Huang, A. T. Liu, and H.-Y. Lee, “Enhancing multilingual asr for unseen languages via language embedding modeling,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  15. [15]

    A self-refining framework for enhancing asr using tts-synthesized data

    C.-K. Chou, C.-J. Hsu, H.-L. Chung, L.-H. Tseng, H.-C. Cheng, Y.-K. Fu, K. P. Huang, and H.-Y. Lee, “A self-refining framework for enhancing asr using tts-synthesized data,” arXiv preprint arXiv:2506.11130, 2025

  16. [16]

    Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast

    M. Sekoyan et al., “Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast,” 2025. [Online]. Available: https://arxiv.org/abs/2509.14128

  17. [17]

    GPT-4o System Card

    OpenAI et al., “Gpt-4o system card,” 2024. [Online]. Available: https://arxiv.org/abs/2410.21276

  18. [18]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  19. [19]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  20. [20]

    Qwen Technical Report

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

  21. [21]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  22. [22]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

  23. [23]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948

  24. [24]

    Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” arXiv preprint arXiv:2407.05407, 2024

  25. [25]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

  26. [26]

    Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training

    Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., “Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training,” arXiv preprint arXiv:2505.17589, 2025

  27. [27]

    Qwen3-tts technical report

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-tts technical report,” arXiv preprint arXiv:2601.15621, 2026

  28. [28]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023

  29. [29]

    VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

    S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv preprint arXiv:2406.05370, 2024

  30. [30]

    XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,

    E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, “XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,” in Interspeech 2024, 2024, pp. 4978–4982

  31. [31]

    Breezyvoice: Adapting tts for taiwanese mandarin with enhanced polyphone disambiguation – challenges and insights,

    C.-J. Hsu et al., “Breezyvoice: Adapting tts for taiwanese mandarin with enhanced polyphone disambiguation – challenges and insights,” arXiv preprint arXiv:2501.17790, 2025

  32. [32]

    The breeze 2 herd of models: Traditional chinese llms based on llama with vision-aware and function-calling capabilities

    C.-J. Hsu, C.-S. Liu, M.-H. Chen, M. Chen, P.-C. Hsu, Y.-C. Chen, and D.-S. Shiu, “The breeze 2 herd of models: Traditional chinese llms based on llama with vision-aware and function-calling capabilities,” arXiv preprint arXiv:2501.13921, 2025

  33. [33]

    Building a taiwanese mandarin spoken language model: A first attempt,

    C.-K. Yang et al., “Building a taiwanese mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111, 2024

  34. [34]

    Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

    C.-Y. Hsiao et al., “Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models,” in Interspeech 2025, 2025, pp. 3234–3238

  35. [35]

    Desta: Enhancing speech language models through descriptive speech-text alignment,

    K.-H. Lu, Z. Chen, S.-W. Fu, H. Huang, B. Ginsburg, Y.-C. F. Wang, and H.-y. Lee, “Desta: Enhancing speech language models through descriptive speech-text alignment,” in Interspeech 2024, 2024, pp. 4159–4163

  36. [36]

    Developing instruction-following speech language model without speech instruction-tuning data

    K.-H. Lu et al., “Developing instruction-following speech language model without speech instruction-tuning data,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  37. [37]

    DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,

    K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, J. Balam, B. Ginsburg, Y.-C. F. Wang, and H.-Y. Lee, “Desta2.5-audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,” arXiv preprint arXiv:2507.02768, 2025

  38. [38]

    A preliminary exploration with gpt-4o voice mode,

    Y.-X. Lin et al., “A preliminary exploration with gpt-4o voice mode,” arXiv preprint arXiv:2502.09940, 2025

  39. [39]

    Reducing object hallucination in large audio-language models via audio-aware decoding

    T.-w. Hsu et al., “Reducing object hallucination in large audio-language models via audio-aware decoding,” arXiv preprint arXiv:2506.07233, 2025

  40. [40]

    Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,

    C.-Y. Kuan, C.-K. Yang, W.-P. Huang, K.-H. Lu, and H.-y. Lee, “Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 1060–1067

  41. [41]

    Stitch: Simultaneous thinking and talking with chunked reasoning for spoken language models,

    C.-H. Chiang, X. Wang, L. Li, C.-C. Lin, K. Lin, S. Liu, Z. Wang, Z. Yang, H.-y. Lee, and L. Wang, “Stitch: Simultaneous thinking and talking with chunked reasoning for spoken language models,” arXiv preprint arXiv:2507.15375, 2025

  42. [42]

    On The Landscape of Spoken Language Models: A Comprehensive Survey

    S. Arora et al., “On the landscape of spoken language models: A comprehensive survey,” arXiv preprint arXiv:2504.08528, 2025

  43. [43]

    Speechprompt: Prompting speech language models for speech processing tasks,

    K.-W. Chang, H. Wu, Y.-K. Wang, Y.-K. Wu, H. Shen, W.-C. Tseng, I.-t. Kang, S.-W. Li, and H.-y. Lee, “Speechprompt: Prompting speech language models for speech processing tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

  44. [44]

    Speechprompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks

    K.-W. Chang, W.-C. Tseng, S.-W. Li, and H.-y. Lee, “Speechprompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks,” arXiv preprint arXiv:2203.16773, 2022

  45. [45]

    Speechprompt v2: Prompt tuning for speech classification tasks,

    K.-W. Chang, Y.-K. Wang, H. Shen, I.-t. Kang, W.-C. Tseng, S.-W. Li, and H.-y. Lee, “Speechprompt v2: Prompt tuning for speech classification tasks,” arXiv preprint arXiv:2303.00733, 2023

  46. [46]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  47. [47]

    Taste: Text-aligned speech tokenization and embedding for spoken language modeling,

    L.-H. Tseng, Y.-C. Chen, K.-Y. Lee, D.-S. Shiu, and H. yi Lee, “Taste: Text-aligned speech tokenization and embedding for spoken language modeling,” 2026. [Online]. Available: https://arxiv.org/abs/2504.07053

  48. [48]

    Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks

    C. yu Huang et al., “Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=s7lzZpAW7T

  49. [49]

    Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech

    C.-y. Huang et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12136–12140

  50. [50]

    Language model can listen while speaking,

    Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen, “Language model can listen while speaking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24831–24839

  51. [51]

    Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

    K. Hu, E. Hosseini-Asl, C. Chen, E. Casanova, S. Ghosh, P. Żelasko, Z. Chen, J. Li, J. Balam, and B. Ginsburg, “Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model,” in Interspeech 2025, 2025, pp. 2715–2719

  52. [52]

    Personaplex: Voice and role control for full duplex conversational speech models

    R. Roy, J. Raiman, S. gil Lee, T.-D. Ene, R. Kirby, S. Kim, J. Kim, and B. Catanzaro, “Personaplex: Voice and role control for full duplex conversational speech models,” 2026. [Online]. Available: https://arxiv.org/abs/2602.06053

  53. [53]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

  54. [54]

    Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

    G.-T. Lin, S.-Y. S. Kuan, J. Shi, K.-W. Chang, S. Arora, S. Watanabe, and H. yi Lee, “Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an automated examiner,” 2025. [Online]. Available: https://arxiv.org/abs/2510.07838

  55. [55]

    Towards holistic evaluation of large audio-language models: A comprehensive survey

    C.-K. Yang et al., “Towards holistic evaluation of large audio-language models: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 10144–10170. [Online...

  56. [56]

    Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

    K.-W. Chang et al., “Game-time: Evaluating temporal dynamics in spoken language models,” 2025. [Online]. Available: https://arxiv.org/abs/2509.26388

  57. [57]

    Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

    G.-T. Lin, S.-Y. S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H. yi Lee, “Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models,” 2026. [Online]. Available: https://arxiv.org/abs/2507.23159

  58. [58]

    Aligning spoken dialogue models from user interactions

    A. Wu et al., “Aligning spoken dialogue models from user interactions,” in International Conference on Machine Learning. PMLR, 2025, pp. 67476–67498

  59. [59]

    Align-SLM: Textless spoken language models with reinforcement learning from AI feedback,

    G.-T. Lin, P. G. Shivakumar, A. Gourav, Y. Gu, A. Gandhe, H.-y. Lee, and I. Bulyko, “Align-SLM: Textless spoken language models with reinforcement learning from AI feedback,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, ...

  60. [60]

    Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions

    C. Chen, K. Hu, C.-H. H. Yang, A. Pasad, E. Casanova, W. Wang, S.-W. Fu, J. Li, Z. Chen, J. Balam et al., “Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions,” in Second Conference on Language Modeling, 2025

  61. [61]

    Optimizing conversational quality in spoken dialogue systems with reinforcement learning from ai feedback,

    S. Arora, J. Tian, J. Shi, H. Futami, Y. Kashiwagi, E. Tsunoo, and S. Watanabe, “Optimizing conversational quality in spoken dialogue systems with reinforcement learning from ai feedback,” arXiv preprint arXiv:2601.19063, 2026

  62. [62]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024

  63. [63]

    Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,

    G.-T. Lin et al., “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,”

  64. [64]

    H., and Lee, H.-y

    [Online]. Available: https://arxiv.org/abs/2503.04721

  65. [65]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  66. [66]

    Neural text generation with unlikelihood training,

    S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston, “Neural text generation with unlikelihood training,” in International Conference on Learning Representations,

  67. [67]

    Available: https://openreview.net/forum?id=SJeYe0NtvH

    [Online]. Available: https://openreview.net/forum?id=SJeYe0NtvH

  68. [68]

    Texygen: A benchmarking platform for text generation models,

    Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, and Y. Yu, “Texygen: A benchmarking platform for text generation models,” SIGIR, 2018