arxiv: 2603.17837 · v4 · pith:R4Y3VJSPnew · submitted 2026-03-18 · 📡 eess.AS · cs.CL

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Pith reviewed 2026-05-21 10:39 UTC · model grok-4.3

classification 📡 eess.AS cs.CL
keywords full-duplex spoken dialogue · latent reasoning · internal cognition · spoken dialogue models · ELBO objective · teacher forcing · speech perception · causal reasoning

The pith

FLAIR enables spoken dialogue models to conduct latent reasoning while perceiving speech by recursively reusing embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FLAIR to model internal cognition in full-duplex spoken dialogue systems. FLAIR performs latent thinking simultaneously with speech perception by recursively feeding the previous latent embedding output into the next processing step. This maintains strict causality and adds no latency. Training relies on an Evidence Lower Bound objective with teacher forcing to learn without explicit reasoning annotations. Experiments report competitive results on speech benchmarks and full-duplex interaction metrics.

Core claim

By recursively feeding the latent embedding output from the previous step into the next step during the user's speaking phase, FLAIR conducts continuous internal reasoning simultaneously with speech perception. This adheres strictly to causality without introducing additional latency. An Evidence Lower Bound-based objective supports efficient supervised finetuning via teacher forcing and circumvents the need for explicit reasoning annotations.

What carries the argument

Recursive reuse of latent embeddings fed back as input to the next step while processing incoming speech, optimized via an ELBO objective under teacher forcing.
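
The recursion described above can be sketched in a few lines. This is a toy illustration only: a single tanh layer stands in for the LLM, and the names `latent_step`, `think_while_listening`, `W_audio`, and `W_latent` are hypothetical, not taken from the paper. It shows one latent update per incoming audio frame, with the previous latent fed back as input and no access to future frames.

```python
import numpy as np

def latent_step(W_audio, W_latent, audio_frame, prev_latent):
    """One causal step: mix the current audio frame with the previous latent."""
    return np.tanh(W_audio @ audio_frame + W_latent @ prev_latent)

def think_while_listening(audio_frames, W_audio, W_latent):
    """Produce one latent per incoming frame, feeding each latent back in."""
    latent = np.zeros(W_latent.shape[0])
    latents = []
    for frame in audio_frames:  # frames arrive in order; no lookahead
        latent = latent_step(W_audio, W_latent, frame, latent)
        latents.append(latent)
    return np.stack(latents)

# Toy dimensions and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
d_audio, d_latent, n_frames = 4, 8, 6
W_audio = rng.normal(scale=0.5, size=(d_latent, d_audio))
W_latent = rng.normal(scale=0.5, size=(d_latent, d_latent))
frames = rng.normal(size=(n_frames, d_audio))

latents = think_while_listening(frames, W_audio, W_latent)
```

Because each step consumes exactly one frame and one previous latent, the loop does the same amount of work per frame whether or not the latent is interpreted as reasoning, which is the sense in which the construction adds no latency.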

If this is right

  • Continuous reasoning occurs during the user's speaking phase without adding latency or breaking causality.
  • The model handles conversational dynamics more robustly in full-duplex settings.
  • Competitive performance is reached on a range of speech benchmarks.
  • Full-duplex interaction metrics remain competitive.
  • Reasoning proceeds without requiring explicit annotations or separate post-hoc generation steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recursive latent-state mechanism could extend to other streaming inputs such as video or sensor data for concurrent internal processing.
  • Over longer dialogues the maintained latent state might implicitly track topic coherence or user intent shifts.
  • Combining this latent mode with occasional explicit reasoning triggers could address cases where deeper deliberation is required.

Load-bearing premise

That recursive reuse of latent embeddings combined with an ELBO objective under teacher forcing produces meaningful internal cognitive representations that improve dialogue quality without explicit reasoning annotations or post-hoc generation.

What would settle it

An ablation that disables the recursive latent-embedding feedback, holds all other components fixed, and checks whether full-duplex interaction metrics or dialogue quality drop relative to the full model.
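
A minimal sketch of such an ablation harness follows, under loud assumptions: the model is replaced by a toy tanh recurrence, `run_stream` and `ablation_gap` are hypothetical names, and `metric_fn` is a placeholder for whatever full-duplex or dialogue-quality metric the real evaluation would use. The only point is the control: the feedback path is zeroed while everything else stays fixed.

```python
import numpy as np

def run_stream(frames, W_audio, W_latent, use_feedback=True):
    """Toy recurrence; with use_feedback=False the previous latent is zeroed."""
    latent = np.zeros(W_latent.shape[0])
    out = []
    for frame in frames:
        fed_back = latent if use_feedback else np.zeros_like(latent)
        latent = np.tanh(W_audio @ frame + W_latent @ fed_back)
        out.append(latent)
    return np.stack(out)

def ablation_gap(metric_fn, frames, W_audio, W_latent):
    """Metric of the full model minus metric of the ablated model."""
    full = metric_fn(run_stream(frames, W_audio, W_latent, use_feedback=True))
    ablated = metric_fn(run_stream(frames, W_audio, W_latent, use_feedback=False))
    return full - ablated

# Toy setup, for illustration only.
rng = np.random.default_rng(0)
W_audio = rng.normal(scale=0.5, size=(6, 3))
W_latent = rng.normal(scale=0.5, size=(6, 6))
frames = rng.normal(size=(5, 3))

# Placeholder metric: norm of the final latent (not a real dialogue metric).
gap = ablation_gap(lambda z: float(np.linalg.norm(z[-1])), frames, W_audio, W_latent)
```

A gap near zero on the real metrics would suggest the feedback path is not doing the work the paper attributes to it; a consistent positive gap would support the claim.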

Figures

Figures reproduced from arXiv: 2603.17837 by Chen Chen, Donghang Wu, Eng Siong Chng, Hexin Liu, Tianyu Zhang, Yoshua Bengio, Yuxin Li.

Figure 1: (a) Turn-based interaction. The agent remains idle and starts to respond after the end of the user turn, but cannot be interrupted by user, as shown in the red box. (b) Full-duplex SDLM continuously listens to streaming speech input, supports user barge-in, and automatically switches between thinking and speaking like a human speaker.
Figure 2: The overview of proposed FLAIR. During the user’s speech phase, the LLM performs latent reasoning, using the LLM’s output latent embeddings as the input for the next step. Once the user finishes speaking, the assistant autonomously decides when to respond; the LLM then executes an explicit forward pass, using text tokens as the input for the next step. When the user barges in, the LLM autonomously decides …
Figure 3: The distribution of the input audio, target text, and latent reasoning embeddings. Specifically, the latent reasoning embeddings act as a bridge that connects the input audio with the target text.

Original abstract

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FLAIR, a Full-duplex LAtent and Internal Reasoning method for spoken dialogue models. It models concurrent internal cognition during speech perception by recursively reusing latent embeddings from prior steps as input to subsequent steps, preserving strict causality and zero added latency. An ELBO objective enables supervised fine-tuning under teacher forcing without requiring explicit reasoning annotations. The approach is evaluated on speech benchmarks and full-duplex interaction metrics, where it reports competitive performance.

Significance. If the recursive latent mechanism produces representations that meaningfully capture internal cognition and improve response quality, the work could advance full-duplex conversational systems by aligning model behavior more closely with human-like think-while-listening processes. The parameter-free recursive construction and avoidance of post-hoc generation are notable strengths, as is the use of an ELBO to train latent states without additional annotations. However, the absence of detailed quantitative results, error bars, baselines, or ablations in the provided description limits assessment of whether these gains are robust or merely incremental.

major comments (2)
  1. The abstract and introduction claim competitive results on speech benchmarks and full-duplex metrics, yet supply no numerical values, baselines, error bars, or ablation studies. This absence is load-bearing for the central effectiveness claim and prevents verification that the latent-reasoning component drives the reported gains rather than other architectural choices.
  2. The training objective is defined directly on the same latent embeddings that are asserted to represent internal reasoning, and performance is measured on the identical dialogue data used for optimization. While internally consistent by construction, this setup creates a moderate dependence that requires explicit controls (e.g., held-out reasoning probes or downstream task ablations) to substantiate that the latents encode cognition beyond what is needed for next-token prediction.
minor comments (2)
  1. Notation for the recursive latent update and the precise form of the ELBO should be stated explicitly with equation numbers in the methods section to allow reproduction.
  2. Clarify whether the teacher-forcing schedule during fine-tuning matches the inference-time recursive usage; any mismatch could affect causality claims.
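
The concern in minor comment 2 can be made concrete with a small check. The sketch below assumes a toy recurrent map and hypothetical helper names (`teacher_forced`, `free_running`, `exposure_gap`); how FLAIR actually defines teacher-forced inputs for its latents is not stated on this page. It measures, step by step, how far a free-running rollout (each step sees the model's own previous output) drifts from a teacher-forced one (each step sees a reference state).

```python
import numpy as np

def step(W, x):
    """Toy one-step map standing in for the model."""
    return np.tanh(W @ x)

def teacher_forced(W, reference):
    """Each step is conditioned on the reference state at the previous position."""
    return np.stack([step(W, reference[t - 1]) for t in range(1, len(reference))])

def free_running(W, first, n_steps):
    """Each step is conditioned on the model's own previous output."""
    state, out = first, []
    for _ in range(n_steps):
        state = step(W, state)
        out.append(state)
    return np.stack(out)

def exposure_gap(W, reference):
    """Per-step distance between the teacher-forced and free-running rollouts."""
    tf = teacher_forced(W, reference)
    fr = free_running(W, reference[0], len(reference) - 1)
    return np.linalg.norm(tf - fr, axis=1)

# Toy data, for illustration only.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(5, 5))
reference = rng.normal(size=(4, 5))
gap = exposure_gap(W, reference)
```

The gap is zero at the first step by construction and generally grows afterwards; it vanishes everywhere only when the reference sequence coincides with the model's own rollout, which is the condition under which training and inference would match.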

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below in detail and outline the revisions we will make to improve clarity and substantiation of our claims.

  1. Referee: The abstract and introduction claim competitive results on speech benchmarks and full-duplex metrics, yet supply no numerical values, baselines, error bars, or ablation studies. This absence is load-bearing for the central effectiveness claim and prevents verification that the latent-reasoning component drives the reported gains rather than other architectural choices.

    Authors: We acknowledge that the abstract and introduction, as currently written, summarize the outcomes at a high level without embedding specific numerical results. The full manuscript contains detailed experimental results in the Experiments section, including comparisons against baselines on speech benchmarks and full-duplex metrics, along with ablation studies. To directly address the concern and make the effectiveness claims more immediately verifiable, we will revise the abstract and introduction to incorporate key quantitative highlights (e.g., specific performance deltas and references to the relevant tables) while preserving the overall length constraints.

    Revision: yes

  2. Referee: The training objective is defined directly on the same latent embeddings that are asserted to represent internal reasoning, and performance is measured on the identical dialogue data used for optimization. While internally consistent by construction, this setup creates a moderate dependence that requires explicit controls (e.g., held-out reasoning probes or downstream task ablations) to substantiate that the latents encode cognition beyond what is needed for next-token prediction.

    Authors: We agree that the shared use of dialogue data for both the ELBO objective and evaluation introduces a potential dependence that merits explicit controls. The recursive latent mechanism and ELBO are specifically constructed to optimize for internal state evolution during perception, which is distinct from standard autoregressive prediction; the full-duplex metrics further test real-time conversational behaviors not directly optimized during training. To strengthen this distinction, we will add targeted ablations in the revised manuscript that disable the recursive latent reuse while keeping other components fixed, thereby isolating the contribution of the internal reasoning latents.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper defines FLAIR via a recursive latent-embedding reuse mechanism during speech perception and an ELBO objective under teacher forcing to train without explicit reasoning annotations. Causality is preserved by the recursive structure by design, the ELBO is a standard variational lower bound on the latent states, and competitive results on speech benchmarks plus full-duplex metrics serve as an independent empirical check. No equation or claim reduces the asserted internal cognitive representations to a tautological fit or self-citation chain; the modeling choice and downstream evaluation remain distinct from the training inputs.
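
For orientation, the generic textbook form of an evidence lower bound for a conditional latent-variable model is given below. The notation is assumed here rather than taken from the paper: x is the input speech, y the target response, z the latent reasoning states, p_θ the model, and q_φ a variational posterior. The paper's actual objective may differ in factorization and weighting.

```latex
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \right)
```

The first term rewards latents that help predict the response; the second keeps the posterior over latents close to what the model could infer from the speech alone, which is what would let such an objective be trained without annotated reasoning traces.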

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard latent-variable modeling assumptions and the premise that hidden states can stand in for unannotated internal cognition; no new entities or free parameters are explicitly introduced in the abstract.

axioms (2)
  • Domain assumption: Latent embeddings can serve as proxies for subconscious internal cognitive processing during speech perception.
    Invoked to justify recursive feeding of embeddings as continuous reasoning.
  • Domain assumption: An ELBO objective with teacher forcing enables effective supervised finetuning of the latent reasoning process without explicit annotations.
    Central to the training procedure described.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

    cs.SD · 2026-05 · unverdicted · novelty 5.0

    A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

  2. A Survey of Audio Reasoning in Multimodal Foundation Models

    eess.AS · 2026-05 · unverdicted · novelty 2.0

    A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
