The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
Pith reviewed 2026-05-15 08:32 UTC · model grok-4.3
The pith
Spoken dialogue models can perform continuous internal reasoning while listening by recursively updating latent embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that internal cognition in spoken dialogue can be modeled as recursive updates to latent embeddings: each step takes the embedding produced in the prior step as input, so the model can keep reasoning while speech continues to arrive. The mechanism preserves strict causality, adds no latency, and is trained with an ELBO-based objective that needs no separate reasoning labels.
What carries the argument
FLAIR (Full-duplex LAtent and Internal Reasoning), which recursively feeds the latent embedding from the previous time step back into the model to sustain ongoing internal processing during speech perception.
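The recursive feedback loop can be sketched in a few lines. Everything below (the function name, dimensions, the tanh update, the random projection standing in for a trained model) is illustrative, not taken from the paper:

```python
import math
import random

def recursive_latent_listen(frames, d_latent=4, seed=0):
    """Toy sketch of FLAIR-style think-while-listening.

    At every incoming speech frame, the latent embedding produced at the
    previous step is fed back in alongside the new acoustic features, so
    internal state keeps evolving while the user is still speaking. The
    fixed random projection below stands in for a trained model; the real
    system's update rule is not specified here.
    """
    rng = random.Random(seed)
    d_in = len(frames[0]) + d_latent
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_latent)]
    z = [0.0] * d_latent                      # initial latent state
    trajectory = []
    for x in frames:                          # strictly causal: past frames only
        inp = list(x) + z                     # inject the prior step's latent
        z = [math.tanh(sum(w * v for w, v in zip(row, inp))) for row in W]
        trajectory.append(z)                  # one "thought" per frame
    return trajectory

frames = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4]]   # 3 frames of 2-dim features
traj = recursive_latent_listen(frames)
print(len(traj), len(traj[0]))  # 3 4
```

Because each latent depends only on the previous latent and the current frame, the update costs one fixed-size step per frame regardless of how long the user speaks, which is where the zero-added-latency claim comes from.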
Load-bearing premise
Recursive updates to latent embeddings can meaningfully represent and advance internal cognitive processing without explicit reasoning supervision or causing drift in the dialogue state.
What would settle it
If extended listening segments produce measurably less coherent or less relevant responses than a non-latent baseline, or if embedding drift degrades dialogue-state consistency over long turns, the latent reasoning mechanism would be shown ineffective.
Original abstract
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FLAIR, a full-duplex latent and internal reasoning method for spoken dialogue models. During the user's speaking phase, it recursively injects the prior-step latent embedding to enable continuous, causal latent reasoning without added latency. Training uses a standard ELBO objective with teacher forcing to avoid needing explicit reasoning annotations. Experiments are claimed to show competitive results on speech benchmarks and full-duplex interaction metrics.
Significance. If the recursive latent updates can be shown to produce advancing internal cognition rather than generic state evolution, the approach would offer a latency-free way to incorporate think-while-listening behavior into full-duplex systems, potentially improving response coherence in conversational speech models.
major comments (3)
- [§3] §3 (Method), ELBO formulation: the objective is the standard variational lower bound on the joint speech-response distribution under teacher forcing; no derivation or auxiliary loss is shown that would force the recursive latent trajectory to encode progressive cognitive steps rather than merely correlating with acoustics.
- [Experiments] Experiments section: competitive results are asserted on speech benchmarks and full-duplex metrics, yet no numerical values, non-recursive baselines, or ablations isolating the recursive latent injection appear; without these it is impossible to verify that the claimed continuous reasoning improves downstream response quality.
- [§4.1] §4.1 (full-duplex evaluation): the central claim that recursive embedding injection produces genuine internal cognition without drift is load-bearing, but no analysis (e.g., latent trajectory visualization or response-quality correlation over utterance length) addresses the risk that multi-second recursion simply accumulates uninformative state.
minor comments (2)
- [Abstract] Abstract: replace the phrase 'competitive results' with at least one concrete metric and baseline comparison.
- [§3] Notation: define the exact recursive update equation for the latent embedding (including any gating or normalization) to permit reproduction.
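The notation comment above asks for the exact recursive update. One concrete parameterization the revision could pin down, written here purely as an illustration (the gate $g_t$, the acoustic feature $a_t$, and the LayerNorm placement are all hypothetical, not from the paper):

```latex
g_t = \sigma\!\left(W_g\,[\,z_{t-1};\,a_t\,]\right), \qquad
z_t = \mathrm{LayerNorm}\!\left(g_t \odot f_\theta(z_{t-1}, a_t) + (1 - g_t) \odot z_{t-1}\right)
```

Whatever form the authors actually use, spelling out the gating and normalization is what makes the recursion reproducible and lets readers reason about fixed points and drift.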
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below, clarifying the design choices in FLAIR and committing to revisions where additional evidence or exposition is warranted.
Point-by-point responses
-
Referee: [§3] §3 (Method), ELBO formulation: the objective is the standard variational lower bound on the joint speech-response distribution under teacher forcing; no derivation or auxiliary loss is shown that would force the recursive latent trajectory to encode progressive cognitive steps rather than merely correlating with acoustics.
Authors: We agree that the training objective is the standard ELBO under teacher forcing. The progressive nature of the internal cognition arises from the architectural choice of recursively injecting the prior-step latent embedding into the next step during the listening phase; this creates a causal chain in which each latent must support both immediate acoustic modeling and the evolving state needed for eventual response generation. Because the ELBO is optimized end-to-end for accurate response prediction, the latents are incentivized to carry forward task-relevant information rather than generic acoustics. We will revise §3 to include an expanded discussion of this implicit pressure toward progressive state evolution and will add a short derivation sketch showing how the recursive conditioning affects the variational posterior. revision: partial
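The rebuttal appeals to the standard ELBO under teacher forcing with recursive latent conditioning. The paper's exact objective is not reproduced in this review; one form consistent with that description, with all notation hypothetical ($x$ incoming speech, $y$ response, $z_t$ the recursive latent), would be:

```latex
\log p_\theta(y \mid x)
\;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z_{1:T} \mid x, y)}\!\left[\log p_\theta(y \mid z_{1:T}, x)\right]}_{\text{teacher-forced reconstruction}}
\;-\;
\sum_{t=1}^{T}
\mathrm{KL}\!\left(
q_\phi(z_t \mid z_{t-1}, x_{\le t}, y)
\,\middle\|\,
p_\theta(z_t \mid z_{t-1}, x_{\le t})
\right)
```

Under this reading, strict causality enters through the prior $p_\theta(z_t \mid z_{t-1}, x_{\le t})$, which sees only the previous latent and past audio; the pressure toward task-relevant latents comes solely from the reconstruction term, which is exactly why the referee asks for evidence that this pressure yields progressive reasoning rather than generic state.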
-
Referee: [Experiments] Experiments section: competitive results are asserted on speech benchmarks and full-duplex metrics, yet no numerical values, non-recursive baselines, or ablations isolating the recursive latent injection appear; without these it is impossible to verify that the claimed continuous reasoning improves downstream response quality.
Authors: We acknowledge that the current manuscript presents only qualitative claims of competitiveness. In the revised version we will insert quantitative tables reporting exact metrics on the speech benchmarks, direct comparisons against non-recursive (i.e., single-pass latent) baselines, and ablation studies that isolate the effect of recursive latent injection on response quality and full-duplex metrics. revision: yes
-
Referee: [§4.1] §4.1 (full-duplex evaluation): the central claim that recursive embedding injection produces genuine internal cognition without drift is load-bearing, but no analysis (e.g., latent trajectory visualization or response-quality correlation over utterance length) addresses the risk that multi-second recursion simply accumulates uninformative state.
Authors: We recognize that demonstrating the absence of drift is essential. The revised manuscript will include (i) t-SNE or PCA visualizations of latent trajectories over multi-second utterances and (ii) plots correlating response quality with utterance length for both FLAIR and non-recursive controls. These additions will directly test whether the recursive state remains informative rather than collapsing. revision: yes
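The drift check the authors promise can be approximated with a simple diagnostic: cosine similarity between successive latents. The function and the toy trajectories below are hypothetical; a state frozen at a fixed point shows similarity pinned at 1.0, while an informative state keeps moving:

```python
import math

def successive_cosine(trajectory):
    """Drift diagnostic sketch: cosine similarity between successive latent
    embeddings. Similarities locked at 1.0 indicate a collapsed recursion
    that is accumulating no new information; values below 1.0 indicate the
    state is still evolving as evidence arrives. Illustrative only."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    return [cos(trajectory[t - 1], trajectory[t]) for t in range(1, len(trajectory))]

# Toy trajectories: one that keeps evolving vs. one that has collapsed.
moving = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
stuck = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(successive_cosine(moving))  # ~0.8 and ~0.6: state still evolving
print(successive_cosine(stuck))   # ~1.0 throughout: uninformative fixed point
```

Plotting this statistic against utterance length, alongside the promised PCA/t-SNE views, would directly test whether multi-second recursion stays informative.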
Circularity Check
No significant circularity in the FLAIR derivation chain
full rationale
The paper defines FLAIR via recursive injection of prior-step latent embeddings during user speech, optimized under a standard ELBO objective with teacher forcing. This construction does not equate the claimed continuous internal reasoning to its own inputs by definition, nor does it rename a fitted parameter as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to force the result; empirical validation on speech benchmarks supplies independent content. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- (domain assumption) Latent embeddings from a speech model can represent and propagate internal cognitive states across time steps.
invented entities (1)
- FLAIR latent reasoning process (no independent evidence)
Reference graph
Works this paper leans on
- [1] Arora, S., Tian, J., Futami, H., Shi, J., Kashiwagi, Y., Tsunoo, E., and Watanabe, S. Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems. arXiv preprint arXiv:2510.02066.
- [2] Bakhturina, E., Lavrukhin, V., Ginsburg, B., and Zhang, Y. Hi-Fi multi-speaker English TTS dataset. arXiv preprint arXiv:2104.01497.
- [3] Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544, 2013.
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- [5] Chen, C., Hu, K., Yang, C.-H. H., Pasad, A., Casanova, E., Wang, W., Fu, S.-W., Li, J., Chen, Z., Balam, J., et al. Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions. In Second Conference on Language Modeling. Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R. T., and Li, H. VoiceBench: Benchmarking…
- [6] Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. SHANKS: Simultaneous hearing and thinking for spoken language models. arXiv preprint arXiv:2510.06917, 2025. Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. STITCH: Simultaneous thinking an…
- [7] Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759.
- [8] Cui, W., Yu, D., Jiao, X., Meng, Z., Zhang, G., Wang, Q., Guo, Y., and King, I. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.
- [9] Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., and Zeghidour, N. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.
- [10] Ding, D., Ju, Z., Leng, Y., Liu, S., Liu, T., Shang, Z., Shen, K., Song, W., Tan, X., Tang, H., et al. Kimi-Audio technical report. arXiv preprint arXiv:2504.18425.
- [11] Du, Z., Wang, Y., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y., Gao, C., Wang, H., et al. CosyVoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117.
- [12] Faisal, F., Keshava, S., Alam, M. M. I., and Anastasopoulos, A. SD-QA: Spoken dialectal question answering for the real world. In Findings of the Association for Computational Linguistics: EMNLP 2021.
- [13] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- [14] He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019, pp. 6381–6385. IEEE, 2019.
- [15] Hu, K., Hosseini-Asl, E., Chen, C., Casanova, E., Ghosh, S., Żelasko, P., Chen, Z., Li, J., Balam, J., and Ginsburg, B. Efficient and direct duplex modeling for speech-to-speech language model. arXiv preprint arXiv:2505.15670.
- [16] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- [17] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- [18] Koluguri, N. R., Sekoyan, M., Zelenfroynd, G., Meister, S., Ding, S., Kostandian, S., Huang, H., Karpov, N., Balam, J., Lavrukhin, V., et al. Granary: Speech recognition and translation dataset in 25 European languages. arXiv preprint arXiv:2505.13404.
- [19] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., Cook, J., et al. NeMo: a toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577.
- [20] Li, T., Liu, J., Zhang, T., Fang, Y., Pan, D., Wang, M., Liang, Z., Li, Z., Lin, M., Dong, G., et al. Baichuan-Audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239.
- [21] Lin, G.-T., Lian, J., Li, T., Wang, Q., Anumanchipalli, G., Liu, A. H., and Lee, H.-y. Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721.
- [22] Moritz, N., Hori, T., and Le, J. Streaming automatic speech recognition with the transformer model. In ICASSP 2020, pp. 6074–6078. IEEE, 2020.
- [23]
- [24] Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 1864–1874, 2022.
- [25] Noroozi, V., Chen, Z., Majumdar, S., Huang, S., Balam, J., and Ginsburg, B. Instruction data generation and unsupervised adaptation for speech language models. arXiv preprint arXiv:2406.12946, 2024. Noroozi, V., Majumdar, S., Kumar, A., Balam, J., and Ginsburg, B. Stateful conformer with cache-based inference for streaming automatic speech recogniti…
- [26] Robust speech recognition via large-scale weak supervision. URL https://arxiv.org/abs/2212.04356. Reece, A., Cooney, G., Bull, P., Chung, C., Dawson, B., Fitzpatrick, C., Glazer, T., Knox, D., Liebscher, A., and Marin, S. The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances, 9(13):eadf3197.
- [27] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., and Saruwatari, H. UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv preprint arXiv:2204.02152.
- [28] Saunshi, N., Dikkala, N., Li, Z., Kumar, S., and Reddi, S. J. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24–28, 2025.
- [29] Shalev, Y., Feder, A., and Goldstein, A. Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning. arXiv preprint arXiv:2406.13858.
- [30] Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., and He, Y. CODI: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074.
- [31] Shih, Y.-J., Raj, D., Wu, C., Zhou, W., Bong, S., Gaur, Y., Mahadeokar, J., Kalinli, O., and Seltzer, M. Can speech LLMs think while listening? arXiv preprint arXiv:2510.07497.
- [32] Snyder, D., Chen, G., and Povey, D. MUSAN: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
- [33] Veluri, B., Peloquin, B. N., Yu, B., Gong, H., and Gollakota, S. Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents. arXiv preprint arXiv:2409.15594.
- [34] Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., and Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2717–2739, 2023. Wang, D., Wu, J., Li, J., Yang, D., Chen, X., Zhang, T.…
- [35] Wang, P., Lu, S., Tang, Y., Yan, S., Xia, W., and Xiong, Y. A full-duplex speech dialogue scheme based on large language model. Advances in Neural Information Processing Systems, 37:13372–13403, 2024. Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning…
- [36] Wu, D., Zhang, H., Chen, C., Zhang, T., Tian, F., Yang, X., Yu, G., Liu, H., Hou, N., Hu, Y., et al. Chronological thinking in full-duplex spoken dialogue language models. arXiv preprint arXiv:2510.05150.
- [37] Xu, Y., Guo, X., Zeng, Z., and Miao, C. SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. Annual Meeting of the Association for Computational Linguistics, 2025. doi: 10.48550/arXiv.2502.12134. Xu, Y., Guo, X., Zeng, Z., and Miao,…
- [38] Yu, W., Wang, S., Yang, X., Chen, X., Tian, X., Zhang, J., Sun, G., Lu, L., Wang, Y., and Zhang, C. SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversation. arXiv preprint arXiv:2505.17060.
- [39] Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882.
- [40] Zeng, A., Du, Z., Liu, M., Wang, K., Jiang, S., Zhao, L., Dong, Y., and Tang, J. GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612.
- [41] Zeng, B., Song, S., Huang, S., Wang, Y., Li, H., He, Z., Wang, X., Li, Z., and Lin, Z. Pretraining language models to ponder in continuous space. arXiv preprint arXiv:2505.20674.
- [42] Zhang, C., Bütepage, J., Kjellström, H., and Mandt, S. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026.
- [43] Zhang, H., Li, W., Chen, R., Kothapally, V., Yu, M., and Yu, D. LLM-enhanced dialogue management for full-duplex spoken dialogue systems. arXiv preprint arXiv:2502.14145, 2025. Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space. arXiv pr…
appendix excerpts (4)
- For voice cloning, we construct a massive prompt bank by aggregating all speech segments with a duration of 5–10 seconds from the LibriTTS (Zen et al., 2019), YODAS (Li et al., 2023), and Hi-Fi TTS (Bakhturina et al.) corpora. In total, over 100k distinct speech segments spanning more than 20k unique speakers are utilized to ensure high acoustic variance in the generated training data. Speech continuation data (530K hours): leveraging massive text pre-training corpora (Su et al., 2025), we construct synthetic pseudo-dialogues by alternately assigning sentences from cont…
- … and MUSAN (Snyder et al., 2015). During the training phase, these noise signals are dynamically superimposed onto the user speech stream with an injection probability of 50%. The mixing intensity is varied by sampling a Signal-to-Noise Ratio (SNR) uniformly from the range of 0 dB to 60 dB. The pipeline for dataset construction is released in (Artificial A…
- All models are trained on 64 A800 (80G) GPUs. The LLM backbone is initialized from Qwen2.5-7B-Instruct (Team, 2024). A Parakeet-based encoder (600M) is employed as the speech encoder, which features a causal convolutional context to support streaming input and a Transformer-based modality adapter with 1024 hidden units to align audio features w…
- Table 5. Case study for PT model, FLAIR, and LLM (Qwen2.5-7B-Ins). User query: Can you tell me a very easy way to clean a shower head? Response by PT model: A easy and effective method you can try at home is using common household items. Response by FLAIR: Sure! A simple way to clean a shower head is to remove it and soak it in white vinegar for about an hour…