The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
Pith reviewed 2026-05-15 08:32 UTC · model grok-4.3
The pith
Spoken dialogue models can perform continuous internal reasoning while listening by recursively updating latent embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that internal cognition in spoken dialogue can be modeled as recursive updates to latent embeddings: each step takes the embedding produced in the prior step as input, so the model can keep reasoning while speech continues to arrive. The mechanism preserves strict causality, adds no latency, and is trained with an ELBO-based objective that needs no separate reasoning labels.
What carries the argument
FLAIR (Full-duplex LAtent and Internal Reasoning), which recursively feeds the latent embedding from the previous time step back into the model to sustain ongoing internal processing during speech perception.
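The recursive feedback loop can be sketched in a few lines. Everything below (the function name, dimensions, the tanh update, the random projection standing in for a trained model) is illustrative, not taken from the paper:

```python
import math
import random

def recursive_latent_listen(frames, d_latent=4, seed=0):
    """Toy sketch of FLAIR-style think-while-listening.

    At every incoming speech frame, the latent embedding produced at the
    previous step is fed back in alongside the new acoustic features, so
    internal state keeps evolving while the user is still speaking. The
    fixed random projection below stands in for a trained model; the real
    system's update rule is not specified here.
    """
    rng = random.Random(seed)
    d_in = len(frames[0]) + d_latent
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_latent)]
    z = [0.0] * d_latent                      # initial latent state
    trajectory = []
    for x in frames:                          # strictly causal: past frames only
        inp = list(x) + z                     # inject the prior step's latent
        z = [math.tanh(sum(w * v for w, v in zip(row, inp))) for row in W]
        trajectory.append(z)                  # one "thought" per frame
    return trajectory

frames = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4]]   # 3 frames of 2-dim features
traj = recursive_latent_listen(frames)
print(len(traj), len(traj[0]))  # 3 4
```

Because each latent depends only on the previous latent and the current frame, the update costs one fixed-size step per frame regardless of how long the user speaks, which is where the zero-added-latency claim comes from.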
Load-bearing premise
Recursive updates to latent embeddings can meaningfully represent and advance internal cognitive processing without explicit reasoning supervision or causing drift in the dialogue state.
What would settle it
If extended listening segments produce measurably less coherent or less relevant responses than a non-latent baseline, or if embedding drift degrades dialogue-state consistency over long turns, the latent reasoning mechanism would be shown ineffective.
Original abstract
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FLAIR, a full-duplex latent and internal reasoning method for spoken dialogue models. During the user's speaking phase, it recursively injects the prior-step latent embedding to enable continuous, causal latent reasoning without added latency. Training uses a standard ELBO objective with teacher forcing to avoid needing explicit reasoning annotations. Experiments are claimed to show competitive results on speech benchmarks and full-duplex interaction metrics.
Significance. If the recursive latent updates can be shown to produce advancing internal cognition rather than generic state evolution, the approach would offer a latency-free way to incorporate think-while-listening behavior into full-duplex systems, potentially improving response coherence in conversational speech models.
major comments (3)
- [§3] §3 (Method), ELBO formulation: the objective is the standard variational lower bound on the joint speech-response distribution under teacher forcing; no derivation or auxiliary loss is shown that would force the recursive latent trajectory to encode progressive cognitive steps rather than merely correlating with acoustics.
- [Experiments] Experiments section: competitive results are asserted on speech benchmarks and full-duplex metrics, yet no numerical values, non-recursive baselines, or ablations isolating the recursive latent injection appear; without these it is impossible to verify that the claimed continuous reasoning improves downstream response quality.
- [§4.1] §4.1 (full-duplex evaluation): the central claim that recursive embedding injection produces genuine internal cognition without drift is load-bearing, but no analysis (e.g., latent trajectory visualization or response-quality correlation over utterance length) addresses the risk that multi-second recursion simply accumulates uninformative state.
minor comments (2)
- [Abstract] Abstract: replace the phrase 'competitive results' with at least one concrete metric and baseline comparison.
- [§3] Notation: define the exact recursive update equation for the latent embedding (including any gating or normalization) to permit reproduction.
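The notation comment above asks for the exact recursive update. One concrete parameterization the revision could pin down, written here purely as an illustration (the gate $g_t$, the acoustic feature $a_t$, and the LayerNorm placement are all hypothetical, not from the paper):

```latex
g_t = \sigma\!\left(W_g\,[\,z_{t-1};\,a_t\,]\right), \qquad
z_t = \mathrm{LayerNorm}\!\left(g_t \odot f_\theta(z_{t-1}, a_t) + (1 - g_t) \odot z_{t-1}\right)
```

Whatever form the authors actually use, spelling out the gating and normalization is what makes the recursion reproducible and lets readers reason about fixed points and drift.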
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below, clarifying the design choices in FLAIR and committing to revisions where additional evidence or exposition is warranted.
Point-by-point responses
-
Referee: [§3] §3 (Method), ELBO formulation: the objective is the standard variational lower bound on the joint speech-response distribution under teacher forcing; no derivation or auxiliary loss is shown that would force the recursive latent trajectory to encode progressive cognitive steps rather than merely correlating with acoustics.
Authors: We agree that the training objective is the standard ELBO under teacher forcing. The progressive nature of the internal cognition arises from the architectural choice of recursively injecting the prior-step latent embedding into the next step during the listening phase; this creates a causal chain in which each latent must support both immediate acoustic modeling and the evolving state needed for eventual response generation. Because the ELBO is optimized end-to-end for accurate response prediction, the latents are incentivized to carry forward task-relevant information rather than generic acoustics. We will revise §3 to include an expanded discussion of this implicit pressure toward progressive state evolution and will add a short derivation sketch showing how the recursive conditioning affects the variational posterior. revision: partial
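The rebuttal appeals to the standard ELBO under teacher forcing with recursive latent conditioning. The paper's exact objective is not reproduced in this review; one form consistent with that description, with all notation hypothetical ($x$ incoming speech, $y$ response, $z_t$ the recursive latent), would be:

```latex
\log p_\theta(y \mid x)
\;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z_{1:T} \mid x, y)}\!\left[\log p_\theta(y \mid z_{1:T}, x)\right]}_{\text{teacher-forced reconstruction}}
\;-\;
\sum_{t=1}^{T}
\mathrm{KL}\!\left(
q_\phi(z_t \mid z_{t-1}, x_{\le t}, y)
\,\middle\|\,
p_\theta(z_t \mid z_{t-1}, x_{\le t})
\right)
```

Under this reading, strict causality enters through the prior $p_\theta(z_t \mid z_{t-1}, x_{\le t})$, which sees only the previous latent and past audio; the pressure toward task-relevant latents comes solely from the reconstruction term, which is exactly why the referee asks for evidence that this pressure yields progressive reasoning rather than generic state.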
-
Referee: [Experiments] Experiments section: competitive results are asserted on speech benchmarks and full-duplex metrics, yet no numerical values, non-recursive baselines, or ablations isolating the recursive latent injection appear; without these it is impossible to verify that the claimed continuous reasoning improves downstream response quality.
Authors: We acknowledge that the current manuscript presents only qualitative claims of competitiveness. In the revised version we will insert quantitative tables reporting exact metrics on the speech benchmarks, direct comparisons against non-recursive (i.e., single-pass latent) baselines, and ablation studies that isolate the effect of recursive latent injection on response quality and full-duplex metrics. revision: yes
-
Referee: [§4.1] §4.1 (full-duplex evaluation): the central claim that recursive embedding injection produces genuine internal cognition without drift is load-bearing, but no analysis (e.g., latent trajectory visualization or response-quality correlation over utterance length) addresses the risk that multi-second recursion simply accumulates uninformative state.
Authors: We recognize that demonstrating the absence of drift is essential. The revised manuscript will include (i) t-SNE or PCA visualizations of latent trajectories over multi-second utterances and (ii) plots correlating response quality with utterance length for both FLAIR and non-recursive controls. These additions will directly test whether the recursive state remains informative rather than collapsing. revision: yes
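The drift check the authors promise can be approximated with a simple diagnostic: cosine similarity between successive latents. The function and the toy trajectories below are hypothetical; a state frozen at a fixed point shows similarity pinned at 1.0, while an informative state keeps moving:

```python
import math

def successive_cosine(trajectory):
    """Drift diagnostic sketch: cosine similarity between successive latent
    embeddings. Similarities locked at 1.0 indicate a collapsed recursion
    that is accumulating no new information; values below 1.0 indicate the
    state is still evolving as evidence arrives. Illustrative only."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    return [cos(trajectory[t - 1], trajectory[t]) for t in range(1, len(trajectory))]

# Toy trajectories: one that keeps evolving vs. one that has collapsed.
moving = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
stuck = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(successive_cosine(moving))  # ~0.8 and ~0.6: state still evolving
print(successive_cosine(stuck))   # ~1.0 throughout: uninformative fixed point
```

Plotting this statistic against utterance length, alongside the promised PCA/t-SNE views, would directly test whether multi-second recursion stays informative.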
Circularity Check
No significant circularity in the FLAIR derivation chain
full rationale
The paper defines FLAIR via recursive injection of prior-step latent embeddings during user speech, optimized under a standard ELBO objective with teacher forcing. This construction does not equate the claimed continuous internal reasoning to its own inputs by definition, nor does it rename a fitted parameter as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to force the result; empirical validation on speech benchmarks supplies independent content. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- (domain assumption) Latent embeddings from a speech model can represent and propagate internal cognitive states across time steps.
invented entities (1)
- FLAIR latent reasoning process (no independent evidence)
Reference graph
Works this paper leans on
- [1] Arora, S., Tian, J., Futami, H., Shi, J., Kashiwagi, Y., Tsunoo, E., and Watanabe, S. Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems. arXiv preprint arXiv:2510.02066.
- [2] Bakhturina, E., Lavrukhin, V., Ginsburg, B., and Zhang, Y. Hi-Fi multi-speaker English TTS dataset. arXiv preprint arXiv:2104.01497.
- [3] Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544, 2013.
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- [5] Chen, C., Hu, K., Yang, C.-H. H., Pasad, A., Casanova, E., Wang, W., Fu, S.-W., Li, J., Chen, Z., Balam, J., et al. Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions. In Second Conference on Language Modeling. Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R. T., and Li, H. VoiceBench: Benchmarking…
- [6] Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. SHANKS: Simultaneous hearing and thinking for spoken language models. arXiv preprint arXiv:2510.06917, 2025. Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. STITCH: Simultaneous thinking an…
- [7] Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759.
- [8] Cui, W., Yu, D., Jiao, X., Meng, Z., Zhang, G., Wang, Q., Guo, Y., and King, I. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.
- [9] Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., and Zeghidour, N. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.
- [10] Ding, D., Ju, Z., Leng, Y., Liu, S., Liu, T., Shang, Z., Shen, K., Song, W., Tan, X., Tang, H., et al. Kimi-Audio technical report. arXiv preprint arXiv:2504.18425.
- [11] Du, Z., Wang, Y., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y., Gao, C., Wang, H., et al. CosyVoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117.
- [12] Faisal, F., Keshava, S., Alam, M. M. I., and Anastasopoulos, A. SD-QA: Spoken dialectal question answering for the real world. In Findings of the Association for Computational Linguistics: EMNLP 2021.
- [13] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
- [14] He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019, pp. 6381–6385. IEEE, 2019.
- [15] Hu, K., Hosseini-Asl, E., Chen, C., Casanova, E., Ghosh, S., Żelasko, P., Chen, Z., Li, J., Balam, J., and Ginsburg, B. Efficient and direct duplex modeling for speech-to-speech language model. arXiv preprint arXiv:2505.15670.
- [16] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- [17] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- [18] Koluguri, N. R., Sekoyan, M., Zelenfroynd, G., Meister, S., Ding, S., Kostandian, S., Huang, H., Karpov, N., Balam, J., Lavrukhin, V., et al. Granary: Speech recognition and translation dataset in 25 European languages. arXiv preprint arXiv:2505.13404.
- [19] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., Cook, J., et al. NeMo: a toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577.
- [20] Li, T., Liu, J., Zhang, T., Fang, Y., Pan, D., Wang, M., Liang, Z., Li, Z., Lin, M., Dong, G., et al. Baichuan-Audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239.
- [21] Lin, G.-T., Lian, J., Li, T., Wang, Q., Anumanchipalli, G., Liu, A. H., and Lee, H.-y. Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721.
- [22] Moritz, N., Hori, T., and Le, J. Streaming automatic speech recognition with the transformer model. In ICASSP 2020, pp. 6074–6078. IEEE, 2020.
- [23]
- [24] Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 1864–1874, 2022.
- [25] Noroozi, V., Chen, Z., Majumdar, S., Huang, S., Balam, J., and Ginsburg, B. Instruction data generation and unsupervised adaptation for speech language models. arXiv preprint arXiv:2406.12946, 2024. Noroozi, V., Majumdar, S., Kumar, A., Balam, J., and Ginsburg, B. Stateful conformer with cache-based inference for streaming automatic speech recogniti…
- [26] Robust speech recognition via large-scale weak supervision. URL https://arxiv.org/abs/2212.04356. Reece, A., Cooney, G., Bull, P., Chung, C., Dawson, B., Fitzpatrick, C., Glazer, T., Knox, D., Liebscher, A., and Marin, S. The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances, 9(13):eadf3197.
- [27] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., and Saruwatari, H. UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv preprint arXiv:2204.02152.
- [28] Saunshi, N., Dikkala, N., Li, Z., Kumar, S., and Reddi, S. J. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24–28, 2025.
- [29] Shalev, Y., Feder, A., and Goldstein, A. Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning. arXiv preprint arXiv:2406.13858.
- [30] Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., and He, Y. CODI: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074.
- [31] Shih, Y.-J., Raj, D., Wu, C., Zhou, W., Bong, S., Gaur, Y., Mahadeokar, J., Kalinli, O., and Seltzer, M. Can speech LLMs think while listening? arXiv preprint arXiv:2510.07497.
- [32] Snyder, D., Chen, G., and Povey, D. MUSAN: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
- [33] Veluri, B., Peloquin, B. N., Yu, B., Gong, H., and Gollakota, S. Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents. arXiv preprint arXiv:2409.15594.
- [34] Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., and Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2717–2739, 2023. Wang, D., Wu, J., Li, J., Yang, D., Chen, X., Zhang, T.…
- [35] Wang, P., Lu, S., Tang, Y., Yan, S., Xia, W., and Xiong, Y. A full-duplex speech dialogue scheme based on large language model. Advances in Neural Information Processing Systems, 37:13372–13403, 2024. Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning…
- [36] Wu, D., Zhang, H., Chen, C., Zhang, T., Tian, F., Yang, X., Yu, G., Liu, H., Hou, N., Hu, Y., et al. Chronological thinking in full-duplex spoken dialogue language models. arXiv preprint arXiv:2510.05150.
- [37] Xu, Y., Guo, X., Zeng, Z., and Miao, C. SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. Annual Meeting of the Association for Computational Linguistics, 2025. doi: 10.48550/arXiv.2502.12134. Xu, Y., Guo, X., Zeng, Z., and Miao,…
- [38] Yu, W., Wang, S., Yang, X., Chen, X., Tian, X., Zhang, J., Sun, G., Lu, L., Wang, Y., and Zhang, C. SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversation. arXiv preprint arXiv:2505.17060.
- [39] Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882.
- [40] Zeng, A., Du, Z., Liu, M., Wang, K., Jiang, S., Zhao, L., Dong, Y., and Tang, J. GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612.
- [41] Zeng, B., Song, S., Huang, S., Wang, Y., Li, H., He, Z., Wang, X., Li, Z., and Lin, Z. Pretraining language models to ponder in continuous space. arXiv preprint arXiv:2505.20674.
- [42] Zhang, C., Bütepage, J., Kjellström, H., and Mandt, S. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026.
- [43] Zhang, H., Li, W., Chen, R., Kothapally, V., Yu, M., and Yu, D. LLM-enhanced dialogue management for full-duplex spoken dialogue systems. arXiv preprint arXiv:2502.14145, 2025. Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft thinking: Unlocking the reasoning potential of LLMs in continuous concept space. arXiv pr…
appendix excerpts (4)
- For voice cloning, we construct a massive prompt bank by aggregating all speech segments with a duration of 5–10 seconds from the LibriTTS (Zen et al., 2019), YODAS (Li et al., 2023), and Hi-Fi TTS (Bakhturina et al.) corpora. In total, over 100k distinct speech segments spanning more than 20k unique speakers are utilized to ensure high acoustic variance in the generated training data. Speech continuation data (530K hours): leveraging massive text pre-training corpora (Su et al., 2025), we construct synthetic pseudo-dialogues by alternately assigning sentences from cont…
- … and MUSAN (Snyder et al., 2015). During the training phase, these noise signals are dynamically superimposed onto the user speech stream with an injection probability of 50%. The mixing intensity is varied by sampling a Signal-to-Noise Ratio (SNR) uniformly from the range of 0 dB to 60 dB. The pipeline for dataset construction is released in (Artificial A…
- All models are trained on 64 A800 (80G) GPUs. The LLM backbone is initialized from Qwen2.5-7B-Instruct (Team, 2024). A Parakeet-based encoder (600M) is employed as the speech encoder, which features a causal convolutional context to support streaming input and a Transformer-based modality adapter with 1024 hidden units to align audio features w…
- Table 5. Case study for PT model, FLAIR, and LLM (Qwen2.5-7B-Ins). User query: Can you tell me a very easy way to clean a shower head? Response by PT model: A easy and effective method you can try at home is using common household items. Response by FLAIR: Sure! A simple way to clean a shower head is to remove it and soak it in white vinegar for about an hour…