The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Chen Chen; Donghang Wu; Eng Siong Chng; Hexin Liu; Tianyu Zhang; Yoshua Bengio; Yuxin Li

arxiv: 2603.17837 · v4 · pith:R4Y3VJSPnew · submitted 2026-03-18 · 📡 eess.AS · cs.CL

Donghang Wu, Tianyu Zhang, Yuxin Li, Hexin Liu, Chen Chen, Eng Siong Chng, Yoshua Bengio

Pith reviewed 2026-05-21 10:39 UTC · model grok-4.3

keywords full-duplex spoken dialogue · latent reasoning · internal cognition · spoken dialogue models · ELBO objective · teacher forcing · speech perception · causal reasoning

The pith

FLAIR enables spoken dialogue models to conduct latent reasoning while perceiving speech by recursively reusing embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FLAIR to model internal cognition in full-duplex spoken dialogue systems. FLAIR performs latent thinking simultaneously with speech perception by recursively feeding the previous latent embedding output into the next processing step. This maintains strict causality and adds no latency. Training relies on an Evidence Lower Bound objective with teacher forcing to learn without explicit reasoning annotations. Experiments report competitive results on speech benchmarks and full-duplex interaction metrics.

Core claim

By recursively feeding the latent embedding output from the previous step into the next step during the user's speaking phase, FLAIR conducts continuous internal reasoning simultaneously with speech perception. This adheres strictly to causality without introducing additional latency. An Evidence Lower Bound-based objective supports efficient supervised finetuning via teacher forcing and circumvents the need for explicit reasoning annotations.
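
The recursion in this claim can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_step` stands in for one forward pass of the model, and the names `toy_step`, `think_while_listening`, `DIM`, and the fixed mixing weights are all hypothetical. It shows only the data flow: one latent update per incoming audio frame, with the previous output embedding fed back as input.

```python
import math

DIM = 3  # toy embedding width (an assumption, not from the paper)

def toy_step(audio_frame, prev_latent):
    """Stand-in for one model forward pass: mixes the current audio
    frame with the latent embedding produced at the previous step."""
    return [math.tanh(0.6 * a + 0.4 * z) for a, z in zip(audio_frame, prev_latent)]

def think_while_listening(audio_frames):
    """One latent update per incoming frame; each output embedding is
    fed back as an input to the next step."""
    latent = [0.0] * DIM
    trajectory = []
    for frame in audio_frames:  # frames arrive in streaming order
        latent = toy_step(frame, latent)
        trajectory.append(latent)
    return trajectory

frames = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [0.4, 0.4, 0.4]]
latents = think_while_listening(frames)

# Causality check: changing the last frame must not alter earlier latents.
changed = think_while_listening(frames[:2] + [[9.0, 9.0, 9.0]])
print(len(latents) == len(frames), latents[:2] == changed[:2])  # True True
```

Because the loop takes exactly one step per frame and each step reads only the current frame and the previous latent, the sketch adds no extra steps and cannot depend on future audio. Whether the real model behaves the same way depends on details the summary does not give.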

What carries the argument

Recursive reuse of latent embeddings fed back as input to the next step while processing incoming speech, optimized via an ELBO objective under teacher forcing.
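
For orientation, the generic evidence lower bound for a conditional latent-variable model is shown below. This is the standard variational bound, not necessarily the exact objective used in the paper. The notation is assumed for illustration: $x$ is the input speech, $y$ the target response text, $z$ the latent reasoning embeddings, $q_\phi$ an approximate posterior, and $p_\theta$ the model.

```latex
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid x, z)\right]
  \;-\;
  \mathrm{KL}\!\left(q_\phi(z \mid x, y)\,\middle\|\,p_\theta(z \mid x)\right)
```

The first term rewards latents that help predict the response; the second keeps the posterior over latents close to what can be inferred from the speech alone. A bound of this shape needs no supervision on $z$ itself, which is consistent with the claim that no explicit reasoning annotations are required.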

If this is right

Continuous reasoning occurs during the user's speaking phase without adding latency or breaking causality.
The model handles conversational dynamics more robustly in full-duplex settings.
Competitive performance is reached on a range of speech benchmarks.
Full-duplex interaction metrics remain competitive.
Reasoning proceeds without requiring explicit annotations or separate post-hoc generation steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same recursive latent-state mechanism could extend to other streaming inputs such as video or sensor data for concurrent internal processing.
Over longer dialogues the maintained latent state might implicitly track topic coherence or user intent shifts.
Combining this latent mode with occasional explicit reasoning triggers could address cases where deeper deliberation is required.

Load-bearing premise

That recursive reuse of latent embeddings combined with an ELBO objective under teacher forcing produces meaningful internal cognitive representations that improve dialogue quality without explicit reasoning annotations or post-hoc generation.

What would settle it

An ablation experiment that disables the recursive latent embedding feedback and measures whether full-duplex interaction metrics or dialogue quality drop compared to the full model would settle the claim.
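
The shape of such an ablation can be sketched on a toy model. Everything here is hypothetical (the `toy_step` function, the `use_feedback` flag, and the `mean_abs_gap` measure); a real ablation would retrain or re-evaluate the actual system and compare dialogue-quality and full-duplex metrics rather than raw latent values.

```python
import math

def toy_step(audio_frame, prev_latent):
    """Stand-in for one model forward pass on scalar inputs."""
    return math.tanh(0.7 * audio_frame + 0.3 * prev_latent)

def rollout(audio_frames, use_feedback=True):
    """Run the toy model over a stream, optionally cutting the recursion."""
    latent, out = 0.0, []
    for frame in audio_frames:
        fed_back = latent if use_feedback else 0.0  # ablation: drop the feedback path
        latent = toy_step(frame, fed_back)
        out.append(latent)
    return out

def mean_abs_gap(xs, ys):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

stream = [0.2, -0.1, 0.5, 0.9, -0.4]
full = rollout(stream, use_feedback=True)
ablated = rollout(stream, use_feedback=False)
print(mean_abs_gap(full, ablated) > 0.0)  # True
```

The two runs agree on the first frame, where there is nothing to feed back yet, and diverge afterwards. The open question for the paper is whether that divergence translates into a measurable drop in task metrics.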

Figures

Figures reproduced from arXiv: 2603.17837 by Chen Chen, Donghang Wu, Eng Siong Chng, Hexin Liu, Tianyu Zhang, Yoshua Bengio, Yuxin Li.

**Figure 1.** (a) Turn-based interaction. The agent remains idle and starts to respond only after the end of the user turn, and cannot be interrupted by the user, as shown in the red box. (b) A full-duplex SDLM continuously listens to streaming speech input, supports user barge-in, and automatically switches between thinking and speaking like a human speaker.

**Figure 2.** Overview of the proposed FLAIR. During the user’s speech phase, the LLM performs latent reasoning, using its output latent embeddings as the input for the next step. Once the user finishes speaking, the assistant autonomously decides when to respond; the LLM then executes an explicit forward pass, using text tokens as the input for the next step. When the user barges in, the LLM autonomously decides …

**Figure 3.** The distribution of the input audio, target text, and latent reasoning embeddings. Specifically, the latent reasoning embeddings act as a bridge that connects the input audio with the target text.
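
The input switching described in the Figure 2 caption can be sketched as a small selection function. This is an illustrative assumption about the data flow only: the function name `next_input`, the mode labels `"listen"` and `"speak"`, and the toy embedding table are hypothetical, and how the real model decides when to change mode is not covered here.

```python
def next_input(mode, last_hidden, last_token_id, token_embeddings):
    """Choose what the model consumes at the next step.

    mode == "listen": feed back the continuous output embedding (latent reasoning).
    mode == "speak":  feed the embedding of the text token just emitted.
    """
    if mode == "listen":
        return last_hidden
    if mode == "speak":
        return token_embeddings[last_token_id]
    raise ValueError(f"unknown mode: {mode!r}")

token_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # toy two-token vocabulary
hidden = [0.3, -0.2]                               # toy output embedding

print(next_input("listen", hidden, 1, token_embeddings))  # [0.3, -0.2]
print(next_input("speak", hidden, 1, token_embeddings))   # [0.0, 1.0]
```

In listening mode nothing is discretized, so no text is generated; in speaking mode the input is an ordinary token embedding, as in standard autoregressive decoding.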

Original abstract

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FLAIR adds recursive latent embeddings for think-while-listening in full-duplex speech models, but the evidence that this captures meaningful internal cognition is still thin.

The main point is that this paper gives a concrete way to run latent reasoning in parallel with incoming speech in full-duplex dialogue. They recursively pass the previous latent embedding forward while the user is still talking, train the whole thing with an ELBO under teacher forcing, and avoid any extra latency or need for reasoning labels. That setup is the actual new piece: it takes standard latent-variable tricks and wires them directly into the listening phase of a spoken system so the model can keep thinking without stopping to generate text first. The recursion keeps causality by construction, which fits the real-time constraint nicely.

If the downstream numbers hold, it could be useful for making voice interfaces feel less mechanical during overlaps and pauses. The paper reports competitive results on speech benchmarks and full-duplex metrics, which at least shows the mechanism does not hurt performance. The framing around human-like concurrent cognition is reasonable and the training objective is practical for existing dialogue data.

The soft spots are mostly about missing detail. The abstract gives no numbers, baselines, or ablations, so it is difficult to tell how much the recursive latent path actually contributes versus ordinary improvements in the encoder or decoder. The claim that the embeddings represent internal reasoning rests on the training setup rather than any direct test or qualitative check, and performance is measured on the same kind of data used for fitting. That creates the usual circularity risk.

Readers working on spoken dialogue systems or real-time voice agents would get the most out of it; people looking for broader advances in multimodal reasoning or cognitive modeling will find the scope narrow. I would send it to peer review so the experiments and implementation can be examined properly.

Referee Report

2 major / 2 minor

Summary. The paper introduces FLAIR, a Full-duplex LAtent and Internal Reasoning method for spoken dialogue models. It models concurrent internal cognition during speech perception by recursively reusing latent embeddings from prior steps as input to subsequent steps, preserving strict causality and zero added latency. An ELBO objective enables supervised fine-tuning under teacher forcing without requiring explicit reasoning annotations. The approach is evaluated on speech benchmarks and full-duplex interaction metrics, where it reports competitive performance.

Significance. If the recursive latent mechanism produces representations that meaningfully capture internal cognition and improve response quality, the work could advance full-duplex conversational systems by aligning model behavior more closely with human-like think-while-listening processes. The parameter-free recursive construction and avoidance of post-hoc generation are notable strengths, as is the use of an ELBO to train latent states without additional annotations. However, the absence of detailed quantitative results, error bars, baselines, or ablations in the provided description limits assessment of whether these gains are robust or merely incremental.

major comments (2)

The abstract and introduction claim competitive results on speech benchmarks and full-duplex metrics, yet supply no numerical values, baselines, error bars, or ablation studies. This absence is load-bearing for the central effectiveness claim and prevents verification that the latent-reasoning component drives the reported gains rather than other architectural choices.
The training objective is defined directly on the same latent embeddings that are asserted to represent internal reasoning, and performance is measured on the identical dialogue data used for optimization. While internally consistent by construction, this setup creates a moderate dependence that requires explicit controls (e.g., held-out reasoning probes or downstream task ablations) to substantiate that the latents encode cognition beyond what is needed for next-token prediction.

minor comments (2)

Notation for the recursive latent update and the precise form of the ELBO should be stated explicitly with equation numbers in the methods section to allow reproduction.
Clarify whether the teacher-forcing schedule during fine-tuning matches the inference-time recursive usage; any mismatch could affect causality claims.
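
The train/inference mismatch raised in the last comment can be illustrated on a generic toy predictor. This sketch is not FLAIR's training loop: the `predict` function and its deliberate bias are hypothetical, and it shows only the general difference between conditioning on ground-truth history (teacher forcing) and conditioning on the model's own outputs (free-running recursion).

```python
def predict(prev_value):
    """Toy one-step predictor with a deliberate bias."""
    return 0.9 * prev_value + 0.1

def teacher_forced(targets, start):
    """Each step conditions on the ground-truth previous value."""
    inputs = [start] + targets[:-1]
    return [predict(x) for x in inputs]

def free_running(num_steps, start):
    """Each step conditions on the model's own previous output."""
    outputs, prev = [], start
    for _ in range(num_steps):
        prev = predict(prev)
        outputs.append(prev)
    return outputs

targets = [1.0, 2.0, 3.0, 4.0]
tf = teacher_forced(targets, start=0.0)
fr = free_running(len(targets), start=0.0)
gaps = [abs(a - b) for a, b in zip(tf, fr)]
print(gaps[0] == 0.0, gaps[-1] > gaps[1])  # True True
```

The two rollouts agree on the first step and drift apart as the free-running model consumes its own outputs. If the latent recursion at inference is free-running while training is teacher-forced, a similar gap could arise, which is why the referee asks for the schedule to be stated.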

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below in detail and outline the revisions we will make to improve clarity and substantiation of our claims.

Referee: The abstract and introduction claim competitive results on speech benchmarks and full-duplex metrics, yet supply no numerical values, baselines, error bars, or ablation studies. This absence is load-bearing for the central effectiveness claim and prevents verification that the latent-reasoning component drives the reported gains rather than other architectural choices.

Authors: We acknowledge that the abstract and introduction, as currently written, summarize the outcomes at a high level without embedding specific numerical results. The full manuscript contains detailed experimental results in the Experiments section, including comparisons against baselines on speech benchmarks and full-duplex metrics, along with ablation studies. To directly address the concern and make the effectiveness claims more immediately verifiable, we will revise the abstract and introduction to incorporate key quantitative highlights (e.g., specific performance deltas and references to the relevant tables) while preserving the overall length constraints.
Revision: yes

Referee: The training objective is defined directly on the same latent embeddings that are asserted to represent internal reasoning, and performance is measured on the identical dialogue data used for optimization. While internally consistent by construction, this setup creates a moderate dependence that requires explicit controls (e.g., held-out reasoning probes or downstream task ablations) to substantiate that the latents encode cognition beyond what is needed for next-token prediction.

Authors: We agree that the shared use of dialogue data for both the ELBO objective and evaluation introduces a potential dependence that merits explicit controls. The recursive latent mechanism and ELBO are specifically constructed to optimize for internal state evolution during perception, which is distinct from standard autoregressive prediction; the full-duplex metrics further test real-time conversational behaviors not directly optimized during training. To strengthen this distinction, we will add targeted ablations in the revised manuscript that disable the recursive latent reuse while keeping other components fixed, thereby isolating the contribution of the internal reasoning latents.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper defines FLAIR via a recursive latent-embedding reuse mechanism during speech perception and an ELBO objective under teacher forcing to train without explicit reasoning annotations. Causality is preserved by the recursive structure by design, the ELBO is a standard variational lower bound on the latent states, and competitive results on speech benchmarks plus full-duplex metrics serve as an independent empirical check. No equation or claim reduces the asserted internal cognitive representations to a tautological fit or self-citation chain; the modeling choice and downstream evaluation remain distinct from the training inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard latent-variable modeling assumptions and the premise that hidden states can stand in for unannotated internal cognition; no new entities or free parameters are explicitly introduced in the abstract.

axioms (2)

Domain assumption: Latent embeddings can serve as proxies for subconscious internal cognitive processing during speech perception. Invoked to justify recursive feeding of embeddings as continuous reasoning.
Domain assumption: An ELBO objective with teacher forcing enables effective supervised finetuning of the latent reasoning process without explicit annotations. Central to the training procedure described.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
cs.SD · 2026-05 · unverdicted · novelty 5.0
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

A Survey of Audio Reasoning in Multimodal Foundation Models
eess.AS · 2026-05 · unverdicted · novelty 2.0
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1] Arora, S., Tian, J., Futami, H., Shi, J., Kashiwagi, Y., Tsunoo, E., and Watanabe, S. Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems. arXiv preprint arXiv:2510.02066.
[2] Bakhturina, E., Lavrukhin, V., Ginsburg, B., and Zhang, Y. Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497.
[3] Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1533–1544.
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
[5] VoiceBench: Benchmarking LLM-Based Voice Assistants
Chen, C., Hu, K., Yang, C.-H. H., Pasad, A., Casanova, E., Wang, W., Fu, S.-W., Li, J., Chen, Z., Balam, J., et al. Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions. In Second Conference on Language Modeling.
Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R. T., and Li, H. Voicebench: Benchmarking...
[6] Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. Shanks: Simultaneous hearing and thinking for spoken language models. arXiv preprint arXiv:2510.06917, 2025a.
Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. Stitch: Simultaneous thinking an...
[7] Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.
[8] Cui, W., Yu, D., Jiao, X., Meng, Z., Zhang, G., Wang, Q., Guo, Y., and King, I. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.
[9] Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., and Zeghidour, N. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.
[10] Ding, D., Ju, Z., Leng, Y., Liu, S., Liu, T., Shang, Z., Shen, K., Song, W., Tan, X., Tang, H., et al. Kimi-audio technical report. arXiv preprint arXiv:2504.18425.
[11] Du, Z., Wang, Y., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y., Gao, C., Wang, H., et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117.
[12] Faisal, F., Keshava, S., Alam, M. M. I., and Anastasopoulos, A. Sd-qa: Spoken dialectal question answering for the real world. In Findings of the Association for Computational Linguistics: EMNLP 2021.
[13] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769.
[14] He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. IEEE.
[15] Hu, K., Hosseini-Asl, E., Chen, C., Casanova, E., Ghosh, S., Żelasko, P., Chen, Z., Li, J., Balam, J., and Ginsburg, B. Efficient and direct duplex modeling for speech-to-speech language model. arXiv preprint arXiv:2505.15670.
[16] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
[17] Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[18] Koluguri, N. R., Sekoyan, M., Zelenfroynd, G., Meister, S., Ding, S., Kostandian, S., Huang, H., Karpov, N., Balam, J., Lavrukhin, V., et al. Granary: Speech recognition and translation dataset in 25 european languages. arXiv preprint arXiv:2505.13404.
[19] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., Cook, J., et al. Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577.
[20] Li, T., Liu, J., Zhang, T., Fang, Y., Pan, D., Wang, M., Liang, Z., Li, Z., Lin, M., Dong, G., et al. Baichuan-audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239.
[21] Lin, G.-T., Lian, J., Li, T., Wang, Q., Anumanchipalli, G., Liu, A. H., and Lee, H.-y. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721.
[22] Moritz, N., Hori, T., and Le, J. Streaming automatic speech recognition with the transformer model. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6074–6078. IEEE.
[23] Nachmani, E., Levkovitch, A., Hirsch, R., Salazar, J., Asawaroengchai, C., Mariooryad, S., Rivlin, E., Skerry-Ryan, R., and Ramanovich, M. T. Spoken question answering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255.
[24] Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the association for computational linguistics: ACL 2022, pp. 1864–1874.
[25] Noroozi, V., Chen, Z., Majumdar, S., Huang, S., Balam, J., and Ginsburg, B. Instruction data generation and unsupervised adaptation for speech language models. arXiv preprint arXiv:2406.12946, 2024a.
Noroozi, V., Majumdar, S., Kumar, A., Balam, J., and Ginsburg, B. Stateful conformer with cache-based inference for streaming automatic speech recogniti...
[26] Robust Speech Recognition via Large-Scale Weak Supervision
URL https://arxiv.org/abs/2212.04356.
Reece, A., Cooney, G., Bull, P., Chung, C., Dawson, B., Fitzpatrick, C., Glazer, T., Knox, D., Liebscher, A., and Marin, S. The candor corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances, 9(13):eadf3197.
[27] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., and Saruwatari, H. Utmos: Utokyo-sarulab system for voicemos challenge 2022. arXiv preprint arXiv:2204.02152.
[28] Saunshi, N., Dikkala, N., Li, Z., Kumar, S., and Reddi, S. J. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28.
[29] Shalev, Y., Feder, A., and Goldstein, A. Distributional reasoning in llms: Parallel reasoning processes in multi-hop reasoning. arXiv preprint arXiv:2406.13858.
[30] Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., and He, Y. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074.
[31] Shih, Y.-J., Raj, D., Wu, C., Zhou, W., Bong, S., Gaur, Y., Mahadeokar, J., Kalinli, O., and Seltzer, M. Can speech llms think while listening? arXiv preprint arXiv:2510.07497.
[32] Snyder, D., Chen, G., and Povey, D. Musan: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
[33] Veluri, B., Peloquin, B. N., Yu, B., Gong, H., and Gollakota, S. Beyond turn-based interfaces: Synchronous llms as full-duplex dialogue agents. arXiv preprint arXiv:2409.15594.
[34] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., and Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pp. 2717–2739, 2023a.
Wang, D., Wu, J., Li, J., Yang, D., Chen, X., Zhang, T....
[35] Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774, 2024
Wang, P., Lu, S., Tang, Y., Yan, S., Xia, W., and Xiong, Y. A full-duplex speech dialogue scheme based on large language model. Advances in Neural Information Processing Systems, 37:13372–13403, 2024a.
Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning...
[36] Wu, D., Zhang, H., Chen, C., Zhang, T., Tian, F., Yang, X., Yu, G., Liu, H., Hou, N., Hu, Y., et al. Chronological thinking in full-duplex spoken dialogue language models. arXiv preprint arXiv:2510.05150.
[37] Xu, Y., Guo, X., Zeng, Z., and Miao, C. Softcot: Soft chain-of-thought for efficient reasoning with llms. Annual Meeting of the Association for Computational Linguistics, 2025a. doi: 10.48550/arXiv.2502.12134.
Xu, Y., Guo, X., Zeng, Z., and Miao,...
[38] Yu, W., Wang, S., Yang, X., Chen, X., Tian, X., Zhang, J., Sun, G., Lu, L., Wang, Y., and Zhang, C. Salmonn-omni: A standalone speech llm without codec injection for full-duplex conversation. arXiv preprint arXiv:2505.17060.
[39] Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882.
[40] Zeng, A., Du, Z., Liu, M., Wang, K., Jiang, S., Zhao, L., Dong, Y., and Tang, J. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612.
[41] Zeng, B., Song, S., Huang, S., Wang, Y., Li, H., He, Z., Wang, X., Li, Z., and Lin, Z. Pretraining language models to ponder in continuous space. arXiv preprint arXiv:2505.20674.
[42] Zhang, C., Bütepage, J., Kjellström, H., and Mandt, S. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008–2026.
[43] Global-aware Expert
Zhang, H., Li, W., Chen, R., Kothapally, V., Yu, M., and Yu, D. Llm-enhanced dialogue management for full-duplex spoken dialogue systems. arXiv preprint arXiv:2502.14145, 2025a.
Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space. arXiv pr...
[44] For voice cloning, we construct a massive prompt bank by aggregating all speech segments with a duration of 5–10 seconds from the LibriTTS (Zen et al., 2019), YODAS (Li et al., 2023), and Hifi-TTS (Bakhturina et al.,
[45] corpora. In total, over 100k distinct speech segments spanning more than 20k unique speakers are utilized to ensure high acoustic variance in the generated training data. Speech continuation data (530K hours). Leveraging massive text pre-training corpora (Su et al., 2025), we construct synthetic pseudo-dialogues by alternately assigning sentences from cont...
[46] and MUSAN (Snyder et al., 2015). During the training phase, these noise signals are dynamically superimposed onto the user speech stream with an injection probability of 50%. The mixing intensity is varied by sampling a Signal-to-Noise Ratio (SNR) uniformly from the range of 0 dB to 60 dB. The pipeline for dataset construction is released in (Artificial A...
[47] and all the models are trained on 64 A800 (80G) GPUs. The LLM backbone is initialized from the Qwen2.5-7B-Instruct (Team, 2024). A Parakeet-based encoder (600M) is employed for speech encoder, which features a causal convolutional context to support streaming input and a Transformer-based modality adapter with 1024 hidden units to align audio features w...
[48] Table 5. Case study for PT model, FLAIR, and LLM (Qwen2.5-7B-Ins)
User query: Can you tell me a very easy way to clean a shower head?
Response by PT model: A easy and effective method you can try at home is using common household items.
Response by FLAIR: Sure! A simple way to clean a shower head is to remove it and soak it in white vinegar for about an hour....

[1] Arora, S., Tian, J., Futami, H., Shi, J., Kashiwagi, Y., Tsunoo, E., and Watanabe, S. Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems. arXiv preprint arXiv:2510.02066.

[2] Bakhturina, E., Lavrukhin, V., Ginsburg, B., and Zhang, Y. Hi-fi multi-speaker english tts dataset. arXiv preprint arXiv:2104.01497.

[3] Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1533–1544.

[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

[5] Chen, C., Hu, K., Yang, C.-H. H., Pasad, A., Casanova, E., Wang, W., Fu, S.-W., Li, J., Chen, Z., Balam, J., et al. Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions. In Second Conference on Language Modeling.

Chen, Y., Yue, X., Zhang, C., Gao, X., Tan, R. T., and Li, H. VoiceBench: Benchmarking LLM-Based Voice Assistants...

[6] Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. Shanks: Simultaneous hearing and thinking for spoken language models. arXiv preprint arXiv:2510.06917, 2025a.

Chiang, C.-H., Wang, X., Li, L., Lin, C.-C., Lin, K., Liu, S., Wang, Z., Yang, Z., Lee, H.-y., and Wang, L. Stitch: Simultaneous thinking an...

[7] Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al. Qwen2-Audio Technical Report. arXiv preprint arXiv:2407.10759.

[8] Cui, W., Yu, D., Jiao, X., Meng, Z., Zhang, G., Wang, Q., Guo, Y., and King, I. Recent advances in speech language models: A survey. arXiv preprint arXiv:2410.03751.

[9] Défossez, A., Mazaré, L., Orsini, M., Royer, A., Pérez, P., Jégou, H., Grave, E., and Zeghidour, N. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.

[10] Ding, D., Ju, Z., Leng, Y., Liu, S., Liu, T., Shang, Z., Shen, K., Song, W., Tan, X., Tang, H., et al. Kimi-Audio Technical Report. arXiv preprint arXiv:2504.18425.

[11] Du, Z., Wang, Y., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y., Gao, C., Wang, H., et al. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv preprint arXiv:2412.10117.

[12] Faisal, F., Keshava, S., Alam, M. M. I., and Anastasopoulos, A. Sd-qa: Spoken dialectal question answering for the real world. In Findings of the Association for Computational Linguistics: EMNLP 2021.

[13] Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training Large Language Models to Reason in a Continuous Latent Space. arXiv preprint arXiv:2412.06769.

[14] He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., Rybach, D., Kannan, A., Wu, Y., Pang, R., et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. IEEE.

[15] Hu, K., Hosseini-Asl, E., Chen, C., Casanova, E., Ghosh, S., Żelasko, P., Chen, Z., Li, J., Balam, J., and Ginsburg, B. Efficient and direct duplex modeling for speech-to-speech language model. arXiv preprint arXiv:2505.15670.

[16] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv preprint arXiv:1705.03551.

[17] Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[18] Koluguri, N. R., Sekoyan, M., Zelenfroynd, G., Meister, S., Ding, S., Kostandian, S., Huang, H., Karpov, N., Balam, J., Lavrukhin, V., et al. Granary: Speech recognition and translation dataset in 25 european languages. arXiv preprint arXiv:2505.13404.

[19] Kuchaiev, O., Li, J., Nguyen, H., Hrinchuk, O., Leary, R., Ginsburg, B., Kriman, S., Beliaev, S., Lavrukhin, V., Cook, J., et al. Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577.

[20] Li, T., Liu, J., Zhang, T., Fang, Y., Pan, D., Wang, M., Liang, Z., Li, Z., Lin, M., Dong, G., et al. Baichuan-audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239.

[21] In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 1115–1122.

Lin, G.-T., Lian, J., Li, T., Wang, Q., Anumanchipalli, G., Liu, A. H., and Lee, H.-y. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721.

[22] Moritz, N., Hori, T., and Le, J. Streaming automatic speech recognition with the transformer model. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6074–6078. IEEE.

[23] Nachmani, E., Levkovitch, A., Hirsch, R., Salazar, J., Asawaroengchai, C., Mariooryad, S., Rivlin, E., Skerry-Ryan, R., and Ramanovich, M. T. Spoken question answering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255.

[24] Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the association for computational linguistics: ACL 2022, pp. 1864–1874.

[25] Noroozi, V., Chen, Z., Majumdar, S., Huang, S., Balam, J., and Ginsburg, B. Instruction data generation and unsupervised adaptation for speech language models. arXiv preprint arXiv:2406.12946, 2024a.

Noroozi, V., Majumdar, S., Kumar, A., Balam, J., and Ginsburg, B. Stateful conformer with cache-based inference for streaming automatic speech recogniti...

[26] Robust Speech Recognition via Large-Scale Weak Supervision. URL https://arxiv.org/abs/2212.04356.

Reece, A., Cooney, G., Bull, P., Chung, C., Dawson, B., Fitzpatrick, C., Glazer, T., Knox, D., Liebscher, A., and Marin, S. The candor corpus: Insights from a large multimodal dataset of naturalistic conversation. Science Advances, 9(13):eadf3197.

[27] Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., and Saruwatari, H. Utmos: Utokyo-sarulab system for voicemos challenge 2022. arXiv preprint arXiv:2204.02152.

[28] Saunshi, N., Dikkala, N., Li, Z., Kumar, S., and Reddi, S. J. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28.

[29] Shalev, Y., Feder, A., and Goldstein, A. Distributional reasoning in llms: Parallel reasoning processes in multi-hop reasoning. arXiv preprint arXiv:2406.13858.

[30] Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., and He, Y. CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation. arXiv preprint arXiv:2502.21074.

[31] Shih, Y.-J., Raj, D., Wu, C., Zhou, W., Bong, S., Gaur, Y., Mahadeokar, J., Kalinli, O., and Seltzer, M. Can speech llms think while listening? arXiv preprint arXiv:2510.07497.

[32] Snyder, D., Chen, G., and Povey, D. MUSAN: A Music, Speech, and Noise Corpus. arXiv preprint arXiv:1510.08484.

[33] Veluri, B., Peloquin, B. N., Yu, B., Gong, H., and Gollakota, S. Beyond turn-based interfaces: Synchronous llms as full-duplex dialogue agents. arXiv preprint arXiv:2409.15594.

[34] Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., and Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pp. 2717–2739, 2023a.

Wang, D., Wu, J., Li, J., Yang, D., Chen, X., Zhang, T., ... MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark...

[35] Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774, 2024.

Wang, P., Lu, S., Tang, Y., Yan, S., Xia, W., and Xiong, Y. A full-duplex speech dialogue scheme based on large language model. Advances in Neural Information Processing Systems, 37:13372–13403, 2024a.

Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning...

[36] Wu, D., Zhang, H., Chen, C., Zhang, T., Tian, F., Yang, X., Yu, G., Liu, H., Hou, N., Hu, Y., et al. Chronological thinking in full-duplex spoken dialogue language models. arXiv preprint arXiv:2510.05150.

[37] Xu, Y., Guo, X., Zeng, Z., and Miao, C. SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs. Annual Meeting of the Association for Computational Linguistics, 2025a. doi: 10.48550/arXiv.2502.12134.

Xu, Y., Guo, X., Zeng, Z., and Miao,...

[38] Yu, W., Wang, S., Yang, X., Chen, X., Tian, X., Zhang, J., Sun, G., Lu, L., Wang, Y., and Zhang, C. Salmonn-omni: A standalone speech llm without codec injection for full-duplex conversation. arXiv preprint arXiv:2505.17060.

[39] Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. arXiv preprint arXiv:1904.02882.

[40] Zeng, A., Du, Z., Liu, M., Wang, K., Jiang, S., Zhao, L., Dong, Y., and Tang, J. GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot. arXiv preprint arXiv:2412.02612.

[41] Zeng, B., Song, S., Huang, S., Wang, Y., Li, H., He, Z., Wang, X., Li, Z., and Lin, Z. Pretraining language models to ponder in continuous space. arXiv preprint arXiv:2505.20674.

[42] Zhang, C., Bütepage, J., Kjellström, H., and Mandt, S. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008–2026.

[43] Global-aware Expert

Zhang, H., Li, W., Chen, R., Kothapally, V., Yu, M., and Yu, D. Llm-enhanced dialogue management for full-duplex spoken dialogue systems. arXiv preprint arXiv:2502.14145, 2025a.

Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space. arXiv pr...

[44] For voice cloning, we construct a massive prompt bank by aggregating all speech segments with a duration of 5–10 seconds from the LibriTTS (Zen et al., 2019), YODAS (Li et al., 2023), and Hifi-TTS (Bakhturina et al.,

[45] corpora. In total, over 100k distinct speech segments spanning more than 20k unique speakers are utilized to ensure high acoustic variance in the generated training data. Speech continuation data (530K hours). Leveraging massive text pre-training corpora (Su et al., 2025), we construct synthetic pseudo-dialogues by alternately assigning sentences from cont...

[46] and MUSAN (Snyder et al., 2015). During the training phase, these noise signals are dynamically superimposed onto the user speech stream with an injection probability of 50%. The mixing intensity is varied by sampling a Signal-to-Noise Ratio (SNR) uniformly from the range of 0 dB to 60 dB. The pipeline for dataset construction is released in (Artificial A...
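
The noise-injection step described in this excerpt (a 50% injection probability and an SNR drawn uniformly from 0 dB to 60 dB) can be illustrated with a minimal sketch. This is not the authors' released pipeline: the function names `mean_power`, `mix_at_snr`, and `maybe_add_noise` are hypothetical, waveforms are assumed to be plain lists of float samples, and SNR is assumed to be the ratio of mean signal power to mean noise power over the utterance.

```python
import math
import random


def mean_power(x):
    """Average squared amplitude of a waveform given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add it.

    Assumption: SNR is defined on mean power over the whole utterance.
    """
    if not speech or len(noise) < len(speech):
        raise ValueError("speech must be non-empty and noise at least as long as speech")
    noise = noise[: len(speech)]  # crop the noise clip to the utterance length
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        return list(speech)  # nothing to scale against; leave the speech unchanged
    # snr_db = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def maybe_add_noise(speech, noise, rng, p_inject=0.5, snr_range_db=(0.0, 60.0)):
    """With probability `p_inject`, mix in noise at an SNR sampled uniformly in dB.

    Returns the (possibly noisy) waveform and the sampled SNR, or None if no noise
    was injected. The defaults mirror the 50% probability and 0-60 dB range above.
    """
    if rng.random() >= p_inject:
        return list(speech), None
    snr_db = rng.uniform(snr_range_db[0], snr_range_db[1])
    return mix_at_snr(speech, noise, snr_db), snr_db
```

Passing an explicit `random.Random` instance as `rng` keeps the augmentation reproducible per worker; sampling the SNR in dB rather than in linear power spreads the training examples evenly across perceptual noise levels.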

[47] and all the models are trained on 64 A800 (80G) GPUs. The LLM backbone is initialized from Qwen2.5-7B-Instruct (Team, 2024). A Parakeet-based encoder (600M) is employed as the speech encoder, which features a causal convolutional context to support streaming input and a Transformer-based modality adapter with 1024 hidden units to align audio features w...

[48] Table 5. Case study for PT model, FLAIR, and LLM (Qwen2.5-7B-Ins).
User query: Can you tell me a very easy way to clean a shower head?
Response by PT model: A easy and effective method you can try at home is using common household items.
Response by FLAIR: Sure! A simple way to clean a shower head is to remove it and soak it in white vinegar for about an hour....