Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
Pith reviewed 2026-05-18 07:40 UTC · model grok-4.3
The pith
Spoken language models achieve full chain-of-thought reasoning in real time by splitting high-level formulation from speech articulation across two coordinated brains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By employing a Formulation Brain to generate high-level reasoning that paces and guides a separate Articulation Brain responsible for fluent speech production, spoken language models can reason in real time at a quality matching models that pre-compute the full chain of thought. The approach avoids both the latency of sequential thought generation and the quality loss from mode switching.
What carries the argument
The dual-brain architecture in which a Formulation Brain performs high-level reasoning to pace and direct an Articulation Brain that produces the actual speech output.
If this is right
- The method delivers reasoning accuracy comparable to full pre-computed chain-of-thought while cutting response latency to near zero.
- Existing think-while-speaking techniques are outperformed on both mathematical reasoning and open conversation benchmarks.
- Real-time spoken interaction becomes possible without forcing a tradeoff between depth of thought and speed of delivery.
- The same division of labor can be applied to other spoken tasks that currently require either long pauses or shallow responses.
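To make the latency point in the first bullet concrete, here is a toy calculation that is not taken from the paper. It assumes a constant decoding rate and made-up token counts; the function name `time_to_first_speech` and the values 600 and 50.0 are illustrative assumptions only. It compares the wait before the first spoken token when the whole chain of thought is decoded first against a setup where speech starts immediately and reasoning runs alongside it.

```python
def time_to_first_speech(thought_tokens: int, tokens_per_second: float,
                         think_before_speaking: bool) -> float:
    """Seconds before the first spoken token can be emitted (toy model)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if think_before_speaking:
        # Sequential: every reasoning token is decoded before speech begins.
        return thought_tokens / tokens_per_second
    # Parallel: speech starts right away; reasoning proceeds alongside it.
    return 0.0


# Hypothetical numbers chosen only for illustration.
sequential = time_to_first_speech(600, 50.0, think_before_speaking=True)
parallel = time_to_first_speech(600, 50.0, think_before_speaking=False)
print(f"sequential: {sequential:.1f}s, parallel: {parallel:.1f}s")
```

The model deliberately ignores the cost of decoding the first speech token itself, so the parallel figure is an idealized lower bound rather than a measured latency.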
Where Pith is reading between the lines
- Live spoken assistants could handle multi-step problems such as calculations or planning while the user is still speaking.
- The same separation principle might apply to other output modalities where planning must continue after generation begins.
- Training procedures could be adjusted to reinforce the pacing signal from the formulation component to the articulation component.
Load-bearing premise
The split between a Formulation Brain and an Articulation Brain can be maintained without coordination mistakes that would degrade either the reasoning steps or the spoken output quality.
What would settle it
A direct comparison in which the dual-brain model produces more reasoning errors or lower task accuracy than a single model that generates the complete chain of thought before speaking, under matched conditions and identical base models.
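One way such a matched comparison could be scored is with a paired test on per-item correctness, since both systems answer the same questions. The sketch below uses an exact McNemar test; this choice of test, the helper names `discordant_counts` and `mcnemar_exact`, and the sample outcomes are all assumptions for illustration and do not come from the paper.

```python
from math import comb


def mcnemar_exact(only_a_correct: int, only_b_correct: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence either way
    k = min(only_a_correct, only_b_correct)
    # Under the null, each discordant item favours either system with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def discordant_counts(correct_a, correct_b):
    """Count items where exactly one system is correct (same items, same order)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    return only_a, only_b


# Hypothetical per-item outcomes, not data from the paper.
full_cot = [True, True, False, True, True, False, True, True]
dual_brain = [True, False, False, True, True, True, True, True]
only_full, only_dual = discordant_counts(full_cot, dual_brain)
p_value = mcnemar_exact(only_full, only_dual)
```

A small p-value with more items solved only by the full chain-of-thought model would count against the parity claim; a large one would mean the test found no detectable difference at that sample size.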
Original abstract
Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. MPS is the methodology underlying our released Step-Audio R1.1 system, effectively bridging the gap between high-quality reasoning and real-time interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mind-Paced Speaking (MPS), a brain-inspired dual-brain framework for real-time reasoning in Spoken Language Models. A Formulation Brain performs high-level reasoning to pace and guide a separate Articulation Brain responsible for fluent speech generation. This division of labor is claimed to eliminate mode-switching while preserving reasoning integrity, enabling performance comparable to full pre-computed Chain-of-Thought models but with drastically reduced latency. Experiments report that MPS outperforms existing think-while-speaking baselines and, in a zero-latency configuration, reaches 92.8% accuracy on the Spoken-MQA mathematical reasoning task and 82.5 on the URO-Bench speech conversation task. The approach underlies the released Step-Audio R1.1 system.
Significance. If the reported performance parity with full CoT holds without hidden coordination penalties, the work would meaningfully advance real-time SLMs by closing the gap between high-quality reasoning and low-latency interaction. The explicit release of the underlying Step-Audio R1.1 system provides a concrete artifact for reproducibility and extension, which strengthens the practical contribution.
major comments (2)
- [Abstract and dual-brain architecture description] The zero-latency claim and the 92.8% Spoken-MQA accuracy rest on the assertion that high-level reasoning from the Formulation Brain can be streamed to the Articulation Brain without introducing buffering, truncation, or synchronization costs. The manuscript provides no concrete specification of this interface (e.g., token-level handoff protocol, context window sharing, or pacing mechanism), leaving open the possibility that the reported parity with pre-computed CoT is achieved only under unstated assumptions about perfect coordination.
- [Experiments] The experimental section reports strong benchmark results and comparisons, yet the abstract and available description omit full implementation details, exact baselines, error analysis, and ablation studies on the coordination interface. Without these, it is not possible to verify that the performance gains are attributable to the proposed division of labor rather than post-hoc tuning or dataset-specific factors.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments have prompted us to clarify key aspects of the dual-brain architecture and strengthen the experimental reporting. We provide point-by-point responses below, indicating revisions where appropriate.
- Referee: [Abstract and dual-brain architecture description] The zero-latency claim and the 92.8% Spoken-MQA accuracy rest on the assertion that high-level reasoning from the Formulation Brain can be streamed to the Articulation Brain without introducing buffering, truncation, or synchronization costs. The manuscript provides no concrete specification of this interface (e.g., token-level handoff protocol, context window sharing, or pacing mechanism), leaving open the possibility that the reported parity with pre-computed CoT is achieved only under unstated assumptions about perfect coordination.
  Authors: We appreciate the referee highlighting the importance of explicitly detailing the interface. The original manuscript describes the streaming process at a high level in Section 3, but we agree that additional concreteness is warranted. In the revised version we have expanded Section 3.2 with a precise token-level handoff protocol: the Formulation Brain emits reasoning tokens incrementally into a shared KV cache; the Articulation Brain consumes them on-the-fly using a sliding attention window that prevents truncation or buffering. The pacing mechanism relies on asynchronous token arrival with no explicit synchronization barrier, which our latency measurements show adds negligible overhead. We have added a new figure and pseudocode in the appendix to illustrate this flow. These clarifications demonstrate that the reported performance parity does not depend on hidden assumptions.
  Revision: yes
- Referee: [Experiments] The experimental section reports strong benchmark results and comparisons, yet the abstract and available description omit full implementation details, exact baselines, error analysis, and ablation studies on the coordination interface. Without these, it is not possible to verify that the performance gains are attributable to the proposed division of labor rather than post-hoc tuning or dataset-specific factors.
  Authors: We agree that fuller experimental documentation is needed for verifiability. The revised manuscript now includes: complete model architectures and training hyperparameters for both brains; exact baseline implementations with citations and hyperparameter settings; a dedicated error analysis section breaking down failure modes on Spoken-MQA and URO-Bench; and new ablation studies that isolate the coordination interface (varying streaming granularity and context-sharing strategies). These additions show that gains arise from the division of labor rather than tuning artifacts. We have also updated the abstract to reference the expanded experimental details and made the Step-Audio R1.1 code publicly available.
  Revision: yes
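The handoff described in the first author response (incremental emission, on-the-fly consumption, no synchronization barrier) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' protocol: a `Queue` stands in for the shared KV cache, a bounded `deque` stands in for the sliding attention window, and the names `formulation_brain`, `run_handoff`, `tokens_per_speech_step`, and `window` are hypothetical. It is single-threaded so the result is deterministic; a real system would run the two sides concurrently.

```python
from collections import deque
from queue import Empty, Queue


def formulation_brain(steps):
    """Hypothetical stand-in that yields one reasoning token at a time."""
    for step in steps:
        yield step


def run_handoff(reasoning_steps, speech_steps, tokens_per_speech_step=2, window=3):
    """Return the reasoning context visible at each speech step."""
    channel = Queue()               # stand-in for the shared buffer between the brains
    context = deque(maxlen=window)  # keeps only the most recent reasoning tokens
    thinker = formulation_brain(reasoning_steps)
    transcript = []
    for _ in range(speech_steps):
        # Formulation side: emit a few tokens, or nothing once reasoning is finished.
        for _ in range(tokens_per_speech_step):
            token = next(thinker, None)
            if token is not None:
                channel.put(token)
        # Articulation side: take whatever has arrived, never waiting for more.
        while True:
            try:
                context.append(channel.get_nowait())
            except Empty:
                break
        # A real Articulation Brain would now decode speech conditioned on `context`.
        transcript.append(tuple(context))
    return transcript


views = run_handoff(["r1", "r2", "r3", "r4", "r5"], speech_steps=4)
```

Note one simplification: the bounded `deque` silently discards the oldest reasoning tokens, whereas the response claims the real window avoids truncation. Varying `tokens_per_speech_step` and `window` is a rough analogue of the streaming-granularity and context-sharing ablations mentioned in the second response.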
Circularity Check
No significant circularity in architectural proposal or empirical claims
The paper presents Mind-Paced Speaking as a dual-brain framework with Formulation Brain for high-level reasoning pacing an Articulation Brain for speech generation. This division is claimed to eliminate mode-switching while preserving reasoning integrity, with performance validated empirically on Spoken-MQA (92.8% accuracy under zero-latency) and URO-Bench (82.5 score), comparable to full CoT pre-computation but with lower latency. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The central results rest on experimental outcomes rather than reducing by construction to inputs or prior self-referential definitions, making the proposal self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Separation into Formulation Brain and Articulation Brain eliminates mode-switching while preserving reasoning integrity.
invented entities (2)
- Formulation Brain: no independent evidence
- Articulation Brain: no independent evidence
Forward citations
Cited by 4 Pith papers
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- StepAudio 2.5 Technical Report
  StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
- DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
  DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
- A Survey of Audio Reasoning in Multimodal Foundation Models
  A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
Reference graph
Works this paper leans on
- [1] Wenqian Cui et al. “Recent advances in speech language models: A survey”. In: arXiv preprint arXiv:2410.03751 (2024)
- [2] Boyong Wu et al. “Step-audio 2 technical report”. In: arXiv preprint arXiv:2507.16632 (2025)
- [3] Ke Hu et al. “Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model”. In: arXiv preprint arXiv:2505.15670 (2025)
- [4] Wenyi Yu et al. “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation”. In: arXiv preprint arXiv:2411.18138 (2024)
- [5] Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In: arXiv preprint arXiv:2410.00037 (2024)
- [6] Jason Wei et al. “Chain-of-thought prompting elicits reasoning in large language models”. In: Advances in neural information processing systems 35 (2022), pp. 24824–24837
- [7] Xuezhi Wang et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=1PL1NIMMrw
- [8] Shunyu Yao et al. “Tree of thoughts: Deliberate problem solving with large language models”. In: Advances in neural information processing systems 36 (2023), pp. 11809–11822
- [9] Luyu Gao et al. “Pal: Program-aided language models”. In: International Conference on Machine Learning. PMLR. 2023, pp. 10764–10799
- [10] Jingran Xie et al. “Leveraging chain of thought towards empathetic spoken dialogue without corresponding question-answering data”. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2025, pp. 1–5
- [11] Daya Guo et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning”. In: arXiv preprint arXiv:2501.12948 (2025)
- [12] Cheng-Han Chiang et al. “STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models”. In: arXiv preprint arXiv:2507.15375 (2025)
- [13] Zhifei Xie et al. “Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models”. In: arXiv preprint arXiv:2508.15827 (2025)
- [14]
- [15] Gregory Hickok and David Poeppel. “The cortical organization of speech processing”. In: Nature reviews neuroscience 8.5 (2007), pp. 393–402
- [16] Peter Indefrey. “The spatial and temporal signatures of word production components: a critical update”. In: Frontiers in psychology 2 (2011), p. 255
- [17] Jin Xu et al. “Qwen2.5-omni technical report”. In: arXiv preprint arXiv:2503.20215 (2025)
- [18] Qingkai Fang et al. “Llama-omni2: Llm-based real-time spoken chatbot with autoregressive streaming speech synthesis”. In: arXiv preprint arXiv:2505.02625 (2025)
- [19] Aohan Zeng et al. “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot”. In: arXiv preprint arXiv:2412.02612 (2024)
- [20] Zhifei Xie et al. “Audio-reasoner: Improving reasoning capability in large audio language models”. In: arXiv preprint arXiv:2503.02318 (2025)
- [21] Zhihong Shao et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models”. In: arXiv preprint arXiv:2402.03300 (2024)
- [22] Cheng Wen et al. “Sari: Structured audio reasoning via curriculum-guided reinforcement learning”. In: arXiv preprint arXiv:2504.15900 (2025)
- [23] Gang Li et al. “Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering”. In: arXiv preprint arXiv:2503.11197 (2025)
- [24] Sachin Goyal et al. “Think before you speak: Training language models with pause tokens”. In: arXiv preprint arXiv:2310.02226 (2023)
- [25] Jacob Pfau, William Merrill, and Samuel R Bowman. “Let’s think dot by dot: Hidden computation in transformer language models”. In: arXiv preprint arXiv:2404.15758 (2024)
- [26] Tom Brown et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901
- [27] Chengwei Wei et al. “Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems”. In: arXiv preprint arXiv:2505.15000 (2025)
- [28] Ruiqi Yan et al. “Uro-bench: A comprehensive benchmark for end-to-end spoken dialogue models”. In: arXiv preprint arXiv:2502.17810 (2025)
- [29] OpenAI. GPT-4o System Card. 2024. arXiv: 2410.21276 [cs.CL]. URL: https://arxiv.org/abs/2410.21276
- [30] An Yang et al. Qwen2 Technical Report. 2024. arXiv: 2407.10671 [cs.CL]. URL: https://arxiv.org/abs/2407.10671
Preprint. Work in progress.
A Appendix
A.1 Example of MPS-spkfirst
Figure 3 shows an example of MPS-spkfirst on Spoken-MQA. After receiving the user input, the Articulation Brain LLM first generates a response segment Sres 1 . Simultaneously, ...