VoxMind: An End-to-End Agentic Spoken Dialogue System
Pith reviewed 2026-05-10 08:26 UTC · model grok-4.3
The pith
VoxMind equips spoken dialogue models with agentic tool-use via a curated dataset, Think-before-Speak reasoning, and multi-agent tool management, raising task completion from 34.88% to 74.57% while preserving conversational quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental results confirm that VoxMind achieves significant improvements in agent performance: compared with strong baselines, the task completion rate increases from 34.88% to 74.57%, outperforming Gemini-2.5-Pro on spoken agent tasks while preserving general conversational quality.
Load-bearing premise
The assumption that the reported performance gains are primarily caused by the Think-before-Speak mechanism and Multi-Agent Dynamic Tool Management rather than the specific dataset curation, unstated implementation choices, or evaluation protocol details.
Original abstract
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real-world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool-augmented extensions. To bridge this gap, we present VoxMind, an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Leveraging our curated 470-hour AgentChat dataset, we incorporate a "Think-before-Speak" mechanism, enabling the model to internalize structured reasoning as a critical prerequisite for planning and response generation. Furthermore, to mitigate latency bottlenecks caused by large-scale tool integration, we propose a Multi-Agent Dynamic Tool Management architecture. By asynchronously delegating retrieval tasks to an auxiliary agent aligned with the main model's reasoning trajectory, this system effectively decouples inference latency from toolset size. Experimental results confirm that VoxMind achieves significant improvements in agent performance: compared with strong baselines, the task completion rate increases from 34.88% to 74.57%, outperforming Gemini-2.5-Pro on spoken agent tasks while preserving general conversational quality. The source code and associated data are publicly available at https://github.com/MM-Speech/VoxMind.
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: empirical claims rest on reported experiments, not self-referential definitions or fits
The paper introduces a new dataset and two architectural mechanisms, then reports task-completion improvements from end-to-end experiments. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make any result equivalent to its inputs by construction. The performance numbers (34.88% to 74.57%) are presented as measured outcomes against baselines, which remain externally falsifiable and do not reduce to tautology.
Forward citations
Cited by 1 Pith paper
-
A Survey of Audio Reasoning in Multimodal Foundation Models
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
Reference graph
Works this paper leans on
-
[1]
Stream rag: Instant and accurate spoken dialogue systems with streaming tool usage. ArXiv, abs/2510.02044. Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, and Xie Chen. 2024a. Slam-omni: Timbre-controllable voice interaction system...
-
[2]
Tool Extraction: Extract all tool calls from both the target and model outputs (including only tool names and parameter name-value pairs), ignoring textual content, formatting, spaces, quotes, and line breaks. No correctness judgment is performed at this stage
-
[3]
Tool Selection Evaluation: Compare extracted tool names (case-sensitive, ignoring order and leading/trailing spaces). Tool occurrence counts must match exactly; otherwise, evaluation stops immediately and both tool selection and parameter filling are marked incorrect
-
[4]
Parameter Filling Evaluation: Performed only if tool selection is correct. Parameter names ignore case and spaces, while parameter values must match exactly. Numeric equivalence and quoting differences are allowed, and argument order does not affect the evaluation
-
[5]
Output Format: Evaluation results are strictly returned in JSON format, containing only two boolean fields: "func-select-correct": true|false, "param-fill-correct": true|false. This procedure ensures rigorous, reproducible evaluation and avoids direct string comparison, providing a precise measure of a speech agent’s tool-call capabilities. G Training ...
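The four stages above can be approximated with deterministic rules. The following is a minimal sketch, not the paper's evaluator: it assumes tool calls have already been extracted into `(tool_name, params)` pairs, and the helper names `normalize_name`, `normalize_value`, and `evaluate` are hypothetical. Matching repeated calls to the same tool as a multiset is also an assumption, since the text does not say how duplicates are paired.

```python
import json
from collections import Counter


def normalize_name(param_name: str) -> str:
    """Parameter names ignore case and spaces."""
    return param_name.replace(" ", "").lower()


def normalize_value(value) -> str:
    """Values must match exactly, apart from quoting and numeric form (42 vs 42.0)."""
    text = str(value).strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]  # drop one pair of surrounding quotes
    try:
        return repr(float(text))  # numeric equivalence
    except ValueError:
        return text


def evaluate(target_calls, model_calls) -> str:
    """Each call is a (tool_name, params_dict) pair; returns the two-field JSON verdict."""
    # Tool selection: case-sensitive names, order ignored, occurrence counts must match.
    target_names = Counter(name.strip() for name, _ in target_calls)
    model_names = Counter(name.strip() for name, _ in model_calls)
    if target_names != model_names:
        # A selection mismatch stops evaluation and fails both fields.
        return json.dumps({"func-select-correct": False, "param-fill-correct": False})

    def canonical(calls):
        # Assumption: repeated calls to one tool are compared as a multiset.
        return Counter(
            (
                name.strip(),
                frozenset(
                    (normalize_name(k), normalize_value(v)) for k, v in params.items()
                ),
            )
            for name, params in calls
        )

    params_ok = canonical(target_calls) == canonical(model_calls)
    return json.dumps({"func-select-correct": True, "param-fill-correct": params_ok})
```

Using `frozenset` for parameters makes argument order irrelevant, and `Counter` enforces the exact occurrence-count rule without any sorting.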
-
[6]
Start from the User Query; do not back-solve from the answer
-
[7]
Explain why the selected tool is needed and appropriate
-
[8]
For every parameter in the tool call, explain its source and any transformation
-
[9]
Do not introduce assumptions not stated in the query
-
[10]
Produce 5–12 steps in natural language; keep within ≤ THINK_MAX_WORDS words
-
[11]
Do NOT output the tool call; output only the reasoning
-
[12]
searchTools() indicates that the current tool list cannot meet the task and that additional tools are needed. Output format: Output strictly one-line JSON: {"think": "..."}. No extra text. User Query: {(user_q or '').strip()} Gold Tool Call: {(assistant_a or '').strip()} Figure 7: Prompt for constructing chain-of-thought reasoning data for tool interaction....
-
[13]
Start reasoning strictly from the User Query; do not back-solve from the final answer
-
[14]
Ensure that each step follows causally from the previous one
-
[15]
Do not introduce assumptions or external knowledge not implied by the query
-
[16]
Focus on semantic reasoning rather than stylistic or rhetorical choices
-
[17]
Produce 5–12 reasoning steps in natural language; keep within ≤ THINK_MAX_WORDS words
-
[18]
The Chain-of-Thought is used for training purposes only and will not be shown to end users
-
[19]
Do NOT output the final answer; output only the reasoning process. Output format: Output strictly one-line JSON: {"think": "..."}. No extra text. User Query: {(user_q or '').strip()} Gold Response: {(assistant_a or '').strip()} Figure 8: Prompt for constructing chain-of-thought reasoning data for general dialogue interaction. Chain-of-thought quality evalua...
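Both construction prompts demand a one-line JSON object of the form {"think": "..."} within a word budget. A response could be checked along the following lines. This is a sketch under assumptions: the function name `parse_think` is hypothetical, the source does not state the value of THINK_MAX_WORDS (so it is passed in as a parameter), and rejecting empty reasoning is an added assumption.

```python
import json


def parse_think(raw: str, think_max_words: int) -> str:
    """Return the reasoning text, or raise ValueError if the output format is violated."""
    line = raw.strip()
    if "\n" in line:
        raise ValueError("output must be a single line")
    try:
        payload = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    # "No extra text": the object may contain only the "think" key.
    if not isinstance(payload, dict) or set(payload) != {"think"}:
        raise ValueError('output must be exactly {"think": "..."}')
    think = payload["think"]
    # Assumption: empty reasoning is treated as invalid.
    if not isinstance(think, str) or not think.strip():
        raise ValueError('"think" must be a non-empty string')
    if len(think.split()) > think_max_words:
        raise ValueError("reasoning exceeds the word budget")
    return think
```

The 5–12 step requirement is not checked here, because the prompts do not define how steps are delimited in the output string.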
-
[20]
Candidate Chain-of-Thought Scoring criteria (very strict):
-
[21]
Logical soundness (0–3): Is the reasoning stepwise, coherent, and causally connected? Are any key reasoning steps missing?
-
[22]
Consistency with the tool call (0–3): Does the Chain-of-Thought explain why the tool is selected and the source of each tool parameter? Is it fully consistent with the Gold Tool Call?
-
[23]
No hallucination (0–2): Does the reasoning avoid inventing facts, assumptions, or parameters not present in the User Query or Gold Tool Call?
-
[24]
Clarity (0–2): Is the reasoning clear, well-structured, and easy to follow as a step-by-step explanation? Final score: The final score is the sum of all criteria, ranging from 0 to 10. Output format requirements: Output JSON only; no explanations. Fields: {"score": <0–10 integer or float>, "reason": "<brief 1–2 sentence justification>"}. Inputs: User Quer...
-
[25]
Generated Chain-of-Thought Scoring criteria:
-
[26]
Correctness (0–4): Does the Chain-of-Thought logically derive the final answer? Is each reasoning step factually correct? Is the reasoning free from hallucinations or fabricated assumptions? Scoring guidelines: 4 = fully correct 3 = mostly correct with minor issues 2 = contains some errors but the final answer is still reachable 1 = reasoning is incorrect...
-
[27]
Relevance (0–2): Does the Chain-of-Thought remain tightly focused on the user question? Does it avoid unrelated tangents? Scoring guidelines: 2 = fully relevant 1 = partially relevant 0 = mostly irrelevant
-
[28]
Step quality and clarity (0–2): Are the reasoning steps clear, structured, and easy to follow? Are there any unjustified jumps in logic? Scoring guidelines: 2 = very clear 1 = acceptable clarity 0 = unclear or disorganized
-
[29]
Completeness (0–1): Does the Chain-of-Thought cover all necessary steps to justify the final answer? Scoring guidelines: 1 = complete 0 = missing key steps
-
[30]
Brevity and conciseness (0–1): Is the Chain-of-Thought concise and free of unnecessary verbosity? Scoring guidelines: 1 = concise 0 = overly long or verbose Final score: The total score is the sum of all criteria, ranging from 0 to 10. Output format requirements (strict): Output JSON only; no additional text. Required fields: { "correctness": X, "relevanc...
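Each rubric assigns a maximum per criterion and sums the parts to a score out of 10 (3 + 3 + 2 + 2 for tool reasoning, 4 + 2 + 2 + 1 + 1 for general dialogue). The sketch below checks sub-scores against those maxima before summing. The dictionary keys other than "correctness" are illustrative names, since the required field list is truncated in the source, and `total_score` is a hypothetical helper.

```python
# Maximum points per criterion, taken from the two rubrics.
TOOL_COT_RUBRIC = {
    "logical_soundness": 3,
    "tool_call_consistency": 3,
    "no_hallucination": 2,
    "clarity": 2,
}
DIALOGUE_COT_RUBRIC = {
    "correctness": 4,
    "relevance": 2,
    "step_quality": 2,
    "completeness": 1,
    "brevity": 1,
}


def total_score(scores: dict, rubric: dict) -> float:
    """Validate each sub-score against its maximum, then return the sum."""
    if set(scores) != set(rubric):
        raise ValueError("scores must cover exactly the rubric criteria")
    for criterion, value in scores.items():
        if not 0 <= value <= rubric[criterion]:
            raise ValueError(f"{criterion} must be within 0..{rubric[criterion]}")
    return sum(scores.values())
```

Validating the parts before summing catches a judge response whose total looks plausible but whose sub-scores exceed their caps.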
-
[31]
Real-time information (e.g., weather, stock prices, news, events, schedules)
-
[32]
External knowledge not contained in general training data (e.g., private databases, personal files, proprietary datasets)
-
[33]
Precise numerical computation beyond mental math (e.g., long arithmetic, complex mathematical evaluation)
-
[34]
Retrieval of specific, unmemorized facts (e.g., obscure identifiers, URLs, tables, large codebases)
-
[35]
Interaction with an external environment (e.g., search engines, APIs, calculators, file operations) A tool is not necessary when: - The question asks for explanations, definitions, or conceptual reasoning - The answer can be inferred using general world knowledge - The question is creative in nature (e.g., writing, storytelling, opinions, reasoning, code ex...
-
[36]
Preserve the logical flow of the original reasoning while condensing it into at most {num} English words
-
[37]
Start from the user’s intent as reflected in the original reasoning
-
[38]
Explicitly mention the selected tool and explain why it is appropriate
-
[39]
For each parameter in the tool call, briefly indicate its source from the user request or the tool-call text
-
[40]
Do not introduce assumptions not supported by the original reasoning
-
[41]
Output only the compressed reasoning
-
[42]
Output strictly one-line JSON: {"think": "..."} with no extra text. Inputs: Original Chain-of-Thought: {orig_think} Gold Tool Call (raw, including tags): {raw_tool_call} Figure 12: Prompt specification for compressing original Chain-of-Thought annotations in tool-based data cleaning. Gemini-2.5-flash evaluates core capabilities of end-to-end speech agents...
-
[43]
Tool Selection (ONLY based on extracted tool names) - Compare tool names AFTER extraction, not raw text. - Tool name = full string before ‘(’. - Tool names are case-sensitive; ignore leading/trailing spaces. - Tool occurrence counts must match exactly (order does NOT matter). - If ANY mismatch exists: * func_select_correct = false * param_fill_correct = f...
-
[44]
Parameter Filling (ONLY if Tool Selection is correct) - Compare parameters ONLY within matched tools. - Parameter names ignore case and spaces. - Parameter values must match exactly (case-sensitive). - Ignore ALL quoting differences: q='Taylor Swift' ≡ q="Taylor Swift" ≡ q=Taylor Swift - Numeric equivalence: 42 ≡ 42.0 - Argument order does NOT matter. STRICT OU...
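The rule "Tool name = full string before '('" implies a simple split of each call string. The sketch below assumes a flat `name(key=value, ...)` syntax with no commas or parentheses inside values, which the source does not guarantee, and `parse_tool_call` is a hypothetical helper name. Values are returned as raw strings so that quoting and numeric normalization can be applied afterwards.

```python
def parse_tool_call(text: str):
    """Split 'name(k=v, ...)' into (name, {k: v}); values stay as raw strings."""
    text = text.strip()
    open_idx = text.find("(")
    if open_idx == -1 or not text.endswith(")"):
        raise ValueError(f"not a tool call: {text!r}")
    # Tool name is everything before the first '(' (leading/trailing spaces ignored).
    name = text[:open_idx].strip()
    body = text[open_idx + 1 : -1].strip()
    params = {}
    if body:
        # Assumption: values contain no commas, so a plain split is sufficient.
        for part in body.split(","):
            key, sep, value = part.partition("=")
            if not sep:
                raise ValueError(f"parameter without '=': {part!r}")
            params[key.strip()] = value.strip()
    return name, params
```

A call with no arguments, such as `searchTools()`, yields an empty parameter dictionary.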