arxiv: 2604.04847 · v1 · submitted 2026-04-06 · 📡 eess.AS · cs.CL

Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Guan-Ting Lin , Chen Chen , Zhehuai Chen , Hung-yi Lee

Pith reviewed 2026-05-10 19:12 UTC · model grok-4.3

keywords: full-duplex voice agents · spoken language models · disfluency · tool use · benchmark · latency · turn-taking · real human audio

The pith

Full-Duplex-Bench-v3 uses real disfluent human audio to test voice agents on chained tool-use tasks across four domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Full-Duplex-Bench-v3 as a benchmark built entirely from real human audio recordings annotated for five disfluency types and paired with scenarios that demand multiple sequential tool calls. It evaluates six voice agent configurations on task accuracy, response latency, and turn-taking behavior under these conditions. Results show distinct strengths and weaknesses: GPT-Realtime achieves the highest Pass@1 score and best interruption avoidance, Gemini Live 3.1 delivers the lowest latency, and a cascaded pipeline exhibits perfect turn-taking but the longest delays. Self-correction handling and reasoning in harder multi-step cases emerge as the most frequent failure points across all systems.

Core claim

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations across accuracy, latency, and turn-taking dimensions, finding that GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the Cascaded baseline incurs the highest latency (10.12 s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.

What carries the argument

Full-Duplex-Bench-v3, a dataset of annotated real human audio paired with multi-step tool-use scenarios that measures accuracy, latency, and turn-taking in full-duplex voice agents.

Load-bearing premise

The five disfluency annotations on real human audio and the selected tool-use scenarios accurately reflect the distribution of challenges that arise in actual user interactions with voice agents.

What would settle it

If a model that scores highest on Full-Duplex-Bench-v3 performs worse than lower-scoring models when tested live with new users producing similar disfluencies and tool requests, or if human raters consistently disagree with the benchmark's automated accuracy and turn-taking scores, the evaluation would lose validity.

Figures

Figures reproduced from arXiv: 2604.04847 by Chen Chen, Guan-Ting Lin, Hung-yi Lee, Zhehuai Chen.

Original abstract

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations (GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline of Whisper → GPT-4o → TTS) across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12 s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
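
To make the three reported dimensions concrete, the sketch below aggregates per-sample results into a Pass@1 score, a mean latency, and a turn-take rate. It is an illustration only: the record fields (`passed`, `latency_s`, `took_turn`), the `summarize` helper, and the demo values are hypothetical, and the paper's actual scoring protocol may differ.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SampleResult:
    passed: bool      # hypothetical: the single attempt completed the full tool chain
    latency_s: float  # hypothetical: seconds from end of user speech to the key response
    took_turn: bool   # hypothetical: the agent responded once the user had finished


def summarize(results):
    """Aggregate per-sample records into the three headline metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "pass_at_1": sum(r.passed for r in results) / n,
        "mean_latency_s": mean(r.latency_s for r in results),
        "turn_take_rate": sum(r.took_turn for r in results) / n,
    }


# Made-up demo records, not data from the paper.
demo = [
    SampleResult(True, 4.0, True),
    SampleResult(False, 6.0, True),
    SampleResult(True, 5.0, False),
    SampleResult(False, 5.0, True),
]
summary = summarize(demo)
```

Keeping the three metrics in one summary makes the trade-off in the abstract visible per system: a configuration can lead on one dimension while trailing on another.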

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FDB-v3 adds real human audio with disfluency labels and chained tool tasks to the benchmark line, but the reported model comparisons rest on unvalidated annotations and scoring, with no dataset scale or agreement metrics reported.

The main thing to know is that this paper extends the authors' earlier benchmark by using actual recorded human speech annotated for five disfluency categories and pairing it with multi-step tool-use scenarios across four domains. They run six systems through it and report specific numbers on Pass@1 accuracy, latency, turn-taking, and interruption avoidance, with GPT-Realtime ahead on Pass@1 and interruption avoidance and the cascaded baseline slowest but perfect on turns. All models still fail most often on self-corrections and harder chained tasks.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Full-Duplex-Bench-v3 (FDB-v3), a benchmark for full-duplex spoken language models consisting of real human audio annotated for five disfluency categories and paired with multi-step chained-API scenarios across four task domains. It evaluates six model configurations (GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a Whisper+GPT-4o+TTS cascaded baseline) on Pass@1 accuracy, latency, interruption avoidance, and turn-taking. It reports that GPT-Realtime leads with a Pass@1 of 0.600 and 13.5% on interruption avoidance, that Gemini Live 3.1 has the lowest latency at 4.25 s but the lowest turn-take rate at 78.0%, and that the cascaded baseline has perfect turn-taking but the highest latency at 10.12 s. Self-correction handling and multi-step reasoning are identified as the primary failure modes.

Significance. If the dataset annotations and scoring protocols prove reliable and reproducible, the benchmark would meaningfully advance evaluation of voice agents by moving beyond clean-speech assumptions to naturalistic disfluencies and tool-use chaining, providing concrete trade-off data across end-to-end and cascaded systems. The reported numbers highlight actionable differences in current models.

major comments (2)

[Abstract] Abstract: the central claims rest on a dataset of real human audio annotated for five disfluency categories and paired with chained-API scenarios, yet no dataset size, number of annotators, annotation guidelines, or inter-annotator agreement statistics are reported. This directly affects verifiability of all comparative results (e.g., the Pass@1 and latency rankings).
[Evaluation Methodology] Evaluation section (implied by reported metrics): no rubric is supplied for determining Pass@1 success on multi-step tool sequences, no definition or detection method is given for interruption avoidance or turn-taking events, and no statistical tests, confidence intervals, or error analysis accompany the numerical comparisons (e.g., 0.600 vs. other Pass@1 values).
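
As an illustration of the missing uncertainty estimates, the sketch below computes a Wilson score interval for a Pass@1 proportion. The sample count of 100 is purely hypothetical, since no dataset size is reported here; `wilson_interval` is an illustrative helper, not something taken from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical: 60 passes out of an assumed 100 samples (Pass@1 = 0.600).
low, high = wilson_interval(60, 100)
```

Under that assumed sample size the interval runs from roughly 0.50 to 0.69, wide enough that a gap of a few points between two systems would not be distinguishable without more data or a paired test.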

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects of verifiability and methodological transparency. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: [Abstract] Abstract: the central claims rest on a dataset of real human audio annotated for five disfluency categories and paired with chained-API scenarios, yet no dataset size, number of annotators, annotation guidelines, or inter-annotator agreement statistics are reported. This directly affects verifiability of all comparative results (e.g., the Pass@1 and latency rankings).

Authors: We agree that the abstract should provide these details to support verifiability of the reported rankings. While the full manuscript describes the dataset construction process, we acknowledge that explicit statistics on size, annotators, guidelines, and inter-annotator agreement are not summarized in the abstract. In the revised version, we will expand the abstract to include a concise statement of these elements and add a dedicated summary table or paragraph in the main text to make the annotation protocol fully transparent.
Revision: yes

Referee: [Evaluation Methodology] Evaluation section (implied by reported metrics): no rubric is supplied for determining Pass@1 success on multi-step tool sequences, no definition or detection method is given for interruption avoidance or turn-taking events, and no statistical tests, confidence intervals, or error analysis accompany the numerical comparisons (e.g., 0.600 vs. other Pass@1 values).

Authors: We agree that the evaluation methodology section requires greater explicitness. The manuscript reports aggregate metrics but does not supply the full rubrics, detection criteria, or accompanying statistical analysis. We will revise the evaluation section to define the Pass@1 rubric for chained tasks, specify detection methods for interruptions and turn-taking based on audio timing, include confidence intervals and appropriate statistical tests for all comparisons, and expand the error analysis to cover failure modes in greater detail.
Revision: yes
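
One way a timing-based detection rule of the kind promised above could look is sketched below. This is an assumption-laden illustration, not the authors' method: the function name `classify_turn`, the 0.2 s overlap tolerance, and the 15 s response window are all hypothetical choices.

```python
def classify_turn(user_end_s, agent_starts_s, overlap_tolerance_s=0.2, max_wait_s=15.0):
    """Label one exchange from timestamps alone.

    user_end_s: time (seconds) at which the user finished speaking.
    agent_starts_s: onset times (seconds) of the agent's speech segments.
    Returns 'interruption', 'turn_taken', or 'no_turn'.
    """
    if not agent_starts_s:
        return "no_turn"  # the agent never spoke
    first = min(agent_starts_s)
    # Hypothetical rule: speech that begins clearly before the user is done
    # counts as an interruption.
    if first < user_end_s - overlap_tolerance_s:
        return "interruption"
    # Hypothetical rule: a response inside the window counts as taking the turn.
    if first <= user_end_s + max_wait_s:
        return "turn_taken"
    return "no_turn"


labels = [
    classify_turn(5.0, [3.0, 7.0]),
    classify_turn(5.0, [5.5]),
    classify_turn(5.0, []),
    classify_turn(5.0, [30.0]),
]
```

A rule this simple would count a short backchannel such as "mm-hm" as an interruption, which is one reason an explicit published definition matters for comparing systems.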

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model evaluations

The paper introduces Full-Duplex-Bench-v3 as a new dataset of real human audio with disfluency annotations and chained-API scenarios, then reports direct empirical results for six model configurations on accuracy (Pass@1), latency, and turn-taking metrics. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. No self-citations are invoked as load-bearing premises, and results are presented as straightforward comparisons without reduction to inputs by construction. This is a standard benchmark paper whose central claims rest on external model performance rather than internal self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark and reports evaluation results without relying on mathematical derivations, fitted parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5519 in / 1202 out tokens · 66631 ms · 2026-05-10T19:12:27.932399+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
cs.SD 2026-05 accept novelty 8.0

EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.
