Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

Guan-Ting Lin; Hung-yi Lee; Jiachen Lian; Qirui Wang; Shih-Yun Shan Kuan; Shinji Watanabe; Tingle Li

arxiv: 2507.23159 · v4 · submitted 2025-07-30 · 📡 eess.AS

Pith reviewed 2026-05-19 01:28 UTC · model grok-4.3

classification 📡 eess.AS

keywords: full-duplex speech, overlap handling, spoken dialogue systems, speech overlap benchmark, responsive strategy, floor-holding strategy, full-duplex agents, conversation flow

The pith

Full-duplex speech models split into responsive or floor-holding strategies when overlaps occur.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Full-Duplex-Bench v1.5 to systematically test how full-duplex spoken dialogue systems manage overlapping speech, a key barrier to making machine conversations feel natural rather than rigid and turn-based. It simulates four overlap scenarios (user interruption, user backchannel, talking to others, and background speech) and measures each model's categorical behaviors, stop and response latencies, and prosodic adaptation. Applied to five state-of-the-art agents, the benchmark uncovers two clear patterns: some models respond rapidly to user input, while others filter out overlapping events to maintain conversational flow. This matters because, without reliable overlap handling, spoken AI systems will continue to feel unnatural in real-time exchanges.

Core claim

Full-Duplex-Bench v1.5 reveals that state-of-the-art full-duplex agents follow one of two strategies during speech overlap: a responsive approach that prioritizes rapid response to user input, or a floor-holding approach that preserves conversational flow by filtering overlapping events. The benchmark achieves this by simulating four representative overlap scenarios and supplying metrics on dialogue behaviors, latencies, and prosodic changes, all within a framework usable with both open-source and commercial models.

What carries the argument

Full-Duplex-Bench v1.5, an automated evaluation framework that simulates four overlap scenarios and quantifies categorical behaviors, stop and response latency, plus prosodic adaptation.
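
As a rough illustration of the latency metrics, the sketch below computes a stop latency and a response latency from speech timestamps. The definitions used here are assumptions, since this page does not give the paper's exact ones: stop latency is taken as the time from the onset of the overlapping speech until the agent falls silent, and response latency as the time from the end of the overlapping speech until the agent's next utterance begins. The `Segment` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """A stretch of speech, in seconds from the start of the recording."""
    start: float
    end: float


def stop_latency(agent: List[Segment], overlap: Segment) -> Optional[float]:
    """Seconds from overlap onset until the overlapped agent segment ends.

    Assumed definition. Returns None if the agent was not speaking
    at the moment the overlapping speech began.
    """
    for seg in agent:
        if seg.start <= overlap.start < seg.end:
            return seg.end - overlap.start
    return None


def response_latency(agent: List[Segment], overlap: Segment) -> Optional[float]:
    """Seconds from the end of the overlapping speech to the agent's next onset.

    Assumed definition. Returns None if the agent never starts a new
    segment after the overlapping speech ends.
    """
    onsets = [seg.start for seg in agent if seg.start >= overlap.end]
    return min(onsets) - overlap.end if onsets else None


# Made-up timestamps: the agent talks, is interrupted, stops, then replies.
agent_speech = [Segment(0.0, 4.5), Segment(6.2, 9.0)]
user_interruption = Segment(3.5, 5.5)

print(stop_latency(agent_speech, user_interruption))
print(response_latency(agent_speech, user_interruption))
```

In practice the segments would come from a voice activity detector or forced alignment run over the agent and user channels; that step is outside this sketch.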

If this is right

Developers gain a reproducible way to compare overlap handling across open-source and API-based full-duplex models.
Models can be categorized by whether they favor quick replies or filtered continuity during overlaps.
The metrics on latency and prosody give concrete targets for improving naturalness in spoken dialogue.
Practitioners receive tools to accelerate creation of robust full-duplex systems that handle overlaps without breaking flow.
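
One way to picture the categorization in the list above is a simple rule over per-trial outcomes. The sketch below is purely illustrative: the yield-rate statistic, the 0.5 threshold, the function names, and the trial data are all assumptions, not the paper's procedure or results.

```python
from typing import Dict, List


def yield_rate(stopped: List[bool]) -> float:
    """Fraction of overlap trials in which the agent stopped speaking."""
    if not stopped:
        raise ValueError("need at least one trial")
    return sum(stopped) / len(stopped)


def label_strategy(stopped: List[bool], threshold: float = 0.5) -> str:
    """Label an agent 'responsive' if it yields in at least `threshold` of trials.

    The threshold is a hypothetical cut-off chosen for illustration only.
    """
    return "responsive" if yield_rate(stopped) >= threshold else "floor-holding"


# Made-up trial outcomes for two hypothetical agents (not results from the paper).
trials: Dict[str, List[bool]] = {
    "agent_a": [True, True, False, True],
    "agent_b": [False, False, True, False],
}
labels = {name: label_strategy(outcomes) for name, outcomes in trials.items()}
print(labels)
```

A single threshold on one statistic is a deliberate simplification; a real comparison would likely look at behavior per scenario, since stopping for an interruption and stopping for background speech are not equally desirable.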

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A hybrid model that switches between responsive and floor-holding behavior based on context might outperform either pure strategy.
Extending the benchmark to real acoustic environments with variable noise could test whether the simulated scenarios generalize.
The same evaluation approach might apply to other real-time interactive systems where overlap or interruption is common.

Load-bearing premise

The four simulated overlap scenarios are representative enough to systematically probe and reveal meaningful differences in model behavior during real speech overlap.

What would settle it

If live user studies with actual overlapping speech show that the five agents do not consistently display either the responsive or floor-holding patterns, the benchmark's claim to reveal divergent strategies would not hold.

Original abstract

Full-duplex spoken dialogue systems promise to transform human-machine interaction from a rigid, turn-based protocol into a fluid, natural conversation. However, the central challenge to realizing this vision, managing overlapping speech, remains critically under-evaluated. We introduce Full-Duplex-Bench v1.5, the first fully automated benchmark designed to systematically probe how models behave during speech overlap. The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech. Our framework, compatible with open-source and commercial API-based models, provides a comprehensive suite of metrics analyzing categorical dialogue behaviors, stop and response latency, and prosodic adaptation. Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events. Our open-source framework enables practitioners to accelerate the development of robust full-duplex systems by providing the tools for reproducible evaluation.
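
To make the idea of a simulated overlap concrete, the sketch below mixes an overlapping event (for example an interruption or background speech) into a base signal at a chosen sample offset. This is a minimal illustration under stated assumptions: the abstract does not describe how the benchmark constructs its overlaps, and `mix_at_offset`, its parameters, and the toy sample values are hypothetical.

```python
from typing import List


def mix_at_offset(base: List[float], event: List[float],
                  offset: int, gain: float = 1.0) -> List[float]:
    """Add `event` (scaled by `gain`) into `base` starting at sample `offset`.

    The output is padded with silence if the event runs past the end of `base`.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    length = max(len(base), offset + len(event))
    mixed = list(base) + [0.0] * (length - len(base))
    for i, sample in enumerate(event):
        mixed[offset + i] += gain * sample
    return mixed


# Toy signals: a constant base and a short event that starts near the end.
base_signal = [0.5, 0.5, 0.5, 0.5]
overlap_event = [1.0, 1.0]
mixed_signal = mix_at_offset(base_signal, overlap_event, offset=3, gain=0.25)
print(mixed_signal)
```

Varying `offset` controls when the overlap begins relative to the ongoing speech, and `gain` gives a crude handle on how loud the overlapping event is.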

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Full-Duplex-Bench v1.5 as the first fully automated benchmark for systematically probing overlap handling in full-duplex spoken dialogue systems. It simulates four representative overlap scenarios (user interruption, user backchannel, talking to others, and background speech), supplies metrics for categorical dialogue behaviors, stop/response latency, and prosodic adaptation, and benchmarks five state-of-the-art agents. The evaluation identifies two divergent strategies: a responsive approach that prioritizes rapid response to user input and a floor-holding approach that filters overlapping events to preserve conversational flow. The open-source framework is compatible with both open-source and commercial API-based models.

Significance. If the simulations prove representative of real overlaps and the metrics cleanly separate behavioral strategies without confounding artifacts, this benchmark could meaningfully advance full-duplex dialogue research by supplying a reproducible, automated evaluation tool. The broad compatibility with open-source and commercial models and the open-source release are practical strengths that could accelerate development of more natural spoken interaction systems.

major comments (2)

[Abstract] The claim that benchmarking reveals two divergent strategies depends on the four simulated overlap scenarios being representative enough to reveal meaningful differences; however, the abstract supplies no details on acoustic modeling of overlaps, validation of simulation fidelity, exact definitions of categorical behaviors, or controls for confounding factors, leaving the central classification unsupported by inspectable evidence.
[Abstract] No quantitative results (e.g., per-model metric scores, latency values, or statistical tests) are reported to justify grouping the five agents into exactly two strategies, which is load-bearing for the main empirical finding.

minor comments (1)

[Abstract] The title references version 1.5 without any mention of prior versions or incremental changes, which could be clarified for reader context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comments point by point below, focusing on revisions to the abstract to improve clarity and support for our claims.

Point-by-point responses

Referee: [Abstract] The claim that benchmarking reveals two divergent strategies depends on the four simulated overlap scenarios being representative enough to reveal meaningful differences; however, the abstract supplies no details on acoustic modeling of overlaps, validation of simulation fidelity, exact definitions of categorical behaviors, or controls for confounding factors, leaving the central classification unsupported by inspectable evidence.

Authors: We agree that the abstract, in its current concise form, does not include these methodological details. We will revise the abstract to briefly describe the acoustic modeling of overlaps, the validation of simulation fidelity, the definitions of categorical behaviors, and the controls for confounding factors. This will make the basis for the strategy classification more transparent and inspectable directly from the abstract.
revision: yes

Referee: [Abstract] No quantitative results (e.g., per-model metric scores, latency values, or statistical tests) are reported to justify grouping the five agents into exactly two strategies, which is load-bearing for the main empirical finding.

Authors: We agree that the abstract currently reports no quantitative results to support the grouping. We will revise the abstract to include key quantitative results, such as representative per-model metric scores, latency values, and reference to statistical tests that justify the identification of the two divergent strategies. This will provide direct evidence for the main empirical finding within the abstract.
revision: yes

Circularity Check

0 steps flagged

Empirical benchmark introduction with no derivation or self-referential elements

The provided abstract and full text describe an empirical benchmark for evaluating full-duplex speech models on simulated overlap scenarios. It introduces metrics for behaviors, latency, and prosody, then reports observed strategies from benchmarking five agents. No equations, first-principles derivations, parameter fitting, predictions, or self-citations appear. The central claim rests on empirical observation rather than any reduction to inputs by construction. This is a standard benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities. The contribution is an empirical evaluation framework rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5699 in / 1186 out tokens · 40175 ms · 2026-05-19T01:28:15.747285+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech... Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
cs.SD · 2026-05 · accept · novelty 8.0
EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL · 2026-05 · unverdicted · novelty 7.0
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL · 2026-04 · unverdicted · novelty 6.0
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
eess.AS · 2026-04 · unverdicted · novelty 5.0
A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.