Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models
Pith reviewed 2026-05-19 01:28 UTC · model grok-4.3
The pith
Full-duplex speech models split into responsive or floor-holding strategies when overlaps occur.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Full-Duplex-Bench v1.5 reveals that state-of-the-art full-duplex agents follow one of two strategies during speech overlap: a responsive approach that prioritizes rapid response to user input, or a floor-holding approach that preserves conversational flow by filtering overlapping events. The benchmark achieves this by simulating four representative overlap scenarios and supplying metrics on dialogue behaviors, latencies, and prosodic changes, all within a framework usable with both open-source and commercial models.
What carries the argument
Full-Duplex-Bench v1.5, an automated evaluation framework that simulates four overlap scenarios and quantifies categorical behaviors, stop and response latency, plus prosodic adaptation.
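The stop and response latencies named above could be computed from speech-activity timestamps roughly as follows. This is a minimal sketch, not the benchmark's implementation: the `measure_latency` function, the `(start, end)` segment representation, and the `max_stop_s` cutoff (beyond which the model is treated as having kept the floor) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical representation: one (start_s, end_s) pair per stretch of model speech.
Segment = Tuple[float, float]


@dataclass
class OverlapLatency:
    stop_latency: Optional[float]      # None if the model did not yield the floor
    response_latency: Optional[float]  # None if the model never starts a new segment


def measure_latency(
    model_segments: List[Segment],
    overlap_start: float,
    overlap_end: float,
    max_stop_s: float = 2.0,  # assumed cutoff, not taken from the paper
) -> OverlapLatency:
    """Estimate stop and response latency around one overlap event."""
    segments = sorted(model_segments)

    # Stop latency: time from overlap onset until the segment that was
    # active at onset ends, counted only if it ends within max_stop_s.
    stop = None
    for start, end in segments:
        if start <= overlap_start < end:
            elapsed = end - overlap_start
            if elapsed <= max_stop_s:
                stop = elapsed
            break

    # Response latency: time from the end of the overlapping speech until
    # the model begins its next segment.
    response = None
    for start, _ in segments:
        if start >= overlap_end:
            response = start - overlap_end
            break

    return OverlapLatency(stop, response)


if __name__ == "__main__":
    # Model stops 0.4 s after the user cuts in, then answers 0.5 s after the user finishes.
    print(measure_latency([(0.0, 5.4), (7.0, 9.0)], overlap_start=5.0, overlap_end=6.5))
    # Model talks straight through the overlap in one segment.
    print(measure_latency([(0.0, 10.0)], overlap_start=5.0, overlap_end=6.5))
```

Under these assumptions, a model that yields quickly produces small values for both fields, while a model that talks through the overlap in a single uninterrupted segment returns `None` for both.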
If this is right
- Developers gain a reproducible way to compare overlap handling across open-source and API-based full-duplex models.
- Models can be categorized by whether they favor quick replies or filtered continuity during overlaps.
- The metrics on latency and prosody give concrete targets for improving naturalness in spoken dialogue.
- Practitioners receive tools to accelerate creation of robust full-duplex systems that handle overlaps without breaking flow.
Where Pith is reading between the lines
- A hybrid model that switches between responsive and floor-holding behavior based on context might outperform either pure strategy.
- Extending the benchmark to real acoustic environments with variable noise could test whether the simulated scenarios generalize.
- The same evaluation approach might apply to other real-time interactive systems where overlap or interruption is common.
Load-bearing premise
The four simulated overlap scenarios are representative enough to systematically probe and reveal meaningful differences in model behavior during real speech overlap.
What would settle it
If live user studies with actual overlapping speech show that the five agents do not consistently display either the responsive or floor-holding patterns, the benchmark's claim to reveal divergent strategies would not hold.
Original abstract
Full-duplex spoken dialogue systems promise to transform human-machine interaction from a rigid, turn-based protocol into a fluid, natural conversation. However, the central challenge to realizing this vision, managing overlapping speech, remains critically under-evaluated. We introduce Full-Duplex-Bench v1.5, the first fully automated benchmark designed to systematically probe how models behave during speech overlap. The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech. Our framework, compatible with open-source and commercial API-based models, provides a comprehensive suite of metrics analyzing categorical dialogue behaviors, stop and response latency, and prosodic adaptation. Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events. Our open-source framework enables practitioners to accelerate the development of robust full-duplex systems by providing the tools for reproducible evaluation.
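One plausible way to simulate an overlap event is to mix a second speech signal into the input stream at a chosen onset time. The abstract does not say how the benchmark constructs its scenarios, so the sketch below is purely illustrative: `mix_overlap`, its parameters, the 16 kHz default, and the additive-mix-with-clipping approach are all assumptions.

```python
from typing import List


def mix_overlap(
    base: List[float],
    overlap: List[float],
    onset_s: float,
    sample_rate: int = 16000,  # assumed sample rate
    gain: float = 1.0,         # assumed scaling of the overlapping speech
) -> List[float]:
    """Additively mix `overlap` into `base`, starting at `onset_s` seconds.

    Samples are floats in [-1.0, 1.0]. The output is extended with silence
    if the overlap runs past the end of `base`, and clipped to [-1.0, 1.0].
    """
    onset = int(round(onset_s * sample_rate))
    if onset < 0:
        raise ValueError("onset_s must be non-negative")

    # Pad a copy of the base signal so the overlap always fits.
    length = max(len(base), onset + len(overlap))
    mixed = base + [0.0] * (length - len(base))

    for i, sample in enumerate(overlap):
        mixed[onset + i] += gain * sample

    # Hard-clip to the valid amplitude range.
    return [max(-1.0, min(1.0, s)) for s in mixed]


if __name__ == "__main__":
    # Toy signals at 4 samples per second so the indices are easy to follow.
    base = [0.1, 0.1, 0.1, 0.1]
    interruption = [0.5, 0.5]
    print(mix_overlap(base, interruption, onset_s=0.5, sample_rate=4))
```

Varying `gain` and the content of `overlap` is one way the four scenario types (interruption, backchannel, talking to others, background speech) might be distinguished in such a setup, though that mapping is also an assumption here.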
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Full-Duplex-Bench v1.5 as the first fully automated benchmark for systematically probing overlap handling in full-duplex spoken dialogue systems. It simulates four representative overlap scenarios (user interruption, user backchannel, talking to others, and background speech), supplies metrics for categorical dialogue behaviors, stop/response latency, and prosodic adaptation, and benchmarks five state-of-the-art agents. The evaluation identifies two divergent strategies: a responsive approach that prioritizes rapid response to user input and a floor-holding approach that filters overlapping events to preserve conversational flow. The open-source framework is compatible with both open-source and commercial API-based models.
Significance. If the simulations prove representative of real overlaps and the metrics cleanly separate behavioral strategies without confounding artifacts, this benchmark could meaningfully advance full-duplex dialogue research by supplying a reproducible, automated evaluation tool. The broad compatibility with open-source and commercial models and the open-source release are practical strengths that could accelerate development of more natural spoken interaction systems.
major comments (2)
- [Abstract] The claim that benchmarking reveals two divergent strategies depends on the four simulated overlap scenarios being representative enough to reveal meaningful differences; however, the abstract supplies no details on acoustic modeling of overlaps, validation of simulation fidelity, exact definitions of categorical behaviors, or controls for confounding factors, leaving the central classification unsupported by inspectable evidence.
- [Abstract] No quantitative results (e.g., per-model metric scores, latency values, or statistical tests) are reported to justify grouping the five agents into exactly two strategies, which is load-bearing for the main empirical finding.
minor comments (1)
- [Abstract] The title references version 1.5 without any mention of prior versions or incremental changes, which could be clarified for reader context.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address the major comments point by point below, focusing on revisions to the abstract to improve clarity and support for our claims.
Point-by-point responses
- Referee: [Abstract] The claim that benchmarking reveals two divergent strategies depends on the four simulated overlap scenarios being representative enough to reveal meaningful differences; however, the abstract supplies no details on acoustic modeling of overlaps, validation of simulation fidelity, exact definitions of categorical behaviors, or controls for confounding factors, leaving the central classification unsupported by inspectable evidence.
  Authors: We agree that the abstract, in its current concise form, does not include these methodological details. We will revise the abstract to briefly describe the acoustic modeling of overlaps, the validation of simulation fidelity, the definitions of categorical behaviors, and the controls for confounding factors. This will make the basis for the strategy classification more transparent and inspectable directly from the abstract.
  Revision: yes
- Referee: [Abstract] No quantitative results (e.g., per-model metric scores, latency values, or statistical tests) are reported to justify grouping the five agents into exactly two strategies, which is load-bearing for the main empirical finding.
  Authors: We agree that the abstract currently reports no quantitative results to support the grouping. We will revise the abstract to include key quantitative results, such as representative per-model metric scores, latency values, and reference to statistical tests that justify the identification of the two divergent strategies. This will provide direct evidence for the main empirical finding within the abstract.
  Revision: yes
Circularity Check
Empirical benchmark introduction with no derivation or self-referential elements
Full rationale
The provided abstract and full text describe an empirical benchmark for evaluating full-duplex speech models on simulated overlap scenarios. It introduces metrics for behaviors, latency, and prosody, then reports observed strategies from benchmarking five agents. No equations, first-principles derivations, parameter fitting, predictions, or self-citations appear. The central claim rests on empirical observation rather than any reduction to inputs by construction. This is a standard benchmark paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech... Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
  EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
  A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.