MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
Pith reviewed 2026-05-17 22:35 UTC · model grok-4.3
The pith
Current full-duplex speech models lose performance consistency across multiple conversation rounds and quality checks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MTR-DuplexBench segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment and evaluates FD-SLMs across conversational features, dialogue quality, instruction following, and safety, with results indicating that current models struggle to keep consistent performance over multiple rounds and across these dimensions.
What carries the argument
MTR-DuplexBench, which turns ongoing full-duplex audio into discrete turns while measuring multiple evaluation aspects at once.
If this is right
- Testing FD-SLMs must include multi-round scenarios to capture real-world behavior.
- Benchmarks for speech models should address turn boundaries and context retention explicitly.
- Model training should target stability in performance from one turn to the next.
- Safety and instruction-following checks become necessary alongside conversational metrics.
Where Pith is reading between the lines
- This segmentation method could help create training data for more stable multi-turn models.
- The benchmark structure might apply to evaluating text-based dialogue systems with similar context issues.
- Automated ways to detect turn boundaries could reduce reliance on manual segmentation.
Load-bearing premise
Segmenting continuous full-duplex audio into discrete turns can be done reliably enough to support fair turn-by-turn assessment without artifacts from blurred boundaries or context loss.
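One way to probe this premise is to compare automatically detected turn boundaries with human-marked ones. The sketch below is illustrative only: the function name `match_boundaries`, the 0.5-second tolerance, and the greedy matching rule are assumptions made here, not details taken from the paper.

```python
def match_boundaries(predicted, reference, tolerance=0.5):
    """Greedily match boundary times (seconds) one-to-one within a tolerance.

    Returns (precision, recall, f1). The tolerance is a hypothetical choice.
    """
    predicted = sorted(predicted)
    reference = sorted(reference)
    i = j = hits = 0
    while i < len(predicted) and j < len(reference):
        diff = predicted[i] - reference[j]
        if abs(diff) <= tolerance:
            # Close enough: count a hit and consume both boundaries.
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            # Predicted boundary is too early to match anything; skip it.
            i += 1
        else:
            # Reference boundary was missed; skip it.
            j += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

A low F1 under a reasonable tolerance would suggest that turn-level scores partly reflect segmentation error rather than model behavior.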
What would settle it
Finding that FD-SLMs maintain steady performance across many rounds when measured with MTR-DuplexBench would contradict the reported difficulties.
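"Steady performance across many rounds" can be made concrete in several ways. As a minimal sketch, assuming each dialogue yields one numeric score per round, the trend can be summarized by the least-squares slope of the per-round mean; the helper names and the choice of slope as the summary statistic are assumptions for illustration, not the benchmark's own protocol.

```python
from statistics import mean


def per_round_means(scores_by_dialogue):
    """Average the scores of all dialogues at each round index.

    scores_by_dialogue: list of lists, one inner list of per-round scores
    per dialogue. Rounds beyond the shortest dialogue are ignored.
    """
    n_rounds = min(len(d) for d in scores_by_dialogue)
    return [mean(d[r] for d in scores_by_dialogue) for r in range(n_rounds)]


def round_slope(round_means):
    """Least-squares slope of mean score against the 0-based round index."""
    n = len(round_means)
    if n < 2:
        raise ValueError("need at least two rounds")
    x_bar = (n - 1) / 2
    y_bar = mean(round_means)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(round_means))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den
```

A slope near zero across many rounds would be the kind of evidence that contradicts the reported difficulties; a clearly negative slope would support them.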
Original abstract
Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions, neglecting the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. Also, existing benchmarks often focus solely on evaluating conversational features, neglecting other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for a comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also incorporates various evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MTR-DuplexBench, a new benchmark for comprehensive multi-round evaluation of Full-Duplex Speech Language Models (FD-SLMs). It segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment and evaluates models across conversational features, dialogue quality, instruction following, and safety. The authors report that current FD-SLMs exhibit difficulties maintaining consistent performance across multiple rounds and evaluation dimensions, underscoring the benchmark's value.
Significance. If the segmentation protocol and evaluation metrics prove robust, this benchmark would address a clear gap in moving beyond single-round assessments for full-duplex conversational models. The multi-dimensional evaluation scope and public release of code and data are positive contributions that could support future model development in real-time speech interaction.
major comments (2)
- [Benchmark Construction and Experimental Setup] The central experimental claim, that FD-SLMs show inconsistent multi-round performance, depends on reliable turn segmentation, yet the manuscript provides no description of the segmentation algorithm, no human validation of boundary accuracy, and no ablation demonstrating that performance drops persist under manually corrected boundaries. This directly engages the abstract's own identification of 'blurred turn boundaries' as a core challenge and leaves open the possibility that reported inconsistencies are benchmark-induced artifacts rather than intrinsic model limitations.
- [Experimental Results] The abstract states that experiments were conducted and that models showed difficulties, but supplies no information on the concrete metrics for each evaluation dimension, the selection of baselines, the number of models or dialogues tested, or statistical significance of the consistency drops. Without these details the evidence supporting the headline claim remains too thin to evaluate.
minor comments (1)
- A summary table listing each evaluation dimension, its specific metrics, and scoring protocol would improve clarity and allow readers to quickly grasp the benchmark's scope.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the paper.
- Referee: [Benchmark Construction and Experimental Setup] The central experimental claim, that FD-SLMs show inconsistent multi-round performance, depends on reliable turn segmentation, yet the manuscript provides no description of the segmentation algorithm, no human validation of boundary accuracy, and no ablation demonstrating that performance drops persist under manually corrected boundaries. This directly engages the abstract's own identification of 'blurred turn boundaries' as a core challenge and leaves open the possibility that reported inconsistencies are benchmark-induced artifacts rather than intrinsic model limitations.
  Authors: We appreciate the referee pointing out this critical detail. The current manuscript describes the overall segmentation approach at a high level but lacks the requested specifics. In the revised version we will add a dedicated subsection that fully specifies the segmentation algorithm, including the acoustic and linguistic cues employed to detect turn boundaries in continuous full-duplex audio. We will also report human validation results on a random sample of 200 turns (inter-annotator agreement and boundary accuracy) and include an ablation that re-evaluates a subset of models on manually corrected boundaries. This will allow readers to assess whether the observed multi-round inconsistencies remain after boundary correction.
  Revision: yes
- Referee: [Experimental Results] The abstract states that experiments were conducted and that models showed difficulties, but supplies no information on the concrete metrics for each evaluation dimension, the selection of baselines, the number of models or dialogues tested, or statistical significance of the consistency drops. Without these details the evidence supporting the headline claim remains too thin to evaluate.
  Authors: We agree that the experimental section would benefit from greater transparency. In the revision we will expand the results section to provide: (i) explicit metric definitions and scoring rubrics for conversational features, dialogue quality, instruction following, and safety; (ii) the rationale and list of baseline FD-SLMs; (iii) the exact counts of models and multi-round dialogues evaluated; and (iv) statistical tests (including p-values and confidence intervals) for the reported drops in consistency across rounds. These additions will make the evidence for our claims fully evaluable.
  Revision: yes
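The confidence intervals promised in the response could be obtained in many ways. As one hedged illustration, the sketch below computes a paired percentile-bootstrap interval for the drop between first-round and last-round scores. The function name, the default resample count, and the choice of a percentile bootstrap are assumptions made here; the rebuttal does not say which tests the authors will use.

```python
import random
from statistics import mean


def bootstrap_drop_ci(first_round, last_round, n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile-bootstrap CI for the mean of (first - last) scores.

    first_round and last_round hold one score per dialogue, in the same order.
    Returns (estimate, lower, upper).
    """
    if len(first_round) != len(last_round) or not first_round:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(first_round, last_round)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    # Resample dialogues with replacement and record the mean drop each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper
```

If the interval excludes zero, the drop is unlikely to be resampling noise alone, although that says nothing about whether segmentation artifacts caused it.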
Circularity Check
No significant circularity in benchmark definition or empirical results
The paper introduces MTR-DuplexBench as a new evaluation tool for multi-round FD-SLM conversations and reports experimental findings on model performance inconsistencies. No derivation chain, equations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on standard benchmark construction and direct evaluation rather than reducing to inputs by construction. This is a typical non-circular benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Continuous full-duplex dialogues can be segmented into discrete turns that preserve the original interaction dynamics for evaluation purposes.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We introduce a turn segmentation methodology for segmenting continuous full-duplex dialogues into discrete turns... GPT-4o... majority voting... 30% time overlap"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "success rate... latency... refusal rate... GPT-score"
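The first quoted passage above mentions majority voting and a 30% time overlap without saying how they combine. The sketch below shows the two generic building blocks such a pipeline might use: an overlap ratio between two time intervals, and a strict majority vote over labels. Both helper names, the normalization by the shorter interval, and the label strings are hypothetical; nothing here should be read as the paper's actual segmentation method.

```python
from collections import Counter


def overlap_ratio(a, b):
    """Shared duration of two (start, end) intervals over the shorter one's length."""
    (a_start, a_end), (b_start, b_end) = a, b
    shared = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return shared / shorter if shorter > 0 else 0.0


def majority_vote(labels):
    """Return the label held by a strict majority of voters, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None
```

Under this reading, a threshold such as 0.3 would be compared against `overlap_ratio`, and several independent judgments about one candidate boundary would be reduced with `majority_vote`.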
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- A Survey of Audio Reasoning in Multimodal Foundation Models
  A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.