
arxiv: 2511.10262 · v3 · submitted 2025-11-13 · 💻 cs.CL · cs.AI · eess.AS

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Pith reviewed 2026-05-17 22:35 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · eess.AS
keywords full-duplex speech models, multi-round evaluation, benchmark, speech language models, dialogue quality, instruction following, conversational AI

The pith

Current full-duplex speech models fail to perform consistently across conversation rounds and across evaluation dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MTR-DuplexBench as a way to test full-duplex speech language models (FD-SLMs) over repeated turns instead of single exchanges. Single-round benchmarks miss problems that only appear in ongoing conversations, such as blurred turn boundaries and inconsistent context during inference. The new benchmark divides continuous audio into discrete turns and scores models on conversational features, dialogue quality, instruction following, and safety. Experiments using the benchmark indicate that existing models struggle to keep their performance steady from one round to the next and across these dimensions.

Core claim

MTR-DuplexBench segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment and evaluates FD-SLMs across conversational features, dialogue quality, instruction following, and safety, with results indicating that current models struggle to keep consistent performance over multiple rounds and across these dimensions.

What carries the argument

MTR-DuplexBench, which turns ongoing full-duplex audio into discrete turns while measuring multiple evaluation aspects at once.

If this is right

  • Testing FD-SLMs must include multi-round scenarios to capture real-world behavior.
  • Benchmarks for speech models should address turn boundaries and context retention explicitly.
  • Model training should target stability in performance from one turn to the next.
  • Safety and instruction-following checks become necessary alongside conversational metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This segmentation method could help create training data for more stable multi-turn models.
  • The benchmark structure might apply to evaluating text-based dialogue systems with similar context issues.
  • Automated ways to detect turn boundaries could reduce reliance on manual segmentation.

Load-bearing premise

Segmenting continuous full-duplex audio into discrete turns can be done reliably enough to support fair turn-by-turn assessment without artifacts from blurred boundaries or context loss.
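
To make this premise concrete, here is a minimal Python sketch of one naive way to cut a two-speaker timeline into turns and flag boundaries that are blurred by overlap. It is not the paper's segmentation procedure, which the abstract does not describe. The `Interval` type, the function names, the speaker labels, and the rule "a new turn opens at each user onset" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    speaker: str  # "user" or "assistant" (hypothetical labels)
    start: float  # seconds
    end: float    # seconds


def segment_turns(intervals: List[Interval]) -> List[List[Interval]]:
    """Group speech intervals into turns; a new turn opens at each user onset.

    Hypothetical rule for illustration only, not the benchmark's algorithm.
    """
    turns: List[List[Interval]] = []
    for iv in sorted(intervals, key=lambda i: i.start):
        if iv.speaker == "user" or not turns:
            turns.append([iv])
        else:
            turns[-1].append(iv)
    return turns


def overlaps_next_turn(turns: List[List[Interval]]) -> List[bool]:
    """For each turn except the last, flag assistant speech that runs past the next onset."""
    flags = []
    for current, nxt in zip(turns, turns[1:]):
        next_onset = nxt[0].start
        flags.append(
            any(iv.speaker == "assistant" and iv.end > next_onset for iv in current)
        )
    return flags


# Toy timeline: the user barges in at 5.2 s while the assistant speaks until 6.0 s.
timeline = [
    Interval("user", 0.0, 2.0),
    Interval("assistant", 2.3, 6.0),
    Interval("user", 5.2, 7.0),
    Interval("assistant", 7.4, 9.0),
]
turns = segment_turns(timeline)
blurred = overlaps_next_turn(turns)
```

In the toy timeline the first turn is flagged because the assistant is still speaking when the next user onset arrives. Any turn-by-turn score for that turn depends on which side of the cut the overlapping audio is assigned to, which is the kind of artifact the premise has to rule out.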

What would settle it

Finding that FD-SLMs maintain steady performance across many rounds when measured with MTR-DuplexBench would contradict the reported difficulties.
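
One way to operationalize "steady performance across many rounds" is to average scores per round and look at the trend. The Python sketch below does that with an ordinary least-squares slope. The score layout (one list of per-round scores per dialogue), the function names, and the toy numbers are assumptions for illustration and are not taken from the paper or its released code.

```python
from statistics import mean
from typing import Dict, List


def per_round_means(scores: Dict[str, List[float]]) -> List[float]:
    """Average score at each round index across dialogues (equal lengths assumed)."""
    return [mean(round_scores) for round_scores in zip(*scores.values())]


def trend_slope(values: List[float]) -> float:
    """Ordinary least-squares slope of score against 1-based round index."""
    xs = list(range(1, len(values) + 1))
    x_bar, y_bar = mean(xs), mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Made-up scores for illustration only; these are not results from the paper.
toy_scores = {
    "dialogue_a": [4.0, 3.0, 2.0],
    "dialogue_b": [5.0, 4.0, 3.0],
}
means = per_round_means(toy_scores)
slope = trend_slope(means)
```

A slope close to zero over many rounds, on every evaluation dimension, would be the outcome that contradicts the reported difficulties. A clearly negative slope, as in the toy data, would be consistent with them.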

Figures

Figures reproduced from arXiv: 2511.10262 by Haoli Bai, Haoning Xu, He Zhang, Irwin King, Lei Zhu, Shaohua Ma, Wenqian Cui, Xiaohui Li.

Figure 1: Illustration of the Blurred Turn Boundary and the Context Inconsistency challenges in the …
Figure 2: Illustration of the assistant response period in the …
Original abstract

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions, neglecting the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. Also, existing benchmarks often focus solely on evaluating conversational features, neglecting other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for a comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also incorporates various evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MTR-DuplexBench, a new benchmark for comprehensive multi-round evaluation of Full-Duplex Speech Language Models (FD-SLMs). It segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment and evaluates models across conversational features, dialogue quality, instruction following, and safety. The authors report that current FD-SLMs exhibit difficulties maintaining consistent performance across multiple rounds and evaluation dimensions, underscoring the benchmark's value.

Significance. If the segmentation protocol and evaluation metrics prove robust, this benchmark would address a clear gap in moving beyond single-round assessments for full-duplex conversational models. The multi-dimensional evaluation scope and public release of code and data are positive contributions that could support future model development in real-time speech interaction.

major comments (2)
  1. [Benchmark Construction and Experimental Setup] The central experimental claim—that FD-SLMs show inconsistent multi-round performance—depends on reliable turn segmentation, yet the manuscript provides no description of the segmentation algorithm, no human validation of boundary accuracy, and no ablation demonstrating that performance drops persist under manually corrected boundaries. This directly engages the abstract's own identification of 'blurred turn boundaries' as a core challenge and leaves open the possibility that reported inconsistencies are benchmark-induced artifacts rather than intrinsic model limitations.
  2. [Experimental Results] The abstract states that experiments were conducted and that models showed difficulties, but supplies no information on the concrete metrics for each evaluation dimension, the selection of baselines, the number of models or dialogues tested, or statistical significance of the consistency drops. Without these details the evidence supporting the headline claim remains too thin to evaluate.
minor comments (1)
  1. A summary table listing each evaluation dimension, its specific metrics, and scoring protocol would improve clarity and allow readers to quickly grasp the benchmark's scope.
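
The human validation requested in major comment 1 could be summarized with a tolerance-based boundary match. The Python sketch below computes the fraction of human-annotated boundaries that have an automatic boundary within a fixed time window. The function name, the 0.2 s tolerance, and the greedy one-to-one matching are assumptions for illustration; the manuscript does not specify a validation protocol.

```python
from typing import List


def boundary_hit_rate(predicted: List[float], reference: List[float],
                      tolerance: float = 0.2) -> float:
    """Fraction of reference boundaries matched by an unused predicted boundary.

    A match means the two timestamps differ by at most `tolerance` seconds.
    Each predicted boundary can be used for at most one reference boundary.
    """
    remaining = sorted(predicted)
    hits = 0
    for ref in sorted(reference):
        match = next((p for p in remaining if abs(p - ref) <= tolerance), None)
        if match is not None:
            remaining.remove(match)
            hits += 1
    return hits / len(reference) if reference else 0.0


# Toy timestamps in seconds, invented for illustration.
predicted_boundaries = [2.1, 5.9, 9.0]
human_boundaries = [2.0, 5.2, 9.1]
rate = boundary_hit_rate(predicted_boundaries, human_boundaries)
```

In the toy data two of the three human boundaries are matched; the middle one is missed because the automatic boundary is 0.7 s late. Reporting this rate at several tolerances would show how sensitive the segmentation is to the blurred boundaries the abstract highlights.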

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [Benchmark Construction and Experimental Setup] The central experimental claim—that FD-SLMs show inconsistent multi-round performance—depends on reliable turn segmentation, yet the manuscript provides no description of the segmentation algorithm, no human validation of boundary accuracy, and no ablation demonstrating that performance drops persist under manually corrected boundaries. This directly engages the abstract's own identification of 'blurred turn boundaries' as a core challenge and leaves open the possibility that reported inconsistencies are benchmark-induced artifacts rather than intrinsic model limitations.

    Authors: We appreciate the referee pointing out this critical detail. The current manuscript describes the overall segmentation approach at a high level but lacks the requested specifics. In the revised version we will add a dedicated subsection that fully specifies the segmentation algorithm, including the acoustic and linguistic cues employed to detect turn boundaries in continuous full-duplex audio. We will also report human validation results on a random sample of 200 turns (inter-annotator agreement and boundary accuracy) and include an ablation that re-evaluates a subset of models on manually corrected boundaries. This will allow readers to assess whether the observed multi-round inconsistencies remain after boundary correction. revision: yes

  2. Referee: [Experimental Results] The abstract states that experiments were conducted and that models showed difficulties, but supplies no information on the concrete metrics for each evaluation dimension, the selection of baselines, the number of models or dialogues tested, or statistical significance of the consistency drops. Without these details the evidence supporting the headline claim remains too thin to evaluate.

    Authors: We agree that the experimental section would benefit from greater transparency. In the revision we will expand the results section to provide: (i) explicit metric definitions and scoring rubrics for conversational features, dialogue quality, instruction following, and safety; (ii) the rationale and list of baseline FD-SLMs; (iii) the exact counts of models and multi-round dialogues evaluated; and (iv) statistical tests (including p-values and confidence intervals) for the reported drops in consistency across rounds. These additions will make the evidence for our claims fully evaluable. revision: yes
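
As a rough illustration of the confidence intervals promised in item (iv), the Python sketch below computes a percentile bootstrap interval for the mean paired drop between the first and last round. The function name, the resampling settings, and the toy scores are assumptions; the rebuttal does not say which statistical tests the authors will use.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_drop_ci(first: List[float], last: List[float],
                      n_resamples: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed mean drop, CI lower bound, CI upper bound).

    The drop is first-round score minus last-round score, paired by dialogue.
    The interval is a percentile bootstrap over dialogues.
    """
    if len(first) != len(last) or not first:
        raise ValueError("need equal-length, non-empty paired scores")
    drops = [f - l for f, l in zip(first, last)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    resampled = sorted(
        mean(rng.choices(drops, k=len(drops))) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(drops), resampled[lo_idx], resampled[hi_idx]


# Made-up per-dialogue scores for illustration only; not results from the paper.
first_round = [4.2, 3.9, 4.5, 4.0, 4.4]
last_round = [3.6, 3.5, 3.8, 3.9, 3.7]
observed, low, high = bootstrap_drop_ci(first_round, last_round)
```

If the interval excludes zero, the drop is unlikely to be resampling noise under this scheme. With only a handful of dialogues, as in the toy data, a bootstrap interval is unreliable, so the exact dialogue counts promised in item (iii) matter for interpreting any such result.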

Circularity Check

0 steps flagged

No significant circularity in benchmark definition or empirical results

Full rationale

The paper introduces MTR-DuplexBench as a new evaluation tool for multi-round FD-SLM conversations and reports experimental findings on model performance inconsistencies. No derivation chain, equations, fitted parameters, or self-referential definitions appear in the provided text. The central claims rest on standard benchmark construction and direct evaluation rather than reducing to inputs by construction. This is a typical non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark rests on the domain assumption that continuous full-duplex speech can be segmented into discrete turns without losing essential context or introducing bias; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Continuous full-duplex dialogues can be segmented into discrete turns that preserve the original interaction dynamics for evaluation purposes.
    This segmentation step is required to enable the turn-by-turn assessment described in the abstract.

pith-pipeline@v0.9.0 · 5525 in / 1169 out tokens · 30704 ms · 2026-05-17T22:35:11.567401+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Survey of Audio Reasoning in Multimodal Foundation Models

    eess.AS 2026-05 unverdicted novelty 2.0

    A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
