DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

Boyong Wu; Chao Yan; Che Liu; Donghang Wu; Eng Siong Chng; Fei Tian; Haoyang Zhang; Hexin Liu; Jun Chen; Qingjian Lin

arxiv: 2605.20755 · v2 · pith:XCGNDNCGnew · submitted 2026-05-20 · 📡 eess.AS

Haoyang Zhang, Jun Chen, Donghang Wu, Yuxin Li, Yuxin Zhang, Xiangyu Tony Zhang, Che Liu, Qingjian Lin, Yizhou Peng, Hexin Liu, Eng Siong Chng, Chao Yan, Boyong Wu, Yechang Huang, Xuerui Yang, Fei Tian

Pith reviewed 2026-05-21 02:37 UTC · model grok-4.3

classification 📡 eess.AS

keywords full-duplex spoken dialogue · speech-language-action model · semantic turn-taking · in-conversation tool calling · dual-stream three-channel · shared timeline decoding · agentic spoken model

The pith

DuplexSLA decodes user audio, assistant speech, and structured actions jointly on one shared 160 ms timeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DuplexSLA is a full-duplex spoken language model built to listen and respond at the same time while also handling planning and tool calls. It processes three synchronized channels through a single backbone: continuous incoming user audio, discrete outgoing assistant audio, and a rate-limited textual action stream. This joint decoding lets the model manage semantic turn-taking such as interruptions and backchannels internally and emit planning steps or tool calls without stopping speech. The authors evaluate the combined capabilities with DuplexSLA-Bench, which tests pause, interrupt, and backchannel behaviors alongside different styles of in-conversation tool use.

Core claim

DuplexSLA is a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. It is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone so that listening, speaking, planning, and tool calling unfold on one shared clock.
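As a rough illustration of the three-channel chunk, the sketch below models one 160 ms step as a plain data structure. The per-chunk counts (2 user features at 80 ms, 4 assistant audio tokens at 40 ms plus one text anchor, at most 10 action tokens) come from the Figure 1 caption; the class name, field names, feature dimension, and token values are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import List

CHUNK_MS = 160                  # shared chunk duration
USER_FEATURES_PER_CHUNK = 2     # 80 ms each (Figure 1 caption)
AUDIO_TOKENS_PER_CHUNK = 4      # 40 ms each, the "4" in TA4
MAX_ACTION_TOKENS = 10          # cap on the action channel per chunk


@dataclass
class Chunk:
    """One 160 ms step of the shared timeline (field names are hypothetical)."""
    user_features: List[List[float]]   # continuous user-audio features
    text_anchor: int                   # text anchor of the TA4 unit
    audio_tokens: List[int]            # discrete assistant audio tokens
    action_tokens: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.user_features) != USER_FEATURES_PER_CHUNK:
            raise ValueError("expected 2 user features per chunk")
        if len(self.audio_tokens) != AUDIO_TOKENS_PER_CHUNK:
            raise ValueError("expected 4 assistant audio tokens per chunk")
        if len(self.action_tokens) > MAX_ACTION_TOKENS:
            raise ValueError("action channel is capped at 10 tokens per chunk")


def chunk_start_ms(index: int) -> int:
    """Start time of chunk `index` on the shared clock."""
    return index * CHUNK_MS


# Both audio channels tile the chunk exactly.
assert USER_FEATURES_PER_CHUNK * 80 == CHUNK_MS
assert AUDIO_TOKENS_PER_CHUNK * 40 == CHUNK_MS

chunk = Chunk(
    user_features=[[0.0, 0.1], [0.2, 0.3]],   # made-up 2-dimensional features
    text_anchor=17,
    audio_tokens=[5, 9, 2, 7],
    action_tokens=["<plan>", "check", "calendar"],
)
print(chunk_start_ms(3))  # 480
```

Keeping all three channels in one record per chunk is what makes the "one shared clock" framing concrete: every field in a `Chunk` refers to the same 160 ms window.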

What carries the argument

Dual-stream three-channel formulation that keeps continuous user audio, discrete assistant audio, and rate-limited action text aligned on a common 160 ms timeline inside one backbone.

If this is right

Semantic turn-taking control for interruption, pause, and backchannel occurs inside the backbone instead of relying on an external semantic VAD.
Planning text and structured tool calls emit on the action channel without halting assistant audio output.
Multi-action sequences and backchannel-triggered tool use interleave directly with ongoing speech.
In-conversation agentic behavior becomes native rather than tied to turn boundaries or external cascades.
End-to-end performance on combined turn-taking and tool-calling scenarios can be measured with DuplexSLA-Bench.
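The consequences listed above can be pictured from the host side with a small sketch. It assumes a hypothetical consumer that reads one action label per chunk and decides whether to keep playing assistant audio. The label strings (`"backchannel"`, `"interrupt"`, and the `"tool:"` prefix) and the `ChunkOutput` type are invented for illustration; in the paper these decisions are made inside the backbone, and its actual label format is not given here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChunkOutput:
    """Hypothetical per-chunk model output seen by a host application."""
    audio_tokens: List[int]
    action: Optional[str] = None   # assumed labels: "backchannel", "interrupt", "tool:<name>"


def run_session(chunks: List[ChunkOutput]) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Return indices of chunks whose audio was played and the tool calls dispatched."""
    played: List[int] = []
    tool_calls: List[Tuple[int, str]] = []
    speaking = True
    for index, chunk in enumerate(chunks):
        if chunk.action == "interrupt":
            speaking = False                        # assistant yields the floor
        elif chunk.action is not None and chunk.action.startswith("tool:"):
            tool_calls.append((index, chunk.action[len("tool:"):]))  # audio continues
        # "backchannel" or no label: nothing changes, speech keeps flowing
        if speaking:
            played.append(index)
    return played, tool_calls


labels = [None, "backchannel", "tool:get_weather", None, "interrupt", None]
session = [ChunkOutput(audio_tokens=[0, 0, 0, 0], action=label) for label in labels]
played, tool_calls = run_session(session)
print(played)      # [0, 1, 2, 3]
print(tool_calls)  # [(2, 'get_weather')]
```

The point of the sketch is the asymmetry: a backchannel label and a tool call both leave playback untouched, and only an interrupt label stops it.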

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared-timeline design may reduce overall system latency by removing separate modules for voice activity detection and planning.
Direct interleaving of actions with speech could support more fluid real-time interactions where the model responds to its own output.
If the three-channel alignment holds across longer sessions, the approach might extend to multi-turn tasks that mix speech with external tool results.
Applying the same joint-decoding structure to other backbone sizes would test whether the synchronization benefit scales independently of model capacity.

Load-bearing premise

A single backbone can produce high-quality audio output while simultaneously delivering accurate semantic turn-taking and in-conversation action emission without external components.

What would settle it

A clear drop in audio quality or rise in turn-taking errors on DuplexSLA-Bench when the action channel is active during speech would show the joint decoding cannot sustain all three tasks at once.

Figures

Figures reproduced from arXiv: 2605.20755 by Boyong Wu, Chao Yan, Che Liu, Donghang Wu, Eng Siong Chng, Fei Tian, Haoyang Zhang, Hexin Liu, Jun Chen, Qingjian Lin, Xiangyu Tony Zhang, Xuerui Yang, Yechang Huang, Yizhou Peng, Yuxin Li, Yuxin Zhang.

**Figure 1.** DuplexSLA chunk-level architecture. Each chunk is 160 ms. The user channel contributes 2 causal audio features (80 ms each); the assistant channel contributes a TA4 unit (one text anchor and 4 discrete audio tokens at 40 ms each); the action channel emits up to 10 text tokens that may be delayed transcript text, planning text, or tool calls. The same backbone autoregressively predicts the assistant TA4 and…

**Figure 2.** Native interaction-control behaviours. (a) A short user backchannel (“You are right”) does not stop the assistant; the action channel emits a backchannel label while assistant speech keeps flowing. (b) When the user starts a real new thought (“You are right, but the project schedule is tight...”), DuplexSLA emits an interrupt label and the assistant yields the floor within a small chunk-level latency.

**Figure 3.** Two in-conversation tool-calling patterns. In the first row, the user issues a backchannel-style request and the action channel emits a tool call without disturbing the assistant. In the second row, a single user turn produces three time-aligned tool calls, each anchored to the relevant chunk on the action channel. (Rows in each panel: User Channel, Assistant Channel, Action Channel.)

**Figure 4.** Data-construction pipeline. (a) An LLM annotates each raw dialogue with tool-call objects (function name, arguments, planning text, semantic offset). (b) The user and assistant utterances are synthesized with TTS and voice cloning, force-aligned, time-merged, and the action-channel labels (backchannel, interrupt, planning, tool calls) are merged at the chunk grid.

**Figure 5.** Audio-data distribution across continued pretraining (left) and post-training (right). CPT is dominated by…
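The Figure 1 caption caps the action channel at 10 text tokens per 160 ms chunk. As a rough illustration of what such a rate limit implies, the sketch below packs a longer token sequence greedily into consecutive chunks and reports how long emission would take. The function name, the example tokens, and the greedy consecutive-chunk packing are assumptions made for this example; the material reproduced here does not describe how the model itself segments a tool call.

```python
from typing import Dict, List

CHUNK_MS = 160
MAX_ACTION_TOKENS_PER_CHUNK = 10   # cap stated in the Figure 1 caption


def schedule_action_tokens(
    tokens: List[str],
    start_chunk: int,
    cap: int = MAX_ACTION_TOKENS_PER_CHUNK,
) -> Dict[int, List[str]]:
    """Greedily assign tokens to consecutive chunks, at most `cap` per chunk."""
    schedule: Dict[int, List[str]] = {}
    for offset in range(0, len(tokens), cap):
        schedule[start_chunk + offset // cap] = tokens[offset:offset + cap]
    return schedule


# A made-up 23-token tool call starting at chunk 5.
call_tokens = [f"tok{i}" for i in range(23)]
schedule = schedule_action_tokens(call_tokens, start_chunk=5)

print(sorted(schedule))                          # [5, 6, 7]
print([len(schedule[c]) for c in sorted(schedule)])  # [10, 10, 3]
print(len(schedule) * CHUNK_MS)                  # 480
```

Under these assumptions a 23-token call occupies three chunks, so it finishes 480 ms after it starts, while assistant audio for those same chunks can proceed in parallel.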

Original abstract

Recent advances in spoken dialogue language models have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and tool calling, leaving real-time agentic behaviour either tied to turn boundaries or relegated to an external cascade. We propose DuplexSLA, a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone, so that listening, speaking, planning, and tool calling unfold on one shared clock. Two capabilities define the model: (1) semantic-driven turn-taking control, where interruption, pause, and backchannel are handled inside the same backbone instead of by an external semantic VAD; and (2) in-conversation planning and tool calling, where planning text and structured tool calls are emitted on the action channel without halting assistant audio, so that multi-action and backchannel-triggered tool use are interleaved with ongoing speech. To evaluate these capabilities together, we further construct DuplexSLA-Bench, a duplex benchmark covering pause, interrupt, and backchannel turn-taking together with three styles of in-conversation tool calling. Our project page, interactive demos, and the DuplexSLA-Bench evaluation suite are publicly available at https://github.com/hyzhang24/DuplexSLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DuplexSLA adds a synchronized action channel to full-duplex speech models on a 160 ms shared timeline, but the available description supplies no results or implementation details to show the joint decoding actually works without trade-offs.

The main takeaway is that this paper folds structured actions and tool calling into a full-duplex spoken model using one backbone and a shared 160 ms chunk timeline. That three-channel setup (continuous user audio, discrete assistant audio, rate-limited actions) is the concrete step beyond prior duplex work that lets planning happen without stopping speech or relying on external cascades.

Referee Report

2 major / 2 minor

Summary. The paper proposes DuplexSLA, a native full-duplex Speech-Language-Action foundation model built on a dual-stream three-channel formulation. It jointly decodes a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel on a shared 160 ms chunk timeline using a single backbone. This design aims to enable semantic-driven turn-taking (interruptions, pauses, backchannels) and in-conversation planning and tool calling without an external VAD or cascade. The authors also introduce DuplexSLA-Bench, a benchmark covering turn-taking scenarios and three styles of in-conversation tool use, and release public demos and an evaluation suite.

Significance. If the joint three-channel decoding can be shown to maintain high audio fidelity while delivering accurate turn-taking and action emission, the work would address a clear gap in existing duplex spoken dialogue models by natively integrating agentic capabilities. The public release of DuplexSLA-Bench and interactive demos is a concrete strength that supports reproducibility and follow-on research in real-time spoken agents.

major comments (2)

[Abstract and Model Description] Abstract and architecture description: the central claim that a single backbone jointly decoding the three channels produces high-quality audio output alongside accurate semantic turn-taking and action emission without external components or hidden trade-offs is not supported by any quantitative results, ablations, loss curves, or baseline comparisons. No performance numbers appear for audio quality, turn-taking accuracy, or action emission success.
[Model Architecture] Model formulation section: the dual-stream three-channel approach is presented at a high level but supplies no equations or diagrams specifying the joint decoding objective, the weighting or balancing of modality-specific losses between continuous audio and discrete action tokens, or the exact alignment mechanism that prevents rate-mismatch artifacts between the rate-limited action channel and the 160 ms audio chunks.

minor comments (2)

[Abstract] The abstract states that 'planning text and structured tool calls are emitted on the action channel without halting assistant audio' but does not clarify whether any post-processing or buffering is applied to the action stream; a short clarifying sentence would improve precision.
[Benchmark Description] DuplexSLA-Bench is introduced to evaluate the combined capabilities, yet the main text provides only a high-level description of the covered scenarios; adding one or two concrete example dialogues or task definitions would aid reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions have been made to strengthen the presentation of our work.

Referee: [Abstract and Model Description] Abstract and architecture description: the central claim that a single backbone jointly decoding the three channels produces high-quality audio output alongside accurate semantic turn-taking and action emission without external components or hidden trade-offs is not supported by any quantitative results, ablations, loss curves, or baseline comparisons. No performance numbers appear for audio quality, turn-taking accuracy, or action emission success.

Authors: We acknowledge that the initial submission emphasizes the novel dual-stream three-channel formulation, the shared 160 ms timeline, and the introduction of DuplexSLA-Bench without including quantitative metrics. This focus was chosen to highlight the architectural departure from cascaded systems. We agree that empirical support strengthens the claims and have incorporated preliminary quantitative results in the revised manuscript. These include audio fidelity metrics (PESQ, STOI, and subjective MOS), turn-taking accuracy for interruptions/pauses/backchannels, and action emission success rates across the three tool-use styles in DuplexSLA-Bench. We also added ablations on joint versus separate decoding and comparisons against cascaded baselines. Loss curves for the combined objective are now shown in the appendix. The public demos remain as qualitative evidence, but the new numbers directly address the central claim.
revision: yes
Referee: [Model Architecture] Model formulation section: the dual-stream three-channel approach is presented at a high level but supplies no equations or diagrams specifying the joint decoding objective, the weighting or balancing of modality-specific losses between continuous audio and discrete action tokens, or the exact alignment mechanism that prevents rate-mismatch artifacts between the rate-limited action channel and the 160 ms audio chunks.

Authors: We agree that a more formal specification improves rigor. In the revised manuscript we have added the joint decoding objective as a weighted sum of the continuous audio reconstruction loss and the discrete action token prediction loss, with explicit weighting coefficients chosen via validation. A new diagram illustrates the chunk-wise alignment: action tokens are emitted at a lower rate and padded or repeated to align with the 160 ms audio chunks, with a synchronization mask that prevents rate-mismatch artifacts. The exact cross-attention and causal masking scheme between the dual streams is now formalized in equations.
revision: yes

Circularity Check

0 steps flagged

No circularity: new dual-stream three-channel formulation and benchmark introduced without self-referential reductions

The paper proposes DuplexSLA as a new architecture using a dual-stream three-channel formulation (continuous user audio, discrete assistant audio, rate-limited textual action) decoded jointly on a 160 ms timeline. Core claims about semantic turn-taking and in-conversation tool calling are presented as direct consequences of this joint decoding backbone rather than derived from fitted parameters, prior self-citations, or renamed empirical patterns. No equations, uniqueness theorems, or ansatzes are shown reducing to self-definitions or self-citations; the DuplexSLA-Bench is a newly constructed evaluation suite. The derivation remains self-contained as an architectural proposal with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level channel and timeline design choices.

pith-pipeline@v0.9.0 · 5886 in / 1091 out tokens · 56833 ms · 2026-05-21T02:37:45.669806+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone... shared 160 ms chunk timeline... TA4 layout... up to ten action text tokens

IndisputableMonolith/Foundation/DimensionForcing.lean (and headline theorem) · 8-tick period forcing · relation: unclear

Paper passage: Each chunk contributes 2 causal features at an 80 ms stride... four 40 ms discrete assistant audio tokens... cap the action channel at 10 text tokens per chunk

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
cs.CV 2026-06 unverdicted novelty 5.0

Wan-Streamer is a unified Transformer model for low-latency streaming audio-visual interaction that jointly handles perception, reasoning, generation, and timing without external modules.
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
cs.CV 2026-06 unverdicted novelty 5.0

Wan-Streamer presents a unified end-to-end Transformer for low-latency multimodal streaming interaction without external modules.