End-to-end Listen, Look, Speak and Act

Chao Zhang; Jun Zhang; Lu Lu; Siyin Wang; Wenyi Yu; Xianzhao Chen; Xiaohai Tian

arxiv: 2510.16756 · v2 · submitted 2025-10-19 · 💻 cs.AI · cs.CL · cs.CV · cs.RO · eess.AS

Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Chao Zhang

Pith reviewed 2026-05-18 06:39 UTC · model grok-4.3

keywords multimodal models · full-duplex interaction · end-to-end AI · mixture of experts · speech and action · vision-language-action

The pith

ELLSA is the first full-duplex end-to-end model that perceives and generates across vision, text, speech, and action in one architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ELLSA as a single system that can listen to speech, observe scenes, speak responses, and execute actions at the same time. This setup aims to support natural turn-taking, interruptions, and concurrent behaviors that separate models cannot handle together. The core design uses a Self-Attention Mixture-of-Experts structure to direct each input type to its own experts and combine them in a shared attention system. If the approach holds, it opens the door to interactive agents that manage real-world dialogues and physical tasks without switching between separate modules.

Core claim

ELLSA is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture. Its SA-MoE backbone routes each modality to specialized experts and fuses them through unified attention while reusing strong pre-trained components, allowing it to match separate baselines on speech and manipulation tasks while adding full-duplex capabilities such as turn-taking, barge-ins, and speaking while acting.

What carries the argument

SA-MoE (Self-Attention Mixture-of-Experts) architecture that routes modalities to specialized experts and fuses them via a unified attention backbone.
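A minimal sketch of how such a layer could be wired, offered as one possible reading rather than the authors' implementation. It assumes hard routing by modality tag and experts that own only the query, key, and value projections; the abstract specifies neither, and the names `make_expert` and `sa_moe_layer` are hypothetical.

```python
import math
import random


def matvec(weights, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: its own Q, K, V projections (an assumption)."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {"q": mat(), "k": mat(), "v": mat()}


def sa_moe_layer(tokens, experts):
    """tokens: list of (modality, vector) pairs in temporal order."""
    dim = len(tokens[0][1])
    # Route: each token is projected by the expert of its own modality.
    q = [matvec(experts[m]["q"], x) for m, x in tokens]
    k = [matvec(experts[m]["k"], x) for m, x in tokens]
    v = [matvec(experts[m]["v"], x) for m, x in tokens]
    outputs = []
    # Fuse: one causal attention pass over all tokens, whatever their modality.
    for i in range(len(tokens)):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
DIM = 4
MODALITIES = ("speech", "vision", "text", "action")
experts = {name: make_expert(DIM, rng) for name in MODALITIES}
tokens = [(name, [rng.uniform(-1, 1) for _ in range(DIM)]) for name in MODALITIES]
fused = sa_moe_layer(tokens, experts)
```

In this toy version the only place modalities interact is the shared attention step, which is where any cross-modal interference would have to show up.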

Load-bearing premise

The SA-MoE structure can route modalities to separate experts, fuse them without major interference, and still match the performance of dedicated single-modality models.

What would settle it

A side-by-side test on a robot manipulation task with speech interruptions where ELLSA shows a clear drop in action success rate or speech coherence compared with separate vision-action and speech models.

Original abstract

Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach, yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released at https://github.com/bytedance/SALMONN/tree/ELLSA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ELLSA brings a new SA-MoE architecture for unified full-duplex multimodal interaction, though validation of its integration benefits is limited.

ELLSA claims to be the first full-duplex end-to-end system that perceives and generates across vision, text, speech, and action all at once. The central new piece is the SA-MoE architecture, which routes each modality to its own experts and then combines them with a shared attention backbone. The paper shows that this setup can keep up with specialized models on speech interaction and robot manipulation benchmarks. It also supports behaviors that feel more natural, such as handling interruptions, rejecting flawed instructions, and speaking while performing actions. The plan to release code, data, and checkpoints is a real help for anyone wanting to reproduce or extend the work.

Where it falls short is in proving that the routing and fusion work as intended without problems. The experiments report overall results but skip detailed measurements of interference between modalities or ablations that isolate the contribution of the expert specialization. This makes it harder to judge if the architecture truly mitigates the issues it sets out to solve.

This kind of paper is for teams working on unified multimodal models, especially those interested in robotics or advanced conversational agents. A reader focused on full-duplex interaction would get concrete ideas from the architecture and the demonstrated capabilities.

It deserves a serious referee. The work is ambitious and the open-sourcing lowers the barrier for verification. I recommend sending it to peer review with suggestions to add more targeted experiments on the MoE components.

Referee Report

2 major / 2 minor

Summary. The paper presents ELLSA, claimed to be the first full-duplex end-to-end model simultaneously perceiving and generating across vision, text, speech, and action in a single architecture. Its core is a novel SA-MoE (Self-Attention Mixture-of-Experts) that routes modalities to specialized experts and fuses them via a unified attention backbone, leveraging pre-trained components to match modality-specific baselines on speech-interaction and robot-manipulation benchmarks while enabling new behaviors including dialogue turn-taking, action barge-ins, speaking-while-acting, and context-grounded VQA.

Significance. If the central claims hold, this would represent a meaningful advance toward unified multimodal interactive systems capable of natural full-duplex human-like behaviors, with the open release of code, data, and checkpoints providing a concrete contribution to the field.

major comments (2)

[Experiments and Results] The manuscript asserts that SA-MoE successfully routes modalities without significant interference and matches baselines, yet the experimental results primarily report aggregate benchmark performance rather than targeted ablations or interference metrics (e.g., speech quality degradation when vision is active or expert specialization tests). This leaves the load-bearing assumption about efficient integration unvalidated.
[Experiments and Results] Claims of enabling previously unreachable interaction patterns (dialogue turn-taking, action barge-ins, defective instruction rejection) are stated without quantitative results, error bars, or detailed evaluation protocols, making it difficult to assess whether these behaviors are robustly demonstrated beyond qualitative examples.
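As an illustration of the error bars the second comment asks for, a success rate over a small number of interaction trials can be reported with a Wilson score interval. This is a sketch under assumed numbers: the trial counts below are invented for the example and do not come from the paper, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 18 successful barge-in trials out of 20.
low, high = wilson_interval(18, 20)
print(f"success rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here instead of the simpler normal approximation because it stays inside [0, 1] and behaves sensibly when the trial count is small or the rate is near 0 or 1.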

minor comments (2)

[Abstract] The abstract and introduction use 'to our knowledge' for the 'first full-duplex' claim; adding explicit comparison to prior multimodal or full-duplex works would strengthen positioning.
[Method] Notation for the SA-MoE routing mechanism and unified attention backbone could be clarified with a diagram or pseudocode in the methods section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper to incorporate additional experimental details where possible.

Referee: The manuscript asserts that SA-MoE successfully routes modalities without significant interference and matches baselines, yet the experimental results primarily report aggregate benchmark performance rather than targeted ablations or interference metrics (e.g., speech quality degradation when vision is active or expert specialization tests). This leaves the load-bearing assumption about efficient integration unvalidated.

Authors: We agree that the current presentation emphasizes aggregate results and that targeted ablations would more directly validate the routing and integration claims. In the revised version we will add: (i) expert activation histograms demonstrating modality-specific specialization, (ii) interference measurements (e.g., speech WER and perceptual quality scores with versus without concurrent vision input), and (iii) an ablation comparing SA-MoE against a non-routed unified backbone. These new analyses will be reported with the same benchmark settings used in the original experiments.

revision: yes

Referee: Claims of enabling previously unreachable interaction patterns (dialogue turn-taking, action barge-ins, defective instruction rejection) are stated without quantitative results, error bars, or detailed evaluation protocols, making it difficult to assess whether these behaviors are robustly demonstrated beyond qualitative examples.

Authors: The interaction patterns are emergent capabilities shown via qualitative demonstrations because they lack standardized quantitative benchmarks. We will nevertheless strengthen this section by adding explicit evaluation protocols (including task definitions, success criteria, and human rating scales), success-rate statistics collected over multiple interaction sessions, and error bars from repeated trials. Where full quantitative metrics remain difficult to define, we will clearly state the limitations of the current evidence.

revision: partial
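The interference measurement promised in the first response (speech WER with versus without concurrent vision input) could be computed along the following lines. This is a sketch, not the authors' protocol: word error rate is the standard word-level edit distance divided by reference length, while `interference_gap` and the transcripts are hypothetical and invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def interference_gap(references, speech_only, with_vision):
    """Mean WER with vision active minus mean WER without it (hypothetical metric)."""
    def mean_wer(hyps):
        return sum(word_error_rate(r, h) for r, h in zip(references, hyps)) / len(references)
    return mean_wer(with_vision) - mean_wer(speech_only)


# Invented transcripts, for illustration only.
refs = ["pick up the red cup", "put it on the shelf"]
no_vision = ["pick up the red cup", "put it on the shelf"]
vision_on = ["pick the blue cup", "put it on the shelf"]
gap = interference_gap(refs, no_vision, vision_on)
```

A gap near zero would be consistent with the claim that adding vision does not degrade speech; a clearly positive gap would point to interference.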

Circularity Check

0 steps flagged

No circularity: architecture and claims rest on external benchmarks and pre-trained components

The manuscript describes the SA-MoE architecture and reports empirical performance on speech-interaction and robot-manipulation benchmarks. All central claims reference modality-specific baselines and pre-trained components rather than deriving results from parameters or definitions internal to the paper. No equations, uniqueness theorems, or predictions reduce to self-defined inputs or self-citation chains. The work is therefore self-contained and independently falsifiable against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the SA-MoE is presented as a novel integration technique without technical derivation details.

pith-pipeline@v0.9.0 · 5810 in / 1079 out tokens · 33613 ms · 2026-05-18T06:39:32.205879+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · relation: unclear

Paper passage: ELLSA operates on a one-second time block, within which it processes one second of speech input and a single video frame, generates eight tokens of text output
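The time-block passage quoted above can be sketched as a simple loop. Only the one-second block, the single frame per block, and the eight text tokens come from the passage; the 16 kHz sample rate, the `step` callback standing in for the model, and all names are assumptions. The passage is cut off, so any further per-block outputs are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

BLOCK_SECONDS = 1          # from the quoted passage
TEXT_TOKENS_PER_BLOCK = 8  # from the quoted passage


@dataclass
class TimeBlock:
    index: int
    speech_in: Sequence[float]  # one second of audio samples
    frame: object               # the single video frame for this block
    text_out: List[str]         # text tokens emitted during this block


def run_blocks(audio: Sequence[float], frames: Sequence[object], sample_rate: int,
               step: Callable[[Sequence[float], object], List[str]]) -> List[TimeBlock]:
    """Slice inputs into one-second blocks and call a (hypothetical) model step on each."""
    samples_per_block = sample_rate * BLOCK_SECONDS
    n_blocks = min(len(audio) // samples_per_block, len(frames))
    blocks = []
    for i in range(n_blocks):
        chunk = audio[i * samples_per_block:(i + 1) * samples_per_block]
        text = step(chunk, frames[i])
        if len(text) != TEXT_TOKENS_PER_BLOCK:
            raise ValueError("each block must emit exactly eight text tokens")
        blocks.append(TimeBlock(i, chunk, frames[i], text))
    return blocks


def dummy_step(chunk, frame):
    """Placeholder for the model: returns eight dummy tokens."""
    return [f"tok{k}" for k in range(TEXT_TOKENS_PER_BLOCK)]


# Three seconds of silent audio at an assumed 16 kHz, with three placeholder frames.
blocks = run_blocks([0.0] * 48000, ["f0", "f1", "f2"], 16000, dummy_step)
```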

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.