OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Chaohong Tan; Chong Deng; Hai Yu; Jiaqing Liu; Luyao Cheng; Qian Chen; Qinglin Zhang; Shiliang Zhang; Siqi Zheng; Wen Wang

arxiv: 2410.17799 · v2 · pith:Z6CMF4JKnew · submitted 2024-10-23 · cs.CL · cs.AI · cs.SD · eess.AS


Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chaohong Tan, Zhihao Du, Shiliang Zhang


classification: cs.CL, cs.AI, cs.SD, eess.AS

keywords: dialogue, full-duplex, conversation, omniflatten, systems, backbone, end-to-end, model


Abstract

Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).
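
The abstract does not spell out how the flattening operation works, so the following is only a minimal sketch of one plausible reading: parallel dialogue streams are cut into fixed-size chunks and interleaved into a single token sequence that a standard GPT backbone can consume. The function name `flatten_streams`, the stream names, the toy tokens, and the chunk sizes are all illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Sequence


def flatten_streams(
    streams: Dict[str, Sequence[str]],
    chunk_sizes: Dict[str, int],
) -> List[str]:
    """Interleave fixed-size chunks from parallel streams into one flat sequence.

    Hypothetical helper: the chunking scheme is an assumption for illustration.
    """
    for name in streams:
        if chunk_sizes[name] <= 0:
            raise ValueError(f"chunk size for {name!r} must be positive")

    offsets = {name: 0 for name in streams}
    flat: List[str] = []
    # Cycle through the streams in insertion order until all are exhausted.
    while any(offsets[name] < len(tokens) for name, tokens in streams.items()):
        for name, tokens in streams.items():
            start = offsets[name]
            end = start + chunk_sizes[name]
            flat.extend(tokens[start:end])  # empty slice once a stream runs out
            offsets[name] = min(end, len(tokens))
    return flat


# Toy streams (assumed names); a real system would use discrete token ids.
streams = {
    "user_speech": ["u1", "u2", "u3", "u4"],
    "assistant_text": ["t1", "t2"],
    "assistant_speech": ["a1", "a2", "a3", "a4"],
}
chunk_sizes = {"user_speech": 2, "assistant_text": 1, "assistant_speech": 2}

flat = flatten_streams(streams, chunk_sizes)
print(flat)
```

Because the result is an ordinary one-dimensional sequence, the same next-token training objective could in principle be reused across modalities and tasks, which is the property the abstract attributes to flattening.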



Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
eess.AS 2026-05 unverdicted novelty 7.0

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-...
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
cs.CL 2026-04 unverdicted novelty 7.0

Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
cs.CV 2026-06 unverdicted novelty 6.0

Wan-Streamer is a unified end-to-end Transformer for low-latency streaming audio-visual interaction using block-causal attention on interleaved multimodal tokens.
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
cs.CV 2026-06 unverdicted novelty 5.0

Wan-Streamer presents a unified end-to-end Transformer for low-latency multimodal streaming interaction without external modules.
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
cs.CV 2026-06 unverdicted novelty 5.0

Wan-Streamer is a unified Transformer model for low-latency streaming audio-visual interaction that jointly handles perception, reasoning, generation, and timing without external modules.
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
eess.AS 2026-06 unverdicted novelty 5.0

ModeratorLM conditions a streaming speech LLM on assigned roles for adaptive turn-taking in multi-party settings, reporting over 40% higher precision and 70% higher recall than non-role baselines on real meetings and ...
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
eess.AS 2026-05 unverdicted novelty 5.0

DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
cs.CL 2026-04 conditional novelty 5.0

Adapting Moshi to Hindi with a custom tokenizer and 26k hours of real conversations yields the first open full-duplex spoken dialogue system for an Indian language.
Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
eess.AS 2026-06 unverdicted novelty 4.0

ModeratorLM conditions a chunk-wise streaming speech LLM on assigned roles (with optional CoT) to raise turn-taking precision over 40% and recall over 70% versus non-role baselines on synthetic RolePlayConv data and r...
On The Landscape of Spoken Language Models: A Comprehensive Survey
cs.CL 2025-04 unverdicted novelty 3.0

A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.