OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
read the original abstract
Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).
This paper has not been read by Pith yet.
Forward citations
Cited by 11 Pith papers
-
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-...
-
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
-
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Wan-Streamer is a unified end-to-end Transformer for low-latency streaming audio-visual interaction using block-causal attention on interleaved multimodal tokens.
-
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Wan-Streamer presents a unified end-to-end Transformer for low-latency multimodal streaming interaction without external modules.
-
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
Wan-Streamer is a unified Transformer model for low-latency streaming audio-visual interaction that jointly handles perception, reasoning, generation, and timing without external modules.
-
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
ModeratorLM conditions a streaming speech LLM on assigned roles for adaptive turn-taking in multi-party settings, reporting over 40% higher precision and 70% higher recall than non-role baselines on real meetings and ...
-
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
-
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
Adapting Moshi to Hindi with a custom tokenizer and 26k hours of real conversations yields the first open full-duplex spoken dialogue system for an Indian language.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
-
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
ModeratorLM conditions a chunk-wise streaming speech LLM on assigned roles (with optional CoT) to raise turn-taking precision over 40% and recall over 70% versus non-role baselines on synthetic RolePlayConv data and r...
-
On The Landscape of Spoken Language Models: A Comprehensive Survey
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.