super hub Canonical reference

Moshi: a speech-text foundation model for real-time dialogue

Canonical reference. 75% of citing Pith papers cite this work as background.

113 Pith papers citing it
Background 75% of classified citations
abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.
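The abstract describes a per-time-step layout in which a time-aligned text token precedes the audio codec tokens, with the model's own speech and the user's speech kept in parallel streams. The following is a minimal sketch of that layout and of the latency arithmetic, not the authors' implementation. The names `Frame`, `flatten_frame` and `theoretical_latency_ms` are hypothetical, and the constants (8 codebooks, a 12.5 Hz frame rate, a one-frame acoustic delay) are illustrative assumptions that the abstract does not state.

```python
from dataclasses import dataclass
from typing import List

# Illustrative assumptions, NOT stated in the abstract above.
NUM_CODEBOOKS = 8          # assumed number of residual quantizer levels per stream
FRAME_RATE_HZ = 12.5       # assumed codec frame rate
ACOUSTIC_DELAY_FRAMES = 1  # assumed delay (in frames) before audio can be emitted


@dataclass
class Frame:
    """Tokens for a single time step (hypothetical container)."""
    text_token: int        # time-aligned text token ("Inner Monologue")
    own_audio: List[int]   # codec tokens for the model's own speech
    user_audio: List[int]  # codec tokens for the user's speech


def flatten_frame(frame: Frame) -> List[int]:
    """Order one time step as: text prefix, own audio stream, user audio stream."""
    if len(frame.own_audio) != NUM_CODEBOOKS or len(frame.user_audio) != NUM_CODEBOOKS:
        raise ValueError(f"each audio stream needs {NUM_CODEBOOKS} tokens")
    return [frame.text_token, *frame.own_audio, *frame.user_audio]


def theoretical_latency_ms(
    frame_rate_hz: float = FRAME_RATE_HZ,
    delay_frames: int = ACOUSTIC_DELAY_FRAMES,
) -> float:
    """One frame to receive the input plus `delay_frames` frames of delay."""
    frame_ms = 1000.0 / frame_rate_hz
    return frame_ms * (1 + delay_frames)


if __name__ == "__main__":
    frame = Frame(text_token=7,
                  own_audio=list(range(8)),
                  user_audio=list(range(100, 108)))
    print(len(flatten_frame(frame)))   # 1 text token + 2 streams of 8 tokens
    print(theoretical_latency_ms())
```

Under these assumed constants, each time step carries 17 tokens and the latency function returns 160.0 ms, which matches the theoretical figure quoted in the abstract. Different codebook counts or frame rates would change both numbers.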

citation-role summary

background 13 baseline 3

claims ledger

  • background -✗✓ ✓ SpeechVerse [113] May 2024 Flan-T5-XL 3B EN Contin. -✗✓ ✓ GAMA [114] Jun 2024 LLaMA2-7B 7B EN Contin. 2.2M audio-caption pairs✗✓ ✓ Qwen2-Audio [18] Jul 2024 Qwen-7B 7B Multi. Contin. 520K Hrs audio✗✓ ✓ FunAudioLLM [115] Jul 2024 - - Multi. - -✗✓ ✓ Mini-Omni [116] Aug 2024 Qwen2-0.5B 0.5B - Discrete 8K Hrs speech + 2M text examples✓ ✓ ✓ Moshi [117] Sep 2024 Helium 7B EN Discrete 7M Hrs audio + 2.1T text tokens✓ ✓ ✓ LLaMA-Omni [118] Sep 2024 Llama-3.1-8B-Instruct 8B EN Contin. -✗✓ ✓ Parrot [
  • background components: a speech module to generate natural voice, and a visual module to synthesize realistic motion, including facial expressions and gestures, in either 2D or 3D form. Speech dialog research splits into two paradigms: modular and end-to-end. Modular systems pair an LLM with a TTS/vocoder: the LLM emits text or semantic tokens, and TTS renders waveforms [19, 22-24, 74, 88, 128]. End-to-end voice agents integrate understanding and speech generation in a single model, tightly coupling semantics, pr
  • baseline We compare WavCodec against representative low-bitrate neural codecs and tokenizers. These baselines can be roughly divided into two groups. The first group does not explicitly introduce semantic supervision, including SimCodec [60], BigCodec [56], and WavTokenizer [19]. The second group incorporates semantic constraints from pretrained speech models, including Mimi [7], XY-Tokenizer [12], X-codec2 [61], and BiCodec [47]. While high-fidelity codecs such as DAC [24] and EnCodec [6] can achi
  • background bines them with text tokens as input to the LLM backbone. During audio tokenization, LALMs extract acoustic features from raw audio signals and then apply vector quantization techniques to derive discrete audio tokens. Meanwhile, the LLM backbone extends its vocabulary and embedding matrix to accommodate audio tokens. Instead of audio tokenization, the continuous-feature scheme [32]-[43] directly aligns audio and text inputs within a unified embedding space. Such LALMs project acoustic features
  • baseline These methods are lightweight and fast, but primarily capture speech presence rather than communicative intent, making them prone to false triggers from backchannels, hesitations, or background noise. The second group introduces explicit turn prediction modules using learned models. Representative examples include Smart Turn, TEN Turn Detection, and Easy Turn [14]. The second approach enhances conversational intent detection by leveraging learned models and text-based cues, making it more adap
  • background [arXiv:2604.08363v1 [cs.SD], 9 Apr 2026; Xiaosu Su, Zihan Sun, Peilei Jia, and Jun Gao] This requires modeling both stable speaker timbre and turn-level expressive variation: the former is better characterized by speaker-level global conditioning, while the latter needs to be explicitly modeled at the current turn [7, 28, 40]. A central open problem, however, is how to preserve a satisfactorily designed timbre for ongoing generation. Once a desired voice

representative citing papers

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

Interleaved Speech Language Models Latently Work In Text

cs.CL · 2026-06-21 · unverdicted · novelty 7.0

Interleaved SLMs implicitly transcribe spoken words to text tokens in middle layers (top candidate for 77% of data) before predicting in text space and returning to speech.

LLMs Need Encoders for Semantic IDs Too

cs.IR · 2026-05-29 · unverdicted · novelty 7.0

PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Liberating LLM Capabilities in Full-Duplex Speech Models

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.
