hub Canonical reference

Moshi: a speech-text foundation model for real-time dialogue

· 2024 · eess.AS · arXiv 2410.00037

Canonical reference. 75% of citing Pith papers cite this work as background.

86 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 86 citing papers arXiv PDF

abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 baseline 3

citation-polarity summary

background 12 baseline 3 support 1

representative citing papers

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

LLMs Need Encoders for Semantic IDs Too

cs.IR · 2026-05-29 · unverdicted · novelty 7.0

PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.

TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

TokTalk trains a chunk-based conditional flow matching model on a new audio-token to 3D facial motion dataset to enable real-time expressive facial animation from Audio-LLM tokens with low overhead adaptation.

RealityTest: How People Probe AI Identity and Whether Models Disclose It

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RealityTest is a human-grounded multilingual multimodal benchmark showing that only 31% of people ask AI identity directly and that suppression instructions plus question phrasing dominate disclosure behavior over model choice.

AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ

cs.SD · 2026-05-22 · unverdicted · novelty 7.0

AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

cs.SD · 2026-05-19 · unverdicted · novelty 7.0

ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

eess.IV · 2026-05-07 · unverdicted · novelty 7.0

LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constrained sensors across modalities.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Liberating LLM Capabilities in Full-Duplex Speech Models

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

citing papers explorer

Showing 36 of 86 citing papers.

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer cs.CV · 2026-04-07 · unverdicted · none · ref 19 · internal anchor
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
Qwen3-TTS Technical Report cs.SD · 2026-01-22 · unverdicted · none · ref 4 · internal anchor
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs cs.LG · 2025-12-11 · unverdicted · none · ref 2 · internal anchor
Using properties of positional embeddings, reasoning LLMs can be made to think, listen, and generate outputs asynchronously without any additional training, cutting time to first token to under 5 seconds.
Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models cs.CL · 2025-10-10 · unverdicted · none · ref 5 · internal anchor
MPS proposes a dual-brain architecture separating formulation reasoning from articulation to achieve real-time CoT in SLMs with accuracy comparable to full pre-computation but much lower latency.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 20 · internal anchor
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
Training-Free Multimodal Large Language Model Orchestration cs.CL · 2025-08-06 · unverdicted · none · ref 22 · 2 links · internal anchor
LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 16 · internal anchor
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching eess.AS · 2025-07-12 · conditional · none · ref 38 · internal anchor
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics cs.LG · 2025-06-02 · unverdicted · none · ref 15 · internal anchor
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction cs.CL · 2025-02-17 · unverdicted · none · ref 40 · internal anchor
Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents eess.AS · 2026-06-11 · unverdicted · none · ref 13 · internal anchor
ModeratorLM conditions a streaming speech LLM on assigned roles for adaptive turn-taking in multi-party settings, reporting over 40% higher precision and 70% higher recall than non-role baselines on real meetings and a new synthetic dataset.
Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning cs.SD · 2026-05-30 · unverdicted · none · ref 16 · internal anchor
Sympatheia introduces a continuous affect-conditioned speech dialogue model and the Sympatheia-18k synthetic dataset, showing improved emotional appropriateness over baselines when speech cues are limited.
Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models cs.CL · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
Full-duplex SDMs show strong representational synchronization that peaks near zero lag and degrades with noise, with internal states encoding anticipatory turn-taking cues detectable ahead of time.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 117 · internal anchor
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
SAME: A Semantically-Aligned Music Autoencoder cs.SD · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
SAME is a semantically regularized transformer autoencoder for music that delivers 4096x compression with open-weights release of large and small variants.
Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization eess.AS · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
Task-dependent simulation strategies for synthetic conversational data allow synthetic-only training to approach real-data baselines for multi-talker ASR and diarization, with mixing yielding further gains.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM cs.CL · 2026-05-07 · unverdicted · none · ref 3 · 2 links · internal anchor
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge eess.AS · 2026-04-23 · unverdicted · none · ref 12 · internal anchor
A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
Sema: Semantic Transport for Real-Time Multimodal Agents cs.MM · 2026-04-22 · unverdicted · none · ref 11 · internal anchor
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
Voxtral TTS cs.AI · 2026-03-26 · unverdicted · none · ref 5 · internal anchor
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification cs.CL · 2025-12-08 · unverdicted · none · ref 9 · internal anchor
Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment cs.CL · 2025-08-25 · unverdicted · none · ref 17 · internal anchor
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 14 · internal anchor
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement cs.SD · 2026-06-01 · unverdicted · none · ref 5 · internal anchor
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
MOSS-Audio Technical Report cs.SD · 2026-06-01 · unverdicted · none · ref 51 · internal anchor
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables eess.AS · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
MELD jointly optimizes a discrete latent variable encoder on mel-spectrograms with an autoregressive speech LM, claiming gains over codec and mel baselines on zero-shot TTS/STT plus fewer autoregressive artifacts.
PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding eess.AS · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
PlanRAG-Audio introduces planning-based retrieval-augmented generation to improve accuracy and stability of long-form audio understanding in LALMs by decoupling model input from raw audio duration.
UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars cs.GR · 2026-05-14 · unverdicted · none · ref 15 · internal anchor
UMo presents a sparse MoE-based unified model for real-time co-speech avatar animation that claims superior quality under latency constraints via keyframe-centric design and multi-stage audio-augmented training.
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory cs.AI · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction cs.CV · 2025-01-03 · conditional · none · ref 36 · internal anchor
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
Screening Matters: A Comparative Study of Conventional and Crowdsourced Listening Tests eess.AS · 2026-06-26 · unverdicted · none · ref 24 · internal anchor
Screening methods based on anchor ordering, rating span, traps, and gold standard questions improve the reliability of crowdsourced listening tests for speech codecs relative to conventional lab tests.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 54 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
On The Landscape of Spoken Language Models: A Comprehensive Survey cs.CL · 2025-04-11 · unverdicted · none · ref 12 · internal anchor
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations cs.CL · 2026-04-25 · unreviewed · ref 4 · internal anchor
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection cs.SD · 2026-04-02 · unreviewed · ref 14 · internal anchor
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning eess.AS · 2026-03-18 · unreviewed · ref 9 · 2 links · internal anchor

Moshi: a speech-text foundation model for real-time dialogue

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer