Moshi: a speech-text foundation model for real-time dialogue

Manu Orsini · 2024 · eess.AS · arXiv 2410.00037

Canonical reference. 75% of citing Pith papers cite this work as background.

108 Pith papers citing it

Background: 75% of classified citations

abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.
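
As a rough sanity check on the 160 ms theoretical latency quoted above, the sketch below assumes a codec frame rate of 12.5 Hz (80 ms per frame) and a one-frame delay between the semantic and acoustic tokens. Both values are assumptions recalled from the Moshi paper and are not stated on this page, and the function names are hypothetical.

```python
# Minimal sketch of the latency arithmetic, under stated assumptions.
# ASSUMPTION: the audio codec emits 12.5 frames per second (not stated on this page).
# ASSUMPTION: acoustic tokens are delayed by one frame (not stated on this page).
FRAME_RATE_HZ = 12.5
ACOUSTIC_DELAY_FRAMES = 1


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """One frame to buffer the input plus the assumed acoustic delay."""
    return frame_duration_ms(frame_rate_hz) * (1 + delay_frames)


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_RATE_HZ))  # 80.0
    print(theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES))  # 160.0
```

Under these assumptions the result matches the abstract's theoretical figure. The gap to the reported 200 ms in practice is not modeled here.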

citation-role summary

background 13 · baseline 3

citation-polarity summary

background 12 · baseline 3 · support 1

claims ledger

background
- ✗ ✓ ✓
SpeechVerse [113] · May 2024 · Flan-T5-XL · 3B · EN · Contin. · - · ✗ ✓ ✓
GAMA [114] · Jun 2024 · LLaMA2-7B · 7B · EN · Contin. · 2.2M audio-caption pairs · ✗ ✓ ✓
Qwen2-Audio [18] · Jul 2024 · Qwen-7B · 7B · Multi. · Contin. · 520K Hrs audio · ✗ ✓ ✓
FunAudioLLM [115] · Jul 2024 · - · - · Multi. · - · - · ✗ ✓ ✓
Mini-Omni [116] · Aug 2024 · Qwen2-0.5B · 0.5B · - · Discrete · 8K Hrs speech + 2M text examples · ✓ ✓ ✓
Moshi [117] · Sep 2024 · Helium · 7B · EN · Discrete · 7M Hrs audio + 2.1T text tokens · ✓ ✓ ✓
LLaMA-Omni [118] · Sep 2024 · Llama-3.1-8B-Instruct · 8B · EN · Contin. · - · ✗ ✓ ✓
Parrot [
background components: a speech module to generate natural voice, and a visual module to synthesize realistic motion, including facial expressions and gestures, in either 2D or 3D form. Speech dialog research splits into two paradigms: modular and end-to-end. Modular systems pair an LLM with a TTS/vocoder: LLM emits text or semantic tokens, TTS renders waveforms [19, 22-24, 74, 88, 128]. End-to-end voice agents integrate understanding and speech generation in a single model, tightly coupling semantics, pr
baseline We compare WavCodec against representative low-bitrate neural codecs and tokenizers. These baselines can be roughly divided into two groups. The first group does not explicitly introduce semantic supervision, including SimCodec [60], BigCodec [56], and WavTokenizer [19]. The second group incorporates semantic constraints from pretrained speech models, including Mimi [7], XY-Tokenizer [12], X-codec2 [61], and BiCodec [47]. While high-fidelity codecs such as DAC [24] and EnCodec [6] can achi
background bines them with text tokens as input to the LLM backbone. During audio tokenization, LALMs extract acoustic features from raw audio signals and then apply vector quantization techniques to derive discrete audio tokens. Meanwhile, the LLM backbone extends its vocabulary and embedding matrix to accommodate audio tokens. Instead of audio tokenization, the continuous-feature scheme [32]-[43] directly aligns audio and text inputs within a unified embedding space. Such LALMs project acoustic features
baseline These methods are lightweight and fast, but primarily capture speech presence rather than communicative intent, making them prone to false triggers from backchannels, hesitations, or background noise. The second group introduces explicit turn prediction modules using learned models. Representative examples include Smart Turn, TEN Turn Detection, and Easy Turn [14]. The second approach enhances conversational intent detection by leveraging learned models and text-based cues, making it more adap
background This requires modeling both stable speaker timbre and turn-level expressive variation: the former is better characterized by speaker-level global conditioning, while the latter needs to be explicitly modeled at the current turn [7, 28, 40]. A central open problem, however, is how to preserve a satisfactorily designed timbre for ongoing generation. Once a desired voice

authors

Alexandre Défossez · Amélie Royer · Hervé Jégou · Laurent Mazaré · Manu Orsini · Patrick Pérez

representative citing papers

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

eess.AS · 2026-06-17 · accept · novelty 7.0

A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

LLMs Need Encoders for Semantic IDs Too

cs.IR · 2026-05-29 · unverdicted · novelty 7.0

PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.

TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

TokTalk trains a chunk-based conditional flow matching model on a new audio-token to 3D facial motion dataset to enable real-time expressive facial animation from Audio-LLM tokens with low overhead adaptation.

RealityTest: How People Probe AI Identity and Whether Models Disclose It

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RealityTest is a human-grounded multilingual multimodal benchmark showing that only 31% of people ask AI identity directly and that suppression instructions plus question phrasing dominate disclosure behavior over model choice.

AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ

cs.SD · 2026-05-22 · unverdicted · novelty 7.0

AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

cs.SD · 2026-05-19 · unverdicted · novelty 7.0

ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

eess.IV · 2026-05-07 · unverdicted · novelty 7.0

LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constrained sensors across modalities.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Liberating LLM Capabilities in Full-Duplex Speech Models

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.

SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

eess.AS · 2026-04-29 · unverdicted · novelty 7.0

Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

citing papers explorer

Showing 50 of 108 citing papers.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
cs.CV · 2026-05-28 · unverdicted · none · ref 10
VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.
Privacy Auditing with Zero (0) Training Run
cs.CR · 2026-05-14 · unverdicted · none · ref 7
Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
cs.CL · 2026-07-02 · unverdicted · none · ref 8
SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
cs.SD · 2026-06-30 · unverdicted · none · ref 195
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine
eess.AS · 2026-06-17 · accept · none · ref 1
A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.
Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models
cs.SD · 2026-06-09 · unverdicted · none · ref 19
Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
cs.CL · 2026-06-09 · unverdicted · none · ref 13
FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
cs.CV · 2026-06-01 · unverdicted · none · ref 14 · 2 links
X-Stream benchmark shows SOTA MLLMs score ~50% on concurrent multi-stream tasks and lack proactive ability, using a dual-verification pipeline to avoid single-stream bias.
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
cs.CL · 2026-05-31 · unverdicted · none · ref 27
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
LLMs Need Encoders for Semantic IDs Too
cs.IR · 2026-05-29 · unverdicted · none · ref 8
PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.
TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens
cs.CV · 2026-05-29 · unverdicted · none · ref 8
TokTalk trains a chunk-based conditional flow matching model on a new audio-token to 3D facial motion dataset to enable real-time expressive facial animation from Audio-LLM tokens with low overhead adaptation.
RealityTest: How People Probe AI Identity and Whether Models Disclose It
cs.CL · 2026-05-29 · unverdicted · none · ref 13
RealityTest is a human-grounded multilingual multimodal benchmark showing that only 31% of people ask AI identity directly and that suppression instructions plus question phrasing dominate disclosure behavior over model choice.
AffectCodec: Emotion-Preserving Neural Speech Codec with Block-Diagonal Residual FSQ
cs.SD · 2026-05-22 · unverdicted · none · ref 8
AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
eess.AS · 2026-05-20 · unverdicted · none · ref 3 · 2 links
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.
Codec-Robust Attacks on Audio LLMs
cs.SD · 2026-05-19 · unverdicted · none · ref 22 · 2 links
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning
cs.SD · 2026-05-19 · unverdicted · none · ref 43
ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
cs.SD · 2026-05-11 · unverdicted · none · ref 16
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL · 2026-05-11 · unverdicted · none · ref 7
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL · 2026-05-07 · unverdicted · none · ref 46
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
eess.IV · 2026-05-07 · unverdicted · none · ref 29
LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constrained sensors across modalities.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG · 2026-05-07 · unverdicted · none · ref 14 · 2 links
PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
Liberating LLM Capabilities in Full-Duplex Speech Models
cs.CL · 2026-05-04 · unverdicted · none · ref 4
LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.
SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
eess.AS · 2026-04-29 · unverdicted · none · ref 9
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
cs.CL · 2026-04-22 · unverdicted · none · ref 8
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS · 2026-04-21 · unverdicted · none · ref 184
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR · 2026-04-16 · unverdicted · none · ref 32
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS · 2026-04-13 · unverdicted · none · ref 27
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
cs.SD · 2026-04-09 · unverdicted · none · ref 7
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
TiCo: Time-Controllable Spoken Dialogue Model
cs.CL · 2026-03-23 · unverdicted · none · ref 59
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
cs.CL · 2025-12-29 · accept · none · ref 9
Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
cs.CV · 2025-12-16 · unverdicted · none · ref 19
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
eess.AS · 2025-09-30 · unverdicted · none · ref 24
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
VoiceBench: Benchmarking LLM-Based Voice Assistants
cs.CL · 2024-10-22 · unverdicted · none · ref 65
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation
eess.AS · 2026-07-02 · unverdicted · none · ref 21
Extends vLLM with delay-pattern de-interleaving, multi-stream sampling, and co-scheduled CFG to achieve 80% of non-CFG throughput for unified audio tasks while open-sourcing the pipeline.
TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue
cs.CL · 2026-07-01 · unverdicted · none · ref 4
TurnNat introduces a likelihood-based automatic evaluation method for turn-taking naturalness in dyadic spoken dialogues using a causal prediction model and a human-validated perturbation benchmark.
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation
eess.AS · 2026-06-29 · unverdicted · none · ref 21
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices
cs.AR · 2026-06-29 · unverdicted · none · ref 1 · 2 links
COSM enables concurrent PIM and CPU execution on mobiles via low-interference control and idleness-aware scheduling, delivering up to 2.8x PIM throughput with under 2% CPU slowdown.
FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars
cs.AI · 2026-06-29 · unverdicted · none · ref 10
FacePlex introduces a unified streaming model with Rolling Flow Matching and Rolling Cross-Attention to enable full-duplex joint real-time generation of speech and facial motion tokens.
Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models
cs.CV · 2026-06-23 · unverdicted · none · ref 12
Wan-Streamer is a unified end-to-end Transformer for low-latency streaming audio-visual interaction using block-causal attention on interleaved multimodal tokens.
Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction
cs.SD · 2026-06-16 · unverdicted · none · ref 10
Next-Turn introduces time-to-next-speech-onset prediction for duration-aware streaming endpoint detection, reporting a 25.9% improvement in accuracy within 320 ms.
NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation
cs.CL · 2026-06-11 · unverdicted · none · ref 37
A fluency-aware optimization framework is introduced to minimize inter-chunk silences in simultaneous speech-to-speech translation by leveraging model-internal signals including linguistic diversity and temporal variability.
Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment
cs.SD · 2026-06-11 · unverdicted · none · ref 97
Self-guidance adds a lightweight feature-mapping loss to align decoder manifolds in VQ-VAE speech codecs, raising reconstruction metrics and allowing 4x codebook reduction with no fidelity loss.
Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency
eess.AS · 2026-06-10 · unverdicted · none · ref 30
Causal-anticausal consistency co-training recovers about 70% of the boundary-tightening effect possible with ideal tight labels in speaker diarization.
Benchmarking Neural Speech Compression from a Rate-Distortion Perspective
eess.AS · 2026-06-10 · unverdicted · none · ref 12
ECC integrates hyperprior side information, channel-wise context, latent residual prediction, temporal modeling, and entropy skip into a learned entropy model, yielding 39.9% and 76.3% average BD-rate reductions on ViSQOL and PESQ over baselines.
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
cs.CL · 2026-06-09 · unverdicted · none · ref 13
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
TRADE: Transducer-Augmented Decoder for Speech LLM
cs.CL · 2026-06-07 · unverdicted · none · ref 7
TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.
HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec
cs.SD · 2026-06-04 · unverdicted · none · ref 10
HybridCodec unifies SSL distillation and dual-stream design in a neural audio codec for improved semantic specialization, competitive reconstruction, and faster inference.
Audio Interaction Model
cs.SD · 2026-06-03 · unverdicted · none · ref 9
Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
cs.SD · 2026-06-03 · unverdicted · none · ref 27
CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.
DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction
cs.CV · 2026-06-02 · unverdicted · none · ref 4
DyaPlex introduces a dual-tower Transformer that adds a streaming motion pathway to a frozen full-duplex speech model using dyadic token interleaving and time-aligned RoPE for synchronized multimodal dyadic interaction.
