Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Title resolution pending
14 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
BEARD adapts Whisper encoder for ATC domain via BEST-RQ and distillation on 5000h unlabeled speech then 2h labeled fine-tuning, delivering 12% relative WER gain over fine-tuned baseline.
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
Task-dependent simulation strategies for synthetic conversational data allow synthetic-only training to approach real-data baselines for multi-talker ASR and diarization, with mixing yielding further gains.
UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.
Prioritizing longest utterances in SSL speech pre-training data outperforms random or diversity-based sampling for ASR performance while using half the data volume.
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
Dolphin-CN-Dialect is a compact ASR model that boosts Chinese dialect accuracy through balanced sampling of rare dialects and character-level tokenization while staying smaller than recent open-source competitors.
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.
citing papers explorer
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
BlasBench: An Open Benchmark for Irish Speech Recognition
BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.
-
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
-
BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation
BEARD adapts Whisper encoder for ATC domain via BEST-RQ and distillation on 5000h unlabeled speech then 2h labeled fine-tuning, delivering 12% relative WER gain over fine-tuned baseline.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
AudioPaLM: A Large Language Model That Can Speak and Listen
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
-
Mind the Gap: Impact of Synthetic Conversational Data on Multi-Talker ASR and Speaker Diarization
Task-dependent simulation strategies for synthetic conversational data allow synthetic-only training to approach real-data baselines for multi-talker ASR and diarization, with mixing yielding further gains.
-
UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.
-
A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models
Prioritizing longest utterances in SSL speech pre-training data outperforms random or diversity-based sampling for ASR performance while using half the data volume.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
-
Dolphin-CN-Dialect: Where Chinese Dialects Matter
Dolphin-CN-Dialect is a compact ASR model that boosts Chinese dialect accuracy through balanced sampling of rare dialects and character-level tokenization while staying smaller than recent open-source competitors.
-
On The Landscape of Spoken Language Models: A Comprehensive Survey
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
-
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.