Moshi: a speech-text foundation model for real-time dialogue
Pith reviewed 2026-05-12 08:07 UTC · model grok-4.3
The pith
Moshi treats spoken dialogue as parallel speech-to-speech generation from a text model backbone to enable real-time full-duplex interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Moshi is a speech-text foundation model that casts spoken dialogue as direct speech-to-speech generation. Starting from a text language model backbone, it produces speech as tokens from the residual quantizer of a neural audio codec while modeling its own speech and the user's speech as parallel streams. This removes explicit speaker turns and supports arbitrary conversational dynamics such as overlapping speech and interruptions. The model further extends prior hierarchical token generation by first predicting time-aligned text tokens as a prefix to the audio tokens; the authors call this the inner monologue and show that it improves the linguistic quality of generated speech while also providing streaming speech recognition and text-to-speech.
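To make the parallel-stream layout concrete, here is a minimal sketch of what the tokens for a single generation step could look like; the codebook count, the token values, and the Step/flatten helpers are illustrative assumptions, not details confirmed by the abstract.

```python
# Hypothetical sketch of Moshi-style parallel token streams. The codebook
# count and token values are illustrative; the abstract does not fix them.
from dataclasses import dataclass

NUM_CODEBOOKS = 8  # assumed residual-quantizer depth per audio stream

@dataclass
class Step:
    """All tokens handled for one codec frame."""
    text: int               # time-aligned text token (the inner-monologue prefix)
    moshi_audio: list[int]  # the model's own speech: one token per codebook
    user_audio: list[int]   # the user's speech, carried in a parallel stream

def flatten(step: Step) -> list[int]:
    # Text comes first within a frame, so the linguistic content is committed
    # before the acoustic tokens that realize it.
    return [step.text, *step.moshi_audio, *step.user_audio]

step = Step(text=421,
            moshi_audio=[3, 17, 250, 9, 44, 61, 7, 102],
            user_audio=[12, 88, 5, 301, 19, 76, 33, 240])
assert len(flatten(step)) == 1 + 2 * NUM_CODEBOOKS
```

Placing the text token ahead of the audio tokens within each frame is what lets it act as a prefix that conditions the acoustic output, which is the abstract's stated role for the inner monologue.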
What carries the argument
Parallel streams for user and system speech combined with an inner monologue that predicts time-aligned text tokens as a prefix to audio tokens.
If this is right
- The system can handle interruptions, interjections, and simultaneous speech without post-processing steps.
- Non-linguistic signals such as emotion and non-speech sounds remain available to shape the response.
- Streaming speech recognition and text-to-speech emerge directly from the same token-generation process.
- Theoretical latency drops to 160 ms, with 200 ms measured in practice, enabling immediate back-and-forth.
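As a rough arithmetic check on those figures, consider the sketch below; the 12.5 Hz frame rate and the overhead term are assumptions chosen to reproduce the stated numbers, not values given in the abstract.

```python
# Back-of-the-envelope latency arithmetic. The frame rate and overhead are
# assumed; the abstract only reports the resulting 160 ms / 200 ms figures.
FRAME_RATE_HZ = 12.5               # assumed codec frame rate
FRAME_MS = 1000.0 / FRAME_RATE_HZ  # 80 ms of audio per codec frame

# One frame to ingest user audio plus one frame of generation delay would
# account for the stated theoretical floor.
theoretical_ms = 2 * FRAME_MS                        # 160 ms
compute_overhead_ms = 40.0                           # assumed inference/buffering cost
practical_ms = theoretical_ms + compute_overhead_ms  # 200 ms

print(f"theoretical {theoretical_ms:.0f} ms, practical {practical_ms:.0f} ms")
```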
Where Pith is reading between the lines
- The parallel-stream design could extend to three or more participants by adding further independent audio streams.
- The inner-monologue prefix might transfer to other modalities, such as video tokens, to add visual context to dialogue.
- Real-world performance with background noise or diverse accents would require separate checks beyond the reported results.
- Such low-latency full-duplex models could support new uses like live translation or hands-free assistance tools.
Load-bearing premise
That jointly modeling parallel speech streams and prefixing audio tokens with aligned text will maintain coherence and quality across all conversational patterns without needing explicit turn segmentation or later corrections.
What would settle it
A live test in which the model produces incoherent replies or exceeds 200 ms latency during frequent interruptions and overlapping speech would show the parallel-stream plus inner-monologue method does not fully replace segmented pipelines.
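A harness for such a test might look like the sketch below; `model.step`, `out.reacted`, and the frame interface are invented placeholders for illustration, not Moshi's actual API.

```python
# Hypothetical probe for the falsification test above. `model.step` and
# `out.reacted` are invented placeholders, not Moshi's actual interface.
import time

LATENCY_BUDGET_MS = 200.0

def probe_interruption(model, user_frames, interrupt_at):
    """Inject user speech starting at frame `interrupt_at` and return the
    milliseconds until the model's own audio stream reacts (None if never)."""
    t_interrupt = None
    for i, frame in enumerate(user_frames):
        if i == interrupt_at:
            t_interrupt = time.monotonic()
        out = model.step(user_audio=frame)  # one codec frame in, one out
        if t_interrupt is not None and out.reacted:
            return (time.monotonic() - t_interrupt) * 1000.0
    return None

# A measured value above LATENCY_BUDGET_MS under frequent interruptions,
# or incoherent replies judged separately, would count against the claim.
```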
Original abstract
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Moshi, a speech-text foundation model for real-time full-duplex spoken dialogue. It replaces conventional pipeline components (VAD, ASR, text LLM, TTS) with a unified speech-to-speech generation approach based on a text LM backbone that produces residual-quantized audio tokens. User and system speech are modeled in parallel streams to remove explicit turn segmentation and handle overlaps/interruptions; an 'Inner Monologue' extension first predicts time-aligned text tokens as a prefix to the audio tokens. The resulting system is claimed to be the first real-time full-duplex spoken LLM, with 160 ms theoretical latency (200 ms measured) and open-source release at https://github.com/kyutai-labs/moshi.
Significance. If the central claims hold, the work is significant because it directly targets the three core limitations of current spoken dialogue systems (multi-second latency, loss of paralinguistic cues, and inability to model unsegmented overlaps). The parallel-stream architecture plus inner-monologue prefix constitute a clean architectural departure from turn-based pipelines. The open GitHub release supplies concrete artifacts that allow independent verification of the reported streaming latency and full-duplex behavior.
Major comments (1)
- [Abstract] The headline claim that the parallel user/system speech streams together with the inner-monologue text prefix produce coherent, high-quality responses 'across arbitrary conversational dynamics' without explicit turn segmentation is load-bearing for the 'first real-time full-duplex spoken LLM' assertion. The abstract supplies only high-level illustrations of quality gains and streaming capability; no quantitative ablations, error rates, or targeted metrics are reported for interruption handling, overlap resolution, or coherence degradation when user speech arrives mid-generation.
Minor comments (2)
- [Abstract] The abstract states that the inner-monologue method 'significantly improves the linguistic quality of generated speech' and 'can provide streaming speech recognition and text-to-speech,' yet supplies neither concrete metrics nor a pointer to the relevant results section or table.
- Notation for the residual quantizer and the parallel-stream tokenization should be introduced with a brief equation or diagram reference early in the manuscript to aid readers who are not already familiar with the neural audio codec literature.
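As an example of the kind of notation the second comment asks for, a generic residual vector quantization recursion could be stated as follows; the symbols are a standard textbook sketch, not the paper's own notation.

```latex
% Generic residual vector quantization (RVQ); a standard sketch, not the
% paper's notation. $z$ is the codec latent for one frame, $\mathrm{Q}_k$
% the $k$-th codebook quantizer, and each $q_k$ yields one discrete token.
\[
  r_0 = z, \qquad q_k = \mathrm{Q}_k(r_{k-1}), \qquad r_k = r_{k-1} - q_k,
  \qquad k = 1, \dots, K,
\]
\[
  z \;\approx\; \hat{z} \;=\; \sum_{k=1}^{K} q_k .
\]
```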
Simulated Author's Rebuttal
We thank the referee for highlighting the need to better substantiate the full-duplex claims in the abstract. We address this point directly below and propose a targeted revision.
Point-by-point responses
- Referee: [Abstract] The headline claim that the parallel user/system speech streams together with the inner-monologue text prefix produce coherent, high-quality responses 'across arbitrary conversational dynamics' without explicit turn segmentation is load-bearing for the 'first real-time full-duplex spoken LLM' assertion. The abstract supplies only high-level illustrations of quality gains and streaming capability; no quantitative ablations, error rates, or targeted metrics are reported for interruption handling, overlap resolution, or coherence degradation when user speech arrives mid-generation.
  Authors: We agree that the abstract, constrained by length, presents the claims at a high level without embedding specific quantitative metrics for interruption handling or coherence under mid-generation user speech. The manuscript body (Sections 3 and 4) provides the supporting architecture details, latency measurements (160 ms theoretical, 200 ms measured), qualitative demonstrations of overlap and interruption handling via parallel streams, and ablations of the inner-monologue prefix showing improved linguistic quality. No dedicated error-rate metrics (e.g., word error rate on overlapped segments or coherence scores under interruption) are reported. We will revise the abstract to (a) explicitly state the measured latency, (b) note that parallel streams enable modeling of arbitrary dynamics without turn segmentation, and (c) reference the evaluation sections for supporting evidence. This constitutes a partial revision focused on clarity rather than new experiments.
  Revision: partial
Circularity Check
No significant circularity in architectural claims or latency derivation
Full rationale
The paper's core contribution is an architectural framework that casts spoken dialogue as parallel-stream speech-to-speech generation with an inner-monologue text prefix. Latency bounds (160 ms theoretical, 200 ms practical) and full-duplex capability follow directly from the removal of explicit turn segmentation and the choice of residual quantizer tokens; these are design consequences, not quantities fitted to data and then re-labeled as predictions. No equations, self-definitional loops, or load-bearing self-citations are present in the provided derivation chain. The result is self-contained as an engineering system whose performance claims rest on implementation rather than tautological reduction to inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Neural audio codecs produce faithful discrete representations of speech signals suitable for autoregressive generation.
- Domain assumption: Transformer language models can be extended to joint text-audio token prediction without fundamental architectural changes.
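One concrete way the second assumption can be realized is to sum per-stream token embeddings into a single vector per frame before the transformer; the sketch below is an illustrative pattern with assumed vocabulary sizes and dimensions, not the paper's confirmed architecture.

```python
# Illustrative sketch: summing per-stream embeddings so a standard
# decoder-only transformer can model text + multi-codebook audio jointly.
# Vocab sizes, dimension, and codebook count are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, DIM, CODEBOOKS = 32_000, 2048, 512, 8

text_emb = nn.Embedding(TEXT_VOCAB, DIM)
audio_embs = nn.ModuleList(
    nn.Embedding(AUDIO_VOCAB, DIM) for _ in range(2 * CODEBOOKS)
)

def frame_input(text_tok, audio_toks):
    """Collapse one frame's tokens (1 text + 2*CODEBOOKS audio) to one vector."""
    x = text_emb(text_tok)
    for emb, tok in zip(audio_embs, audio_toks):
        x = x + emb(tok)
    return x  # shape (DIM,): one transformer position per codec frame

x = frame_input(torch.tensor(421),
                [torch.tensor(i % AUDIO_VOCAB) for i in range(16)])
assert x.shape == (DIM,)
```

Because each frame collapses to one transformer position, joint text-audio prediction needs no change to the backbone itself, which is what the assumption requires.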
Invented entities (2)
- Inner Monologue (no independent evidence)
- Parallel speech streams (no independent evidence)
Forward citations
Cited by 36 Pith papers
- Privacy Auditing with Zero (0) Training Run
  Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
- AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
  AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
- LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
  LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
- Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
  TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
- SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
  Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
- Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
  Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
  Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
  AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
  Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.
- Exploring Token-Space Manipulation in Latent Audio Tokenizers
  LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.
- PoDAR: Power-Disentangled Audio Representation for Generative Modeling
  PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...
- Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
  L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
- Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
  An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.
- Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
  A warm-up phase training VQ-VAEs as autoencoders first avoids dimensional collapse and yields better reconstruction and perceptual quality.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
  MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
- Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
  A contrastive LLM fine-tuning method creates joint embeddings for dialogue contexts and backchannel realizations, improving retrieval performance and alignment with human judgments over raw WavLM features.
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
  SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
  PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
- FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
  FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
  TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
- Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
  A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
- Sema: Semantic Transport for Real-Time Multimodal Agents
  Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
- Voxtral TTS
  Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
  PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...