Moshi: a speech-text foundation model for real-time dialogue
Pith reviewed 2026-05-12 08:07 UTC · model grok-4.3
The pith
Moshi treats spoken dialogue as parallel speech-to-speech generation from a text model backbone to enable real-time full-duplex interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Moshi is a speech-text foundation model that casts spoken dialogue as direct speech-to-speech generation. Starting from a text language model backbone, it produces speech as tokens from the residual quantizer of a neural audio codec while modeling its own speech and the user's speech as parallel streams. This removes explicit speaker turns and supports arbitrary conversational dynamics such as overlapping speech and interruptions. The model further extends prior hierarchical token generation by first predicting time-aligned text tokens as a prefix to the audio tokens; the authors call this the inner monologue and show that it improves the linguistic quality of generated speech while also providing streaming speech recognition and text-to-speech.
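To make the parallel-stream layout concrete, here is a minimal sketch of what the tokens for a single generation step could look like; the codebook count, the token values, and the Step/flatten helpers are illustrative assumptions, not details confirmed by the abstract.

```python
# Hypothetical sketch of Moshi-style parallel token streams. The codebook
# count and token values are illustrative; the abstract does not fix them.
from dataclasses import dataclass

NUM_CODEBOOKS = 8  # assumed residual-quantizer depth per audio stream

@dataclass
class Step:
    """All tokens handled for one codec frame."""
    text: int               # time-aligned text token (the inner-monologue prefix)
    moshi_audio: list[int]  # the model's own speech: one token per codebook
    user_audio: list[int]   # the user's speech, carried in a parallel stream

def flatten(step: Step) -> list[int]:
    # Text comes first within a frame, so the linguistic content is committed
    # before the acoustic tokens that realize it.
    return [step.text, *step.moshi_audio, *step.user_audio]

step = Step(text=421,
            moshi_audio=[3, 17, 250, 9, 44, 61, 7, 102],
            user_audio=[12, 88, 5, 301, 19, 76, 33, 240])
assert len(flatten(step)) == 1 + 2 * NUM_CODEBOOKS
```

Placing the text token ahead of the audio tokens within each frame is what lets it act as a prefix that conditions the acoustic output, which is the abstract's stated role for the inner monologue.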
What carries the argument
Parallel streams for user and system speech combined with an inner monologue that predicts time-aligned text tokens as a prefix to audio tokens.
If this is right
- The system can handle interruptions, interjections, and simultaneous speech without post-processing steps.
- Non-linguistic signals such as emotion and non-speech sounds remain available to shape the response.
- Streaming speech recognition and text-to-speech emerge directly from the same token-generation process.
- Theoretical latency drops to 160 ms, with 200 ms measured in practice, enabling immediate back-and-forth.
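As a rough arithmetic check on those figures, consider the sketch below; the 12.5 Hz frame rate and the overhead term are assumptions chosen to reproduce the stated numbers, not values given in the abstract.

```python
# Back-of-the-envelope latency arithmetic. The frame rate and overhead are
# assumed; the abstract only reports the resulting 160 ms / 200 ms figures.
FRAME_RATE_HZ = 12.5               # assumed codec frame rate
FRAME_MS = 1000.0 / FRAME_RATE_HZ  # 80 ms of audio per codec frame

# One frame to ingest user audio plus one frame of generation delay would
# account for the stated theoretical floor.
theoretical_ms = 2 * FRAME_MS                        # 160 ms
compute_overhead_ms = 40.0                           # assumed inference/buffering cost
practical_ms = theoretical_ms + compute_overhead_ms  # 200 ms

print(f"theoretical {theoretical_ms:.0f} ms, practical {practical_ms:.0f} ms")
```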
Where Pith is reading between the lines
- The parallel-stream design could extend to three or more participants by adding further independent audio streams.
- The inner-monologue prefix might transfer to other modalities, such as video tokens, to add visual context to dialogue.
- Real-world performance with background noise or diverse accents would require separate checks beyond the reported results.
- Such low-latency full-duplex models could support new uses like live translation or hands-free assistance tools.
Load-bearing premise
That jointly modeling parallel speech streams and prefixing audio tokens with aligned text will maintain coherence and quality across all conversational patterns without needing explicit turn segmentation or later corrections.
What would settle it
A live test in which the model produces incoherent replies or exceeds 200 ms latency during frequent interruptions and overlapping speech would show the parallel-stream plus inner-monologue method does not fully replace segmented pipelines.
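A harness for such a test might look like the sketch below; `model.step`, `out.reacted`, and the frame interface are invented placeholders for illustration, not Moshi's actual API.

```python
# Hypothetical probe for the falsification test above. `model.step` and
# `out.reacted` are invented placeholders, not Moshi's actual interface.
import time

LATENCY_BUDGET_MS = 200.0

def probe_interruption(model, user_frames, interrupt_at):
    """Inject user speech starting at frame `interrupt_at` and return the
    milliseconds until the model's own audio stream reacts (None if never)."""
    t_interrupt = None
    for i, frame in enumerate(user_frames):
        if i == interrupt_at:
            t_interrupt = time.monotonic()
        out = model.step(user_audio=frame)  # one codec frame in, one out
        if t_interrupt is not None and out.reacted:
            return (time.monotonic() - t_interrupt) * 1000.0
    return None

# A measured value above LATENCY_BUDGET_MS under frequent interruptions,
# or incoherent replies judged separately, would count against the claim.
```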
Original abstract
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Moshi, a speech-text foundation model for real-time full-duplex spoken dialogue. It replaces conventional pipeline components (VAD, ASR, text LLM, TTS) with a unified speech-to-speech generation approach based on a text LM backbone that produces residual-quantized audio tokens. User and system speech are modeled in parallel streams to remove explicit turn segmentation and handle overlaps/interruptions; an 'Inner Monologue' extension first predicts time-aligned text tokens as a prefix to the audio tokens. The resulting system is claimed to be the first real-time full-duplex spoken LLM, with 160 ms theoretical latency (200 ms measured) and open-source release at https://github.com/kyutai-labs/moshi.
Significance. If the central claims hold, the work is significant because it directly targets the three core limitations of current spoken dialogue systems (multi-second latency, loss of paralinguistic cues, and inability to model unsegmented overlaps). The parallel-stream architecture plus inner-monologue prefix constitute a clean architectural departure from turn-based pipelines. The open GitHub release supplies concrete artifacts that allow independent verification of the reported streaming latency and full-duplex behavior.
Major comments (1)
- [Abstract] The headline claim that the parallel user/system speech streams together with the inner-monologue text prefix produce coherent, high-quality responses 'across arbitrary conversational dynamics' without explicit turn segmentation is load-bearing for the 'first real-time full-duplex spoken LLM' assertion. The abstract supplies only high-level illustrations of quality gains and streaming capability; no quantitative ablations, error rates, or targeted metrics are reported for interruption handling, overlap resolution, or coherence degradation when user speech arrives mid-generation.
Minor comments (2)
- [Abstract] The abstract states that the inner-monologue method 'significantly improves the linguistic quality of generated speech' and 'can provide streaming speech recognition and text-to-speech,' yet supplies neither concrete metrics nor a pointer to the relevant results section or table.
- Notation for the residual quantizer and the parallel-stream tokenization should be introduced with a brief equation or diagram reference early in the manuscript to aid readers who are not already familiar with the neural audio codec literature.
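As an example of the kind of notation the second comment asks for, a generic residual vector quantization recursion could be stated as follows; the symbols are a standard textbook sketch, not the paper's own notation.

```latex
% Generic residual vector quantization (RVQ); a standard sketch, not the
% paper's notation. $z$ is the codec latent for one frame, $\mathrm{Q}_k$
% the $k$-th codebook quantizer, and each $q_k$ yields one discrete token.
\[
  r_0 = z, \qquad q_k = \mathrm{Q}_k(r_{k-1}), \qquad r_k = r_{k-1} - q_k,
  \qquad k = 1, \dots, K,
\]
\[
  z \;\approx\; \hat{z} \;=\; \sum_{k=1}^{K} q_k .
\]
```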
Simulated Author's Rebuttal
We thank the referee for highlighting the need to better substantiate the full-duplex claims in the abstract. We address this point directly below and propose a targeted revision.
Point-by-point responses
- Referee: [Abstract] The headline claim that the parallel user/system speech streams together with the inner-monologue text prefix produce coherent, high-quality responses 'across arbitrary conversational dynamics' without explicit turn segmentation is load-bearing for the 'first real-time full-duplex spoken LLM' assertion. The abstract supplies only high-level illustrations of quality gains and streaming capability; no quantitative ablations, error rates, or targeted metrics are reported for interruption handling, overlap resolution, or coherence degradation when user speech arrives mid-generation.
  Authors: We agree that the abstract, constrained by length, presents the claims at a high level without embedding specific quantitative metrics for interruption handling or coherence under mid-generation user speech. The manuscript body (Sections 3 and 4) provides the supporting architecture details, latency measurements (160 ms theoretical, 200 ms measured), qualitative demonstrations of overlap and interruption handling via parallel streams, and ablations of the inner-monologue prefix showing improved linguistic quality. No dedicated error-rate metrics (e.g., word error rate on overlapped segments or coherence scores under interruption) are reported. We will revise the abstract to (a) explicitly state the measured latency, (b) note that parallel streams enable modeling of arbitrary dynamics without turn segmentation, and (c) reference the evaluation sections for supporting evidence. This constitutes a partial revision focused on clarity rather than new experiments.
  Revision: partial
Circularity Check
No significant circularity in architectural claims or latency derivation
Full rationale
The paper's core contribution is an architectural framework that casts spoken dialogue as parallel-stream speech-to-speech generation with an inner-monologue text prefix. Latency bounds (160 ms theoretical, 200 ms practical) and full-duplex capability follow directly from the removal of explicit turn segmentation and the choice of residual quantizer tokens; these are design consequences, not quantities fitted to data and then re-labeled as predictions. No equations, self-definitional loops, or load-bearing self-citations are present in the provided derivation chain. The result is self-contained as an engineering system whose performance claims rest on implementation rather than tautological reduction to inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Neural audio codecs produce faithful discrete representations of speech signals suitable for autoregressive generation.
- Domain assumption: Transformer language models can be extended to joint text-audio token prediction without fundamental architectural changes.
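One concrete way the second assumption can be realized is to sum per-stream token embeddings into a single vector per frame before the transformer; the sketch below is an illustrative pattern with assumed vocabulary sizes and dimensions, not the paper's confirmed architecture.

```python
# Illustrative sketch: summing per-stream embeddings so a standard
# decoder-only transformer can model text + multi-codebook audio jointly.
# Vocab sizes, dimension, and codebook count are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, DIM, CODEBOOKS = 32_000, 2048, 512, 8

text_emb = nn.Embedding(TEXT_VOCAB, DIM)
audio_embs = nn.ModuleList(
    nn.Embedding(AUDIO_VOCAB, DIM) for _ in range(2 * CODEBOOKS)
)

def frame_input(text_tok, audio_toks):
    """Collapse one frame's tokens (1 text + 2*CODEBOOKS audio) to one vector."""
    x = text_emb(text_tok)
    for emb, tok in zip(audio_embs, audio_toks):
        x = x + emb(tok)
    return x  # shape (DIM,): one transformer position per codec frame

x = frame_input(torch.tensor(421),
                [torch.tensor(i % AUDIO_VOCAB) for i in range(16)])
assert x.shape == (DIM,)
```

Because each frame collapses to one transformer position, joint text-audio prediction needs no change to the backbone itself, which is what the assumption requires.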
Invented entities (2)
- Inner Monologue (no independent evidence)
- Parallel speech streams (no independent evidence)
Forward citations
Cited by 36 Pith papers
- Privacy Auditing with Zero (0) Training Run
  Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
- AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
  AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
- LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
  LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
- Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
  TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
- SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
  Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
- Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
  Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
  Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
  AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
  Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.
- Exploring Token-Space Manipulation in Latent Audio Tokenizers
  LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.
- PoDAR: Power-Disentangled Audio Representation for Generative Modeling
  PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...
- Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
  L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
- Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
  An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.
- Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse
  A warm-up phase training VQ-VAEs as autoencoders first avoids dimensional collapse and yields better reconstruction and perceptual quality.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model
  MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
- Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
  A contrastive LLM fine-tuning method creates joint embeddings for dialogue contexts and backchannel realizations, improving retrieval performance and alignment with human judgments over raw WavLM features.
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
  SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
  PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
- FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
  FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
  TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
- Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
  A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
- Sema: Semantic Transport for Real-Time Multimodal Agents
  Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
- Voxtral TTS
  Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
  PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...