Voxtral Realtime
Pith reviewed 2026-05-16 02:07 UTC · model grok-4.3
The pith
Voxtral Realtime matches Whisper transcription quality at 480 milliseconds latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Voxtral Realtime achieves performance on par with Whisper at a delay of 480 ms. It does so through end-to-end training for streaming with explicit alignment between audio and text streams, building on the Delayed Streams Modeling framework with a new causal audio encoder and Ada RMS-Norm for improved delay conditioning, and scaling pretraining to a 13-language dataset.
What carries the argument
Delayed Streams Modeling framework with a new causal audio encoder and Ada RMS-Norm that conditions the model on delays while maintaining explicit audio-text alignment during end-to-end training.
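The paper names Ada RMS-Norm as the mechanism for delay conditioning but this page does not reproduce its formula. A minimal sketch of the idea, assuming the common adaptive-normalization pattern in which a conditioning vector (here, a hypothetical delay embedding) modulates the learned RMS-Norm gain; the function names and the `gain * (1 + delay)` modulation are illustrative assumptions, not the paper's stated formulation:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Standard RMS-Norm (Zhang & Sennrich, 2019): x / rms(x) * gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

def ada_rms_norm(x, base_gain, delay_embedding, eps=1e-6):
    """Hypothetical delay-conditioned variant: an embedding of the
    target delay modulates the learned gain, so one model can be
    conditioned on different latency settings. The exact modulation
    used in the paper may differ; this only shows the conditioning idea."""
    gain = [g * (1.0 + d) for g, d in zip(base_gain, delay_embedding)]
    return rms_norm(x, gain, eps)
```

With a zero delay embedding the adaptive variant reduces to plain RMS-Norm, which is the usual sanity check for this kind of conditioning.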
If this is right
- Streaming ASR can reach offline accuracy without relying on chunking techniques.
- Low-latency transcription becomes feasible across multiple languages without quality loss.
- End-to-end streaming training provides a scalable path for future ASR models.
Where Pith is reading between the lines
- This training method may extend to other real-time sequence tasks like speech translation.
- Further reductions in delay could be explored while monitoring for quality degradation.
- The architecture might inspire causal adaptations in non-speech audio tasks.
Load-bearing premise
End-to-end training with explicit audio-text alignment and the new causal encoder plus Ada RMS-Norm maintains offline-level quality at low latency without introducing artifacts across the 13-language dataset.
What would settle it
Word error rate measurements on the 13-language test sets showing whether Voxtral Realtime at 480 ms matches or falls below Whisper's error rate.
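The parity question turns on word error rate. A minimal reference implementation of the standard metric, word-level edit distance divided by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,            # deletion
                          curr[j - 1] + 1,        # insertion
                          prev[j - 1] + (r != h)) # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

Published WER numbers typically also apply text normalization (casing, punctuation, number formatting) before scoring, which this sketch omits.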
Original abstract
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Voxtral Realtime, a natively streaming automatic speech recognition model trained end-to-end on the Delayed Streams Modeling framework with explicit audio-text alignment. It incorporates a new causal audio encoder and Ada RMS-Norm for delay conditioning, scales pretraining across a 13-language dataset, and claims to achieve performance parity with the offline Whisper model at a 480 ms delay while releasing the weights under Apache 2.0.
Significance. If the parity claim holds under rigorous evaluation, the work would be significant for demonstrating that end-to-end streaming training can match offline quality without chunking-induced artifacts, advancing real-time multilingual ASR. The open release of model weights is a clear strength supporting reproducibility and downstream applications.
Major comments (1)
- [Abstract] The central claim of performance parity with Whisper at 480 ms delay is stated without any quantitative metrics (e.g., WER, CER), error bars, dataset sizes, benchmark names, or ablation results, preventing verification of the result against the assumption that the causal encoder and Ada RMS-Norm maintain offline-level quality without introducing new artifacts across languages.
Minor comments (1)
- The description of 'delay' and its measurement protocol (e.g., how end-to-end latency is computed in the streaming setting) should be defined more explicitly, ideally with a diagram or equation in the methods section.
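One common convention for the measurement the comment asks about, offered here as an assumption rather than the paper's stated protocol: per-word delay is the wall-clock time a word is emitted minus the time its audio finished arriving, averaged over the utterance.

```python
def streaming_delays(word_audio_end_s, word_emit_s):
    """Per-word latency: emission time (s) minus the time (s) at which
    the word's audio finished arriving. One common definition of
    streaming ASR delay; protocols vary, and this is an illustrative
    assumption, not the paper's exact metric."""
    return [emit - end for end, emit in zip(word_audio_end_s, word_emit_s)]

def mean_delay(word_audio_end_s, word_emit_s):
    """Average per-word delay over an utterance."""
    d = streaming_delays(word_audio_end_s, word_emit_s)
    return sum(d) / len(d)
```

Under this definition, a headline figure like "480 ms" would be the mean (or a fixed percentile) of these per-word delays over a test set.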
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for the constructive feedback on the abstract. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim of performance parity with Whisper at 480 ms delay is stated without any quantitative metrics (e.g., WER, CER), error bars, dataset sizes, benchmark names, or ablation results, preventing verification of the result against the assumption that the causal encoder and Ada RMS-Norm maintain offline-level quality without introducing new artifacts across languages.
Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claim immediately verifiable. In the revised manuscript we will update the abstract to report specific WER numbers on the primary multilingual evaluation sets (including the 13-language Common Voice test set and LibriSpeech), the exact pretraining data scale, and a direct comparison to Whisper at the 480 ms delay. Detailed error bars, per-language breakdowns, and ablation studies on the causal encoder and Ada RMS-Norm remain in the experimental section (Section 4); the abstract revision will surface the headline parity result with supporting numbers while preserving its concise style.
Revision: yes
Forward citations
Cited by 1 Pith paper
- Tadabur: A Large-Scale Quran Audio Dataset. Over 1,400 hours of audio from 600+ reciters to support speech research and benchmarks.