pith. machine review for the scientific record.

arxiv: 2602.11298 · v3 · submitted 2026-02-11 · 💻 cs.AI

Recognition: no theorem link

Voxtral Realtime

Mistral-AI: Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen
Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Avi Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Margaret Jennings, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Minh-Quang Pham, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philippe Pinel, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Rupert Menneer, Sagar Vaze, Samuel Barry, Samuel Humeau, Sean Cha, Shashwat Verma, Siddhant Waghjale, Siddharth Gandhi, Simon Lepage, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Van Phung, Vedant Nanda, Victor Jouault, Vincent Maladière, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Havard, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yassine El Ouahidi, Yassir Bendou, Yihan Wang, Yimu Pan, Zaccharie Ramzi, Zhenlin Xu

Pith reviewed 2026-05-16 02:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords streaming ASR · low latency transcription · end-to-end model · multilingual ASR · causal audio encoder · Delayed Streams Modeling · Voxtral Realtime · real-time speech recognition

The pith

Voxtral Realtime matches Whisper transcription quality at 480 milliseconds latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Voxtral Realtime introduces a natively streaming automatic speech recognition model trained end-to-end for streaming rather than adapted from an offline system through chunking. It maintains explicit alignment between audio and text streams and adds a new causal audio encoder and Ada RMS-Norm for delay conditioning. Pretraining is scaled to a large dataset spanning 13 languages. At a 480 ms delay the model reaches performance on par with Whisper, the most widely deployed offline transcription system. A sympathetic reader cares because this would remove the traditional tradeoff between speed and accuracy in real-time speech applications.
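To make the streaming setup concrete, the sketch below shows a fixed-delay decoding loop in the delayed-streams style: the model consumes audio frame by frame and, once the text stream lags the audio stream by the target number of frames, emits one token (or a padding token) per frame. The 80 ms frame size, the `step`/`emit` interface, and the `<pad>` token are editorial assumptions for illustration, not the released Voxtral Realtime API.

```python
# Editorial sketch of a fixed-delay streaming loop (assumed interfaces, not the paper's API).
FRAME_MS = 80                      # assumed audio frame duration
DELAY_FRAMES = 480 // FRAME_MS     # 6 frames = 480 ms target delay
PAD = "<pad>"                      # hypothetical "no text yet" token

def stream_transcribe(audio_frames, model, state=None):
    """Yield text pieces as frames arrive; `model.step` and `model.emit` are hypothetical."""
    for t, frame in enumerate(audio_frames):
        state = model.step(frame, state)       # causal encoder + decoder update
        if t + 1 >= DELAY_FRAMES:              # text stream lags audio by DELAY_FRAMES
            token = model.emit(state)          # aligned to frame t + 1 - DELAY_FRAMES
            if token != PAD:
                yield token
```

Because the lag is fixed per session, nothing in this loop depends on chunk boundaries or sliding windows; each frame is processed exactly once, which is the contrast the paper draws with offline models adapted via chunking.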

Core claim

Trained end-to-end for streaming within the Delayed Streams Modeling framework, with explicit alignment between audio and text streams, a new causal audio encoder, and Ada RMS-Norm for improved delay conditioning, and pretrained on a dataset spanning 13 languages, Voxtral Realtime achieves performance on par with Whisper at a delay of 480 ms.

What carries the argument

Delayed Streams Modeling framework with a new causal audio encoder and Ada RMS-Norm that conditions the model on delays while maintaining explicit audio-text alignment during end-to-end training.
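As a rough illustration of what delay conditioning through the normalization layer could look like, here is a minimal PyTorch sketch in which an RMS-norm's per-channel gain is modulated by a learned function of a delay embedding. The exact form of Ada RMS-Norm in the paper may differ; the class name, shapes, and conditioning path below are assumptions.

```python
import torch
import torch.nn as nn

class AdaRMSNorm(nn.Module):
    """Hypothetical delay-conditioned RMS-norm (editorial sketch, not the paper's layer)."""

    def __init__(self, dim: int, cond_dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.to_gain = nn.Linear(cond_dim, dim)
        nn.init.zeros_(self.to_gain.weight)   # start at the identity gain of 1
        nn.init.zeros_(self.to_gain.bias)

    def forward(self, x: torch.Tensor, delay_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); delay_emb: (batch, cond_dim), e.g. an embedding of the target delay.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        gain = 1.0 + self.to_gain(delay_emb).unsqueeze(1)   # (batch, 1, dim)
        return x * rms * gain
```

Conditioning through the normalization gain is only one plausible reading; Figure 3's ablation of delay-conditioning mechanisms is where the paper actually compares alternatives.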

If this is right

  • Streaming ASR can reach offline accuracy without relying on chunking techniques.
  • Low-latency transcription becomes feasible across multiple languages without quality loss.
  • End-to-end streaming training provides a scalable path for future ASR models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This training method may extend to other real-time sequence tasks like speech translation.
  • Further reductions in delay could be explored while monitoring for quality degradation.
  • The architecture might inspire causal adaptations in non-speech audio tasks.

Load-bearing premise

End-to-end training with explicit audio-text alignment and the new causal encoder plus Ada RMS-Norm maintains offline-level quality at low latency without introducing artifacts across the 13-language dataset.

What would settle it

Word error rate measurements on the 13-language test sets comparing Voxtral Realtime at 480 ms against Whisper; the parity claim stands or falls on whether the streaming model's error rate exceeds Whisper's.
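A minimal sketch of what that measurement looks like in practice, assuming the `jiwer` package and placeholder transcripts; the real test would use the paper's 13-language test sets with matched text normalization for both systems.

```python
# Editorial sketch: corpus-level WER for the streaming system at 480 ms vs. an offline baseline.
import jiwer

references     = ["the quick brown fox", "hello world"]   # placeholder ground truth
streaming_hyps = ["the quick brown fox", "hello word"]    # placeholder 480 ms outputs
offline_hyps   = ["the quick brown fox", "hello world"]   # placeholder Whisper outputs

wer_streaming = jiwer.wer(references, streaming_hyps)
wer_offline   = jiwer.wer(references, offline_hyps)
print(f"streaming WER: {wer_streaming:.3f}  offline WER: {wer_offline:.3f}")
```

Parity would then be judged per language by whether wer_streaming stays at or below wer_offline under identical normalization.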

Figures

Figures reproduced from arXiv:2602.11298.

Figure 1: Voxtral Realtime approaches offline accuracy at sub-second latency.
Figure 2: Voxtral Realtime architecture and decoding scheme for a target delay.
Figure 3: Ablation of delay-conditioning mechanisms.
Figure 4: Ablation of target construction schemes.
Figure 5: Voxtral streaming session via vLLM resumable requests.
Original abstract

We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Voxtral Realtime, a natively streaming automatic speech recognition model trained end-to-end on the Delayed Streams Modeling framework with explicit audio-text alignment. It incorporates a new causal audio encoder and Ada RMS-Norm for delay conditioning, scales pretraining across a 13-language dataset, and claims to achieve performance parity with the offline Whisper model at a 480 ms delay while releasing the weights under Apache 2.0.

Significance. If the parity claim holds under rigorous evaluation, the work would be significant for demonstrating that end-to-end streaming training can match offline quality without chunking-induced artifacts, advancing real-time multilingual ASR. The open release of model weights is a clear strength supporting reproducibility and downstream applications.

major comments (1)
  1. [Abstract] The central claim of performance parity with Whisper at a 480 ms delay is stated without quantitative metrics (e.g., WER, CER), error bars, dataset sizes, benchmark names, or ablation results. This prevents verification both of the result and of the underlying assumption that the causal encoder and Ada RMS-Norm maintain offline-level quality without introducing new artifacts across languages.
minor comments (1)
  1. 'Delay' and its measurement protocol (e.g., how end-to-end latency is computed in the streaming setting) should be defined more explicitly, ideally with a diagram or equation in the methods section.
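One possible formalization, offered as an editorial sketch rather than the paper's stated protocol: define the per-word delay as the wall-clock emission time minus the end time of that word's audio, and report the mean or a high percentile over a test set.

```python
# Editorial sketch of a per-word latency protocol (not the paper's definition).
from statistics import mean, quantiles

def summarize_delays(word_audio_end_s, word_emit_s):
    """Both inputs are lists of timestamps in seconds on a shared clock."""
    delays = [emit - end for end, emit in zip(word_audio_end_s, word_emit_s)]
    return {
        "mean_s": mean(delays),
        "p90_s": quantiles(delays, n=10)[-1],   # 90th-percentile delay
    }

# Example: three words whose audio ends at 1.0 s, 1.5 s, 2.2 s and which the
# streaming decoder emits 0.45-0.52 s later.
print(summarize_delays([1.0, 1.5, 2.2], [1.48, 1.95, 2.72]))
```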

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for the constructive feedback on the abstract. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim of performance parity with Whisper at a 480 ms delay is stated without quantitative metrics (e.g., WER, CER), error bars, dataset sizes, benchmark names, or ablation results. This prevents verification both of the result and of the underlying assumption that the causal encoder and Ada RMS-Norm maintain offline-level quality without introducing new artifacts across languages.

    Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claim immediately verifiable. In the revised manuscript we will update the abstract to report specific WER numbers on the primary multilingual evaluation sets (including the 13-language Common Voice test set and LibriSpeech), the exact pretraining data scale, and a direct comparison to Whisper at the 480 ms delay. Detailed error bars, per-language breakdowns, and ablation studies on the causal encoder and Ada RMS-Norm remain in the experimental section (Section 4) as before; the abstract revision will simply surface the headline parity result with supporting numbers while preserving its concise style. Revision: yes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the model is described at a high level only.

pith-pipeline@v0.9.0 · 6200 in / 1096 out tokens · 85058 ms · 2026-05-16T02:07:07.841846+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tadabur: A Large-Scale Quran Audio Dataset

    cs.SD · 2026-04 · unverdicted · novelty 7.0

    Tadabur is a large-scale Quran audio dataset with over 1400 hours from 600+ reciters to support speech research and benchmarks.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 1 Pith paper · 10 internal anchors
