Voxtral Realtime
Pith reviewed 2026-05-16 02:07 UTC · model grok-4.3
The pith
Voxtral Realtime matches Whisper transcription quality at 480 milliseconds latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Voxtral Realtime achieves performance on par with Whisper at a delay of 480 ms. It does so through end-to-end training for streaming with explicit alignment between audio and text streams, building on the Delayed Streams Modeling framework with a new causal audio encoder and Ada RMS-Norm for improved delay conditioning, and scaling pretraining to a 13-language dataset.
What carries the argument
Delayed Streams Modeling framework with a new causal audio encoder and Ada RMS-Norm that conditions the model on delays while maintaining explicit audio-text alignment during end-to-end training.
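The paper names Ada RMS-Norm as the mechanism for delay conditioning but this page does not reproduce its formula. A minimal sketch of the idea, assuming the common adaptive-normalization pattern in which a conditioning vector (here, a hypothetical delay embedding) modulates the learned RMS-Norm gain; the function names and the `gain * (1 + delay)` modulation are illustrative assumptions, not the paper's stated formulation:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Standard RMS-Norm (Zhang & Sennrich, 2019): x / rms(x) * gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]

def ada_rms_norm(x, base_gain, delay_embedding, eps=1e-6):
    """Hypothetical delay-conditioned variant: an embedding of the
    target delay modulates the learned gain, so one model can be
    conditioned on different latency settings. The exact modulation
    used in the paper may differ; this only shows the conditioning idea."""
    gain = [g * (1.0 + d) for g, d in zip(base_gain, delay_embedding)]
    return rms_norm(x, gain, eps)
```

With a zero delay embedding the adaptive variant reduces to plain RMS-Norm, which is the usual sanity check for this kind of conditioning.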
If this is right
- Streaming ASR can reach offline accuracy without relying on chunking techniques.
- Low-latency transcription becomes feasible across multiple languages without quality loss.
- End-to-end streaming training provides a scalable path for future ASR models.
Where Pith is reading between the lines
- This training method may extend to other real-time sequence tasks like speech translation.
- Further reductions in delay could be explored while monitoring for quality degradation.
- The architecture might inspire causal adaptations in non-speech audio tasks.
Load-bearing premise
End-to-end training with explicit audio-text alignment and the new causal encoder plus Ada RMS-Norm maintains offline-level quality at low latency without introducing artifacts across the 13-language dataset.
What would settle it
Word error rate measurements on the 13-language test sets showing whether Voxtral Realtime at 480 ms matches or falls below Whisper's error rate.
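The parity question turns on word error rate. A minimal reference implementation of the standard metric, word-level edit distance divided by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,            # deletion
                          curr[j - 1] + 1,        # insertion
                          prev[j - 1] + (r != h)) # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

Published WER numbers typically also apply text normalization (casing, punctuation, number formatting) before scoring, which this sketch omits.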
Original abstract
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Voxtral Realtime, a natively streaming automatic speech recognition model trained end-to-end on the Delayed Streams Modeling framework with explicit audio-text alignment. It incorporates a new causal audio encoder and Ada RMS-Norm for delay conditioning, scales pretraining across a 13-language dataset, and claims to achieve performance parity with the offline Whisper model at a 480 ms delay while releasing the weights under Apache 2.0.
Significance. If the parity claim holds under rigorous evaluation, the work would be significant for demonstrating that end-to-end streaming training can match offline quality without chunking-induced artifacts, advancing real-time multilingual ASR. The open release of model weights is a clear strength supporting reproducibility and downstream applications.
Major comments (1)
- [Abstract] The central claim of performance parity with Whisper at 480 ms delay is stated without any quantitative metrics (e.g., WER, CER), error bars, dataset sizes, benchmark names, or ablation results, preventing verification of the result against the assumption that the causal encoder and Ada RMS-Norm maintain offline-level quality without introducing new artifacts across languages.
Minor comments (1)
- The description of 'delay' and its measurement protocol (e.g., how end-to-end latency is computed in the streaming setting) should be defined more explicitly, ideally with a diagram or equation in the methods section.
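One common convention for the measurement the comment asks about, offered here as an assumption rather than the paper's stated protocol: per-word delay is the wall-clock time a word is emitted minus the time its audio finished arriving, averaged over the utterance.

```python
def streaming_delays(word_audio_end_s, word_emit_s):
    """Per-word latency: emission time (s) minus the time (s) at which
    the word's audio finished arriving. One common definition of
    streaming ASR delay; protocols vary, and this is an illustrative
    assumption, not the paper's exact metric."""
    return [emit - end for end, emit in zip(word_audio_end_s, word_emit_s)]

def mean_delay(word_audio_end_s, word_emit_s):
    """Average per-word delay over an utterance."""
    d = streaming_delays(word_audio_end_s, word_emit_s)
    return sum(d) / len(d)
```

Under this definition, a headline figure like "480 ms" would be the mean (or a fixed percentile) of these per-word delays over a test set.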
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for the constructive feedback on the abstract. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim of performance parity with Whisper at 480 ms delay is stated without any quantitative metrics (e.g., WER, CER), error bars, dataset sizes, benchmark names, or ablation results, preventing verification of the result against the assumption that the causal encoder and Ada RMS-Norm maintain offline-level quality without introducing new artifacts across languages.
Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claim immediately verifiable. In the revised manuscript we will update the abstract to report specific WER numbers on the primary multilingual evaluation sets (including the 13-language Common Voice test set and LibriSpeech), the exact pretraining data scale, and a direct comparison to Whisper at the 480 ms delay. Detailed error bars, per-language breakdowns, and ablation studies on the causal encoder and Ada RMS-Norm remain in the experimental section (Section 4); the abstract revision will surface the headline parity result with supporting numbers while preserving its concise style.
Revision: yes
Forward citations
Cited by 1 Pith paper
- Tadabur: A Large-Scale Quran Audio Dataset. Over 1,400 hours of audio from 600+ reciters to support speech research and benchmarks.