MedASR: An Open-Source Model for High-Accuracy Medical Dictation
Pith reviewed 2026-05-19 21:15 UTC · model grok-4.3
The pith
MedASR is a 105M-parameter open-source model that achieves a 58% relative WER reduction on medical dictation versus Whisper Large-v3.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedASR achieves a 58% relative WER reduction on the Eye Gaze dataset compared to Whisper Large-v3 through targeted solutions for clinical data imbalance, long-form modeling, and accurate pseudo-streaming inference in a 105M-parameter open-source model.
What carries the argument
The pseudo-streaming sliding-window inference approach that enables accurate transcription of long medical dictations while keeping the model small and fast.
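To make the mechanism concrete, here is a minimal sketch of sliding-window inference in which overlapping windows are fused by a Hann-weighted average of their per-frame posteriors. The paper's exact Temporal Posterior Fusion rule is not reproduced on this page, so the averaging rule, the names `hann` and `fuse_windows`, and the `score_window` callback standing in for the acoustic model are all assumptions made for illustration.

```python
import math


def hann(n):
    """Hann weights of length n, offset by half a sample so every weight is positive."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / n) for i in range(n)]


def fuse_windows(num_frames, window, stride, score_window):
    """Hann-weighted average of per-frame posteriors from overlapping windows.

    score_window(lo, hi) is a hypothetical stand-in for the model: it returns
    one posterior vector (a list of floats) for each frame in [lo, hi).
    """
    assert 0 < stride <= window, "windows must overlap or touch"
    weights = hann(window)
    acc = [None] * num_frames   # weighted sum of posteriors per frame
    norm = [0.0] * num_frames   # sum of weights per frame
    # Start before frame 0 so that early frames are seen by several windows,
    # loosely mirroring the zero-padding step quoted in the reference graph.
    start = stride - window
    while start < num_frames:
        lo, hi = max(start, 0), min(start + window, num_frames)
        posteriors = score_window(lo, hi)
        for t in range(lo, hi):
            w = weights[t - start]
            p = posteriors[t - lo]
            if acc[t] is None:
                acc[t] = [0.0] * len(p)
            for k, value in enumerate(p):
                acc[t][k] += w * value
            norm[t] += w
        start += stride
    return [[v / norm[t] for v in acc[t]] for t in range(num_frames)]
```

The function works in encoder frames rather than seconds. At the 25 Hz encoder frame rate quoted in the reference graph, the example stride of 320 ms corresponds to 8 frames.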
If this is right
- Open-source tools can deliver specialized medical transcription performance that rivals or exceeds larger general models.
- Healthcare systems gain a transparent alternative to proprietary dictation software for clinical documentation.
- Domain-specific fine-tuning on limited clinical data becomes sufficient for high-accuracy results when paired with efficient inference techniques.
- Smaller models can be deployed in resource-constrained medical environments without sacrificing transcription quality.
Where Pith is reading between the lines
- The same data-handling and sliding-window techniques might transfer to other high-stakes dictation domains such as legal or technical speech.
- Wider adoption of the open-sourced model could enable community-driven adaptations for accents, dialects, or non-English medical terms.
- If the performance holds on broader datasets, it suggests that targeted engineering can close the gap between general and domain-specific ASR without scaling model size.
Load-bearing premise
The Eye Gaze dataset and evaluation protocol provide a fair and representative measure of real-world medical dictation performance.
What would settle it
Evaluating MedASR on an independent medical dictation dataset collected under different recording conditions or specialties and checking whether the 58% relative WER reduction versus Whisper Large-v3 is replicated.
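A replication check of this kind reduces to comparing absolute WERs. The short sketch below shows the standard relative-reduction formula and why a relative figure alone is hard to interpret; the absolute WER pairs are hypothetical values chosen for illustration, not numbers from the paper.

```python
def relative_wer_reduction(baseline_wer, model_wer):
    """Relative WER reduction of a model versus a baseline, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# Hypothetical absolute WER pairs (not from the paper). Both yield the same
# relative reduction even though the absolute error levels differ widely.
for baseline, model in [(0.10, 0.042), (0.50, 0.21)]:
    reduction = relative_wer_reduction(baseline, model)
    print(f"baseline {baseline:.1%}, model {model:.1%}: {reduction:.0%} relative reduction")
```

Because very different absolute error rates can produce the same relative figure, a replication would need to report both systems' absolute WER on the new dataset.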
Original abstract
We present MedASR, an open-source 105M-parameter model engineered for high-accuracy medical dictation. Prioritizing a "small, fast, and accurate" design, MedASR addresses three core pillars: (1) Data: overcoming clinical corpora scarcity and class imbalance; (2) Modeling: efficient long-form training; and (3) Inference: accurate transcription via a pseudo-streaming sliding-window approach. Our evaluation shows that MedASR achieves a 58% relative WER reduction on Eye Gaze compared to Whisper Large-v3. By open-sourcing MedASR, we provide a transparent, high-performance backbone for specialized health-care applications, breaking down the barriers to clinical documentation often obscured by proprietary systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MedASR, an open-source 105M-parameter ASR model for medical dictation. It identifies three pillars—data curation to address clinical corpus scarcity and imbalance, modeling for efficient long-form training, and pseudo-streaming sliding-window inference—and reports a 58% relative WER reduction on the Eye Gaze dataset relative to Whisper Large-v3.
Significance. If the performance claims can be substantiated with complete evaluation details, the work would supply a compact, open-source backbone that lowers barriers to specialized clinical documentation. The emphasis on open-sourcing and the focus on practical inference constraints are positive contributions, but the current evidence base is too thin to evaluate whether the reported gains generalize or arise from the claimed architectural and data choices.
major comments (1)
- [Abstract] Abstract: The central claim of a 58% relative WER reduction on Eye Gaze is presented without absolute WER numbers for MedASR or Whisper Large-v3, without test-set size or speaker statistics, and without explicit confirmation that the same audio files, segmentation, and medical-vocabulary handling were used for both systems. These omissions make it impossible to determine whether the reported gain reflects the three pillars or an evaluation mismatch.
minor comments (1)
- The manuscript should supply a reference or brief description of the Eye Gaze dataset and state whether it is publicly available.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on ensuring that performance claims are presented with sufficient context and transparency. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our evaluation results.
-
Referee: [Abstract] Abstract: The central claim of a 58% relative WER reduction on Eye Gaze is presented without absolute WER numbers for MedASR or Whisper Large-v3, without test-set size or speaker statistics, and without explicit confirmation that the same audio files, segmentation, and medical-vocabulary handling were used for both systems. These omissions make it impossible to determine whether the reported gain reflects the three pillars or an evaluation mismatch.
Authors: We agree that absolute WER values and supporting evaluation details are necessary for readers to properly interpret the relative improvement. In the revised manuscript, we will update the abstract to report the absolute WER for MedASR and for Whisper Large-v3 on the Eye Gaze dataset. We will also include concise information on test-set size (number of utterances and total duration), speaker statistics, and an explicit statement that both systems were evaluated on identical audio files with the same segmentation and medical-vocabulary handling. These details already appear in the Experiments section; we will summarize them in the abstract to eliminate any ambiguity about the fairness of the comparison.
Revision: yes
Circularity Check
No circularity: purely empirical model presentation
The paper describes an empirical ASR model (MedASR) built on three pillars of data handling, modeling, and inference, with a central performance claim consisting of a relative WER reduction measured against an external baseline (Whisper Large-v3) on the Eye Gaze dataset. No equations, derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The result is an external empirical comparison rather than any chain that reduces to the paper's own inputs by construction, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: MedASR utilizes a 105M-parameter Conformer-L backbone … CTC objective … Temporal Posterior Fusion … sliding window of fixed length W and stride S … Hann window as w
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We demonstrate that MedASR achieves superior performance … 58% WER reduction on Eye Gaze
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Introduction. The administrative burden of clinical documentation is a primary driver of physician burnout, creating an urgent need for robust Automated Speech Recognition (ASR) systems [1, 2]. While general-purpose foundation models have achieved remarkable versatility [3, 4], they often lack the domain-specific grounding and structural awareness requ...
-
[2]
The MedASR Foundation MedASR is built on a 105M-parameter Conformer architecture
-
[3]
MedASR: An Open-Source Model for High-Accuracy Medical Dictation
and trained using a JAX-based [7] framework. To address the development hurdles outlined previously, we implemented the following strategies: 2.1. Data Scarcity, Acoustic Imbalance, and Formatting The primary bottleneck in medical ASR is the acquisition of large-scale, high-fidelity audio that is both clinically relevant and acoustically diverse. We ident...
-
[4]
reduce the encoder frame rate to 25 Hz. • Conformer Encoder: 17 layers, 512 activations, and 8 attention heads. Refinements: We deviate from the original implementation by using Rotary Positional Embeddings (RoPE) [17], and removing biases in all layer normalization and dense layers to improve stability [18]. Consistency Regularization: To enhance robust...
-
[5]
Bootstrapping: A seed model was trained exclusively on a subset of the data containing naturally occurring short sequences (up to 36s)
-
[6]
Forced Alignment: This seed model was utilized to perform forced alignment on a fused CTC lattice (Section 2.3) Table 1: Statistics of the proprietary medical corpus. Percentiles (P90, P95, P99) for duration and token counts (512-vocab) illustrate the long-form challenge across specialties. Data Scale Duration (seconds) Token Count (512-Vocab) Specialty...
-
[7]
Lattice-Based Segmentation: At every 500 encoder frame mark, we extract the corresponding audio chunk and aligned text as a segmented example. While this boundary-agnostic segmentation can result in subword units being split at the edges of a segment, the CTC objective remains mathematically sound as it optimizes for token-level sequences rather than wo...
-
[8]
with 0.001 learning rate. Stability: 0.1 dropout in the Conformer encoder during both pre-training and fine-tuning. Exponential Moving Average with a 0.9999 decay rate during fine-tuning. 2.3. Pseudo-Streaming Inference for Long-Form Stability. While models like Whisper [3] have advanced ASR significantly, they often exhibit instability when processing lon...
-
[9]
Experiments 3.1. Test Sets We evaluate the performance of MedASR using both a publicly available test set (EyeGaze [23, 24]), and held-out sets from our proprietary data (Section 2.1). The proprietary evaluation sets are carefully curated to be speaker-independent, featuring approximately 5% of the total unique speakers in the corpus with zero overlap bet...
-
[10]
Choose a small stride S (e.g. 320 ms) based on the latency and compute budget
-
[11]
Pad the start of the audio with W seconds of zero-valued samples, so that the first sliding window ends at the very start of the audio (rather than at time W). Figure 4 shows this approach exhibits no significant increase in WER in most test sets (see footnote 3) for MedASR (no LM), demonstrating that MedASR can be used with streaming inference without significantly increasing WER
-
[12]
Conclusion. We presented MedASR, an open-source ASR model optimized for long-form medical dictation. By utilizing Temporal Posterior Fusion within a pseudo-streaming framework, we successfully eliminated the “drift” and hallucination issues common in lar...
Footnote 3: Except Eye Gaze, which saw a 0.3% absolute increase in WER, apparently due to the padding at start.
-
[13]
T. D. Shanafelt, L. N. Dyrbye, C. Sinsky, O. Hasan, D. Satele, J. Sloan, and C. P. West, “Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction,” in Mayo Clinic Proceedings, vol. 91, no. 7. Elsevier, 2016, pp. 836–848
-
[14]
B. G. Arndt, J. W. Beasley, M. D. Watkinson, J. L. Temte, W.-J. Tuan, C. A. Sinsky, and V. J. Gilchrist, “Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations,” The Annals of Family Medicine, vol. 15, no. 5, pp. 419–426, 2017
-
[15]
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518
-
[16]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023
-
[17]
A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020
-
[18]
M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023
-
[19]
J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/jax-ml/jax
-
[20]
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210
-
[21]
W. Kang, X. Yang, Z. Yao, F. Kuang, Y. Yang, L. Guo, L. Lin, and D. Povey, “Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context,” 2023
-
[22]
T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71
-
[23]
E. Variani, T. Bagby, E. McDermott, and M. Bacchiani, “End-to-end training of acoustic models for large vocabulary continuous speech recognition with TensorFlow,” in Interspeech, 2017, pp. 1641–1645
-
[24]
A. Graves, “Connectionist temporal classification,” in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 61–93
-
[25]
A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012
-
[26]
E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autoregressive transducer (HAT),” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6139–6143
-
[27]
W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015
-
[28]
E. Variani, K. Wu, M. D. Riley, D. Rybach, M. Shannon, and C. Allauzen, “Global normalization for streaming speech recognition in a modular framework,” Advances in Neural Information Processing Systems, vol. 35, pp. 4257–4269, 2022
-
[29]
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024
-
[30]
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023
-
[31]
Z. Yao, W. Kang, X. Yang, F. Kuang, L. Guo, H. Zhu, Z. Jin, Z. Li, L. Lin, and D. Povey, “CR-CTC: Consistency regularization on CTC for improved speech recognition,” arXiv preprint arXiv:2410.05101, 2024
-
[32]
D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019
-
[33]
N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in International Conference on Machine Learning. PMLR, 2018, pp. 4596–4604
-
[34]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
-
[35]
A. Karargyris, S. Kashyap, I. Lourentzou, J. Wu, M. Tong, A. Sharma, S. Abedin, D. Beymer, V. Mukherjee, E. Krupinski, and M. Moradi, “Eye Gaze Data for Chest X-rays,” PhysioNet, Sep. 2020, version 1.0.0. [Online]. Available: https://doi.org/10.13026/qfdz-zr67
-
[36]
A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000