arxiv: 2605.16555 · v1 · pith:WDGDTGCOnew · submitted 2026-05-15 · 📡 eess.AS

MedASR: An Open-Source Model for High-Accuracy Medical Dictation

Pith reviewed 2026-05-19 21:15 UTC · model grok-4.3

classification 📡 eess.AS
keywords medical ASR · speech recognition · open-source model · medical dictation · word error rate · clinical documentation · Whisper comparison

The pith

MedASR is a 105M-parameter open-source model that achieves a 58% relative WER reduction on medical dictation versus Whisper Large-v3.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedASR as a compact model engineered for high-accuracy medical dictation by tackling data scarcity in clinical corpora, efficient long-form training, and inference via a pseudo-streaming sliding-window method. It prioritizes a small, fast, and accurate design to make specialized transcription practical without relying on large proprietary systems. A sympathetic reader would care because this approach could improve clinical documentation speed and reduce errors in healthcare settings where accurate speech-to-text is essential. The reported evaluation on the Eye Gaze dataset supports the performance gains over general-purpose models.

Core claim

MedASR achieves a 58% relative WER reduction on the Eye Gaze dataset compared to Whisper Large-v3 through targeted solutions for clinical data imbalance, long-form modeling, and accurate pseudo-streaming inference in a 105M-parameter open-source model.
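
For readers unfamiliar with the metric, a relative WER reduction is the gap between the baseline WER and the model WER, divided by the baseline WER. The sketch below illustrates the arithmetic only. The function name is an assumption made for illustration, and the absolute WER values are hypothetical because this page does not report them.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Fraction by which model_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper: a baseline at 10.0% WER and
# a model at 4.2% WER would correspond to a 58% relative reduction.
example = relative_wer_reduction(10.0, 4.2)
print(f"{example:.0%}")
```

The same relative figure can arise from very different absolute error rates, which is why the referee report further down asks for the absolute numbers.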

What carries the argument

The pseudo-streaming sliding-window inference approach that enables accurate transcription of long medical dictations while keeping the model small and fast.
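
As a rough illustration of what sliding-window inference involves, the sketch below lists the window boundaries for a long recording. The paper excerpt reproduced on this page mentions 20 s windows, but the stride value is cut off there, so the stride is left as a caller-supplied argument. The function name and the handling of the final, shorter window are assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def window_bounds(
    duration_s: float, window_s: float, stride_s: float
) -> List[Tuple[float, float]]:
    """Return (start, end) times of sliding windows covering the audio."""
    if window_s <= 0 or stride_s <= 0:
        raise ValueError("window_s and stride_s must be positive")
    bounds = []
    start = 0.0
    while True:
        # Assumption: the last window is clipped to the end of the audio.
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start += stride_s
    return bounds


# A 50 s dictation with 20 s windows and a hypothetical 10 s stride.
for start, end in window_bounds(50.0, 20.0, 10.0):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

With a stride smaller than the window length, neighbouring windows overlap, so most frames are scored by more than one window. That overlap is what makes a fusion step necessary.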

If this is right

  • Open-source tools can deliver specialized medical transcription performance that rivals or exceeds larger general models.
  • Healthcare systems gain a transparent alternative to proprietary dictation software for clinical documentation.
  • Domain-specific fine-tuning on limited clinical data becomes sufficient for high-accuracy results when paired with efficient inference techniques.
  • Smaller models can be deployed in resource-constrained medical environments without sacrificing transcription quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-handling and sliding-window techniques might transfer to other high-stakes dictation domains such as legal or technical speech.
  • Wider adoption of the open-sourced model could enable community-driven adaptations for accents, dialects, or non-English medical terms.
  • If the performance holds on broader datasets, it suggests that targeted engineering can close the gap between general and domain-specific ASR without scaling model size.

Load-bearing premise

The Eye Gaze dataset and evaluation protocol provide a fair and representative measure of real-world medical dictation performance.

What would settle it

Evaluating MedASR on an independent medical dictation dataset collected under different recording conditions or specialties and checking whether the 58% relative WER reduction versus Whisper Large-v3 is replicated.

Figures

Figures reproduced from arXiv: 2605.16555 by Ehsan Variani, Ke Wu, Rory Pilgrim, Shashir Reddy, Tom Bagby.

Figure 1
Figure 1: Temporal Fusion mechanism. Posterior logits $z_{t,k}$ from different windows are fused via weighted averaging.
Adjacent text: "… of windows covering $t$ so far: $P_{\theta,a}(z_t \mid x) = \sum_{k=\min(K_T)}^{\min(K_T \cup \{a\})} \alpha_{t,k} P_{t,k}$, where $\alpha_{t,k}$ is the normalized weight for the $k$-th window covering frame $t$: $\alpha_{t,k} = w_{\mathrm{rel}(t,k)} / \sum_{k' \in K_T} w_{\mathrm{rel}(t,k')}$, where $\mathrm{rel}(t,k) = t - B_k$ is the relative index of frame $t$ within window $k$. We obtain the first $P(y_t \mid x_{\le a})$ when $a = \min($…"
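
The weighted averaging named in the Figure 1 caption can be sketched as follows: every window contributes its posterior for a frame, weighted by the frame's relative position inside that window, and the weights are normalized per frame. This is a minimal offline sketch that fuses all windows at once. The function name, the data layout, and the weight values are assumptions for illustration, since the page does not give the paper's weight profile.

```python
from typing import List, Sequence


def fuse_posteriors(
    windows: Sequence[Sequence[Sequence[float]]],
    starts: Sequence[int],
    weights: Sequence[float],
    num_frames: int,
) -> List[List[float]]:
    """Weighted average of per-window posteriors for every frame.

    windows[k][i] is the posterior vector that window k assigns to its i-th
    frame; starts[k] is B_k, the absolute index of that window's first frame;
    weights[i] is the weight of relative position i within a window.
    """
    vocab = len(windows[0][0])
    fused = [[0.0] * vocab for _ in range(num_frames)]
    totals = [0.0] * num_frames
    for posteriors, start in zip(windows, starts):
        for rel, vector in enumerate(posteriors):
            t = start + rel  # rel(t, k) = t - B_k
            if 0 <= t < num_frames:
                totals[t] += weights[rel]
                for v, p in enumerate(vector):
                    fused[t][v] += weights[rel] * p
    # Dividing by the summed weights yields the normalized alpha_{t,k}.
    for t in range(num_frames):
        if totals[t] > 0:
            fused[t] = [value / totals[t] for value in fused[t]]
    return fused


# Two windows of two frames each, offset by one frame, two-symbol vocabulary.
toy_windows = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
print(fuse_posteriors(toy_windows, starts=[0, 1], weights=[1.0, 3.0], num_frames=3))
```

In the toy call, only the middle frame is covered by both windows, so it is the only frame whose fused posterior mixes the two predictions.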
Figure 2
Figure 2: Offline MedASR (no LM) WER over different strides. Axes: stride (s), 2.5 to 20.0, against WER (%), 6 to 11. Series: EyeGaze, RAD, IM, GENINT, FM.
Adjacent text (Section 3.2, MedASR as an Offline Recognizer): "We compare MedASR (105M parameters) against two state-of-the-art foundational models: OpenAI Whisper (Large-v3) and Google Gemini 2.5 Pro. For MedASR, we obtain the fused CTC lattices from sliding windows of length 20s & stride…"
Figure 4
Figure 4: Streaming MedASR (no LM) WER over stride sizes. Axes: stride (s), 0.0 to 1.0, against WER (%), 6 to 10. Series: EyeGaze, RAD, IM, GENINT, FM.
Adjacent text (Section 3.4, MedASR as a Streaming Recognizer): "For interactive use cases requiring low latency, MedASR can be used as a streaming recognizer with two simple changes to inference: 1. Choose a small stride S (e.g. 320ms) based on the latency and compute budget; 2. Pad the start of the audio wi…"

Original abstract
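
The two streaming changes quoted above can be sketched in a few lines. According to the excerpt reproduced further down this page, the padding consists of W seconds of zero-valued samples, so that the first window ends at the very start of the audio. The helper names, the list-based audio representation, and the choice to stop at the last window that fits inside the recording are all assumptions made for illustration.

```python
from typing import List


def pad_for_streaming(
    samples: List[float], sample_rate: int, window_s: float
) -> List[float]:
    """Prepend window_s seconds of zero-valued samples to the audio."""
    pad = int(round(window_s * sample_rate))
    return [0.0] * pad + list(samples)


def window_end_times(duration_s: float, stride_s: float) -> List[float]:
    """End time of each window, measured on the unpadded audio's clock.

    Because of the padding, window 0 ends at time 0 and each later window
    ends one stride after the previous one.
    """
    count = int(duration_s // stride_s) + 1
    return [k * stride_s for k in range(count)]


# Toy audio at a hypothetical 4 Hz sample rate with a 1 s window.
print(pad_for_streaming([0.5, -0.5], sample_rate=4, window_s=1.0))
print(window_end_times(duration_s=1.0, stride_s=0.25))
```

A smaller stride yields more frequent window ends, and therefore lower latency, at the cost of running the model more often.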

We present MedASR, an open-source 105M-parameter model engineered for high-accuracy medical dictation. Prioritizing a "small, fast, and accurate" design, MedASR addresses 3 core pillars: (1) Data: overcoming clinical corpora scarcity and class imbalance; (2) Modeling: efficient long-form training; and (3) Inference: accurate transcription via a pseudo-streaming sliding-window approach. Our evaluation shows that MedASR achieves a 58% relative WER reduction on Eye Gaze compared to Whisper Large-v3. By open-sourcing MedASR, we provide a transparent, high-performance backbone for specialized health-care applications, breaking down the barriers to clinical documentation often obscured by proprietary systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents MedASR, an open-source 105M-parameter ASR model for medical dictation. It identifies three pillars—data curation to address clinical corpus scarcity and imbalance, modeling for efficient long-form training, and pseudo-streaming sliding-window inference—and reports a 58% relative WER reduction on the Eye Gaze dataset relative to Whisper Large-v3.

Significance. If the performance claims can be substantiated with complete evaluation details, the work would supply a compact, open-source backbone that lowers barriers to specialized clinical documentation. The emphasis on open-sourcing and the focus on practical inference constraints are positive contributions, but the current evidence base is too thin to evaluate whether the reported gains generalize or arise from the claimed architectural and data choices.

major comments (1)
  1. [Abstract] The central claim of a 58% relative WER reduction on Eye Gaze is presented without absolute WER numbers for MedASR or Whisper Large-v3, without test-set size or speaker statistics, and without explicit confirmation that the same audio files, segmentation, and medical-vocabulary handling were used for both systems. These omissions make it impossible to determine whether the reported gain reflects the three pillars or an evaluation mismatch.
minor comments (1)
  1. The manuscript should supply a reference or brief description of the Eye Gaze dataset and state whether it is publicly available.
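
To make the major comment concrete, the sketch below computes an absolute WER as the word-level edit distance divided by the number of reference words. It assumes plain whitespace tokenization and no text normalization; the paper's actual scoring protocol is not described on this page. The example sentences are invented.

```python
from typing import Sequence


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Minimum number of substitutions, insertions and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate under whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: splitting one medical term costs two errors.
print(wer("no acute cardiopulmonary process", "no acute cardio pulmonary process"))
```

The example shows why vocabulary handling matters for a fair comparison: a single split compound counts as one substitution plus one insertion, so two systems scored under different normalization rules are not directly comparable.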

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on ensuring that performance claims are presented with sufficient context and transparency. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our evaluation results.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a 58% relative WER reduction on Eye Gaze is presented without absolute WER numbers for MedASR or Whisper Large-v3, without test-set size or speaker statistics, and without explicit confirmation that the same audio files, segmentation, and medical-vocabulary handling were used for both systems. These omissions make it impossible to determine whether the reported gain reflects the three pillars or an evaluation mismatch.

    Authors: We agree that absolute WER values and supporting evaluation details are necessary for readers to properly interpret the relative improvement. In the revised manuscript, we will update the abstract to report the absolute WER for MedASR and for Whisper Large-v3 on the Eye Gaze dataset. We will also include concise information on test-set size (number of utterances and total duration), speaker statistics, and an explicit statement that both systems were evaluated on identical audio files with the same segmentation and medical-vocabulary handling. These details already appear in the Experiments section; we will summarize them in the abstract to eliminate any ambiguity about the fairness of the comparison.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model presentation

The paper describes an empirical ASR model (MedASR) built on three pillars of data handling, modeling, and inference, with a central performance claim consisting of a relative WER reduction measured against an external baseline (Whisper Large-v3) on the Eye Gaze dataset. No equations, derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The result is an external empirical comparison rather than any chain that reduces to the paper's own inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; all details on data handling, training, and inference are absent.

pith-pipeline@v0.9.0 · 5656 in / 1002 out tokens · 37531 ms · 2026-05-19T21:15:05.626716+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

  1. [1]

    monolithic

    Introduction. The administrative burden of clinical documentation is a primary driver of physician burnout, creating an urgent need for robust Automated Speech Recognition (ASR) systems [1, 2]. While general-purpose foundation models have achieved remarkable versatility [3, 4], they often lack the domain-specific grounding and structural awareness requ...

  2. [2]

    The MedASR Foundation. MedASR is built on a 105M-parameter Conformer architecture

  3. [3]

    MedASR: An Open-Source Model for High-Accuracy Medical Dictation

    and trained using a JAX-based [7] framework. To address the development hurdles outlined previously, we implemented the following strategies: 2.1. Data Scarcity, Acoustic Imbalance, and Formatting. The primary bottleneck in medical ASR is the acquisition of large-scale, high-fidelity audio that is both clinically relevant and acoustically diverse. We ident...

  4. [4]

    • Conformer Encoder: 17 layers, 512 activations, and 8 attention heads

    reduce the encoder frame rate to 25Hz. • Conformer Encoder: 17 layers, 512 activations, and 8 attention heads. Refinements: We deviate from the original implementation by using Rotary Positional Embeddings (RoPE) [17], and removing biases in all layer normalization and dense layers to improve stability [18]. Consistency Regularization: To enhance robust...

  5. [5]

    Bootstrapping: A seed model was trained exclusively on a subset of the data containing naturally occurring short sequences (up to 36s)

  6. [6]

    Percentiles (P90, P95, P99) for duration and token counts (512-vocab) illustrate the long-form challenge across specialties

    Forced Alignment: This seed model was utilized to perform forced alignment on a fused CTC lattice (Section 2.3). Table 1: Statistics of the proprietary medical corpus. Percentiles (P90, P95, P99) for duration and token counts (512-vocab) illustrate the long-form challenge across specialties. Table columns: Data Scale; Duration (seconds); Token Count (512-Vocab); Specialty...

  7. [7]

    Lattice-Based Segmentation: At every 500 encoder frame mark, we extract the corresponding audio chunk and aligned text as a segmented example. While this boundary-agnostic segmentation can result in subword units being split at the edges of a segment, the CTC objective remains mathematically sound as it optimizes for token-level sequences rather than wo...

  8. [8]

    hallucination loops

    with 0.001 learning rate. Stability: 0.1 dropout in the Conformer encoder during both pre-training and fine-tuning. Exponential Moving Average with a 0.9999 decay rate during fine-tuning. 2.3. Pseudo-Streaming Inference for Long-Form Stability. While models like Whisper [3] have advanced ASR significantly, they often exhibit instability when processing lon...

  9. [9]

    uh”, “oh

    Experiments. 3.1. Test Sets. We evaluate the performance of MedASR using both a publicly available test set (EyeGaze [23, 24]), and held-out sets from our proprietary data (Section 2.1). The proprietary evaluation sets are carefully curated to be speaker-independent, featuring approximately 5% of the total unique speakers in the corpus with zero overlap bet...

  10. [10]

    320ms) based on the latency and compute budget

    Choose a small stride S (e.g. 320ms) based on the latency and compute budget

  11. [11]

    Pad the start of the audio with W seconds of zero-valued samples, so that the first sliding window ends at the very start of the audio (rather than time W). Figure 4 shows this approach exhibits no significant increase in WER in most test sets [footnote 3] for MedASR (no LM), demonstrating that MedASR can be used with streaming inference without significantly decreasing WER

  12. [12]

    Conclusion. We presented MedASR, an open-source ASR model optimized for long-form medical dictation. By utilizing Temporal Posterior Fusion within a pseudo-streaming framework, we successfully eliminated the “drift” and hallucination issues common in lar... [Footnote 3: Except Eye Gaze, which saw a 0.3% absolute increase in WER apparently due to the padding at start.]

  13. [13]

    Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction,

    T. D. Shanafelt, L. N. Dyrbye, C. Sinsky, O. Hasan, D. Satele, J. Sloan, and C. P. West, “Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction,” in Mayo Clinic Proceedings, vol. 91, no. 7. Elsevier, 2016, pp. 836–848

  14. [14]

    Tethered to the ehr: primary care physician workload assessment using ehr event log data and time-motion observations,

    B. G. Arndt, J. W. Beasley, M. D. Watkinson, J. L. Temte, W.-J. Tuan, C. A. Sinsky, and V. J. Gilchrist, “Tethered to the ehr: primary care physician workload assessment using ehr event log data and time-motion observations,” The Annals of Family Medicine, vol. 15, no. 5, pp. 419–426, 2017

  15. [15]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518

  16. [16]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  17. [17]

    Conformer: Convolution-augmented transformer for speech recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020

  18. [18]

    WhisperX: Time-accurate speech transcription of long-form audio,

    M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023

  19. [19]

    JAX: composable transformations of Python+NumPy programs,

    J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/jax-ml/jax

  20. [20]

    Librispeech: an asr corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

  21. [21]

    Libriheavy: a 50,000 hours asr corpus with punctuation casing and context,

    W. Kang, X. Yang, Z. Yao, F. Kuang, Y. Yang, L. Guo, L. Lin, and D. Povey, “Libriheavy: a 50,000 hours asr corpus with punctuation casing and context,” 2023

  22. [22]

    Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,

    T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71

  23. [23]

    End-to-end training of acoustic models for large vocabulary continuous speech recognition with tensorflow,

    E. Variani, T. Bagby, E. McDermott, and M. Bacchiani, “End-to-end training of acoustic models for large vocabulary continuous speech recognition with tensorflow,” in Interspeech, 2017, pp. 1641–1645

  24. [24]

    Connectionist temporal classification,

    A. Graves, “Connectionist temporal classification,” in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 61–93

  25. [25]

    Sequence Transduction with Recurrent Neural Networks

    A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012

  26. [26]

    Hybrid autoregressive transducer (hat),

    E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autoregressive transducer (hat),” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6139–6143

  27. [27]

    Listen, Attend and Spell

    W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015

  28. [28]

    Global normalization for streaming speech recognition in a modular framework,

    E. Variani, K. Wu, M. D. Riley, D. Rybach, M. Shannon, and C. Allauzen, “Global normalization for streaming speech recognition in a modular framework,” Advances in Neural Information Processing Systems, vol. 35, pp. 4257–4269, 2022

  29. [29]

    Roformer: Enhanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024

  30. [30]

    Palm: Scaling language modeling with pathways,

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023

  31. [31]

    Cr-ctc: Consistency regularization on ctc for improved speech recognition,

    Z. Yao, W. Kang, X. Yang, F. Kuang, L. Guo, H. Zhu, Z. Jin, Z. Li, L. Lin, and D. Povey, “Cr-ctc: Consistency regularization on ctc for improved speech recognition,” arXiv preprint arXiv:2410.05101, 2024

  32. [32]

    Specaugment: A simple data augmentation method for automatic speech recognition,

    D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019

  33. [33]

    Adafactor: Adaptive learning rates with sublinear memory cost,

    N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in International Conference on Machine Learning. PMLR, 2018, pp. 4596–4604

  34. [34]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

  35. [35]

    Eye Gaze Data for Chest X-rays,

    A. Karargyris, S. Kashyap, I. Lourentzou, J. Wu, M. Tong, A. Sharma, S. Abedin, D. Beymer, V. Mukherjee, E. Krupinski, and M. Moradi, “Eye Gaze Data for Chest X-rays,” PhysioNet, Sep. 2020, version 1.0.0. [Online]. Available: https://doi.org/10.13026/qfdz-zr67

  36. [36]

    Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,

    A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000