MedASR: An Open-Source Model for High-Accuracy Medical Dictation

Ehsan Variani; Ke Wu; Rory Pilgrim; Shashir Reddy; Tom Bagby

arxiv: 2605.16555 · v1 · pith:WDGDTGCOnew · submitted 2026-05-15 · 📡 eess.AS

Pith reviewed 2026-05-19 21:15 UTC · model grok-4.3

keywords medical ASR · speech recognition · open-source model · medical dictation · word error rate · clinical documentation · Whisper comparison

The pith

MedASR is a 105M-parameter open-source model that achieves a 58% relative WER reduction on medical dictation versus Whisper Large-v3.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedASR as a compact model engineered for high-accuracy medical dictation by tackling data scarcity in clinical corpora, efficient long-form training, and inference via a pseudo-streaming sliding-window method. It prioritizes a small, fast, and accurate design to make specialized transcription practical without relying on large proprietary systems. A sympathetic reader would care because this approach could improve clinical documentation speed and reduce errors in healthcare settings where accurate speech-to-text is essential. The reported evaluation on the Eye Gaze dataset supports the performance gains over general-purpose models.

Core claim

MedASR achieves a 58% relative WER reduction on the Eye Gaze dataset compared to Whisper Large-v3 through targeted solutions for clinical data imbalance, long-form modeling, and accurate pseudo-streaming inference in a 105M-parameter open-source model.

What carries the argument

The pseudo-streaming sliding-window inference approach that enables accurate transcription of long medical dictations while keeping the model small and fast.

If this is right

Open-source tools can deliver specialized medical transcription performance that rivals or exceeds larger general models.
Healthcare systems gain a transparent alternative to proprietary dictation software for clinical documentation.
Domain-specific fine-tuning on limited clinical data becomes sufficient for high-accuracy results when paired with efficient inference techniques.
Smaller models can be deployed in resource-constrained medical environments without sacrificing transcription quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-handling and sliding-window techniques might transfer to other high-stakes dictation domains such as legal or technical speech.
Wider adoption of the open-sourced model could enable community-driven adaptations for accents, dialects, or non-English medical terms.
If the performance holds on broader datasets, it suggests that targeted engineering can close the gap between general and domain-specific ASR without scaling model size.

Load-bearing premise

The Eye Gaze dataset and evaluation protocol provide a fair and representative measure of real-world medical dictation performance.

What would settle it

Evaluating MedASR on an independent medical dictation dataset collected under different recording conditions or specialties and checking whether the 58% relative WER reduction versus Whisper Large-v3 is replicated.
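A replication check of this kind comes down to scoring both systems against the same reference transcripts and comparing corpus-level WER. The sketch below shows one way to do that in plain Python. The function names and the toy transcripts are illustrative assumptions, not part of the paper, and a real evaluation would also need the text normalization rules used by the authors, which are not given here.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total word errors over total reference words for (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def relative_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction of the baseline's WER that the candidate removes."""
    return (baseline_wer - candidate_wer) / baseline_wer


# Toy transcripts, invented for illustration only (not from any dataset).
reference = "no acute cardiopulmonary process"
baseline_wer = corpus_wer([(reference, "no acute cardio pulmonary process")])
candidate_wer = corpus_wer([(reference, "no acute cardiopulmonary processes")])
print(baseline_wer, candidate_wer, relative_reduction(baseline_wer, candidate_wer))
```

Summing errors and reference words over the whole corpus, rather than averaging per-utterance WER, keeps long dictations from being under-weighted.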

Figures

Figures reproduced from arXiv: 2605.16555 by Ehsan Variani, Ke Wu, Rory Pilgrim, Shashir Reddy, Tom Bagby.

**Figure 1.** Temporal Fusion mechanism. Posterior logits $z_{t,k}$ from different windows are fused via weighted averaging. Accompanying text (truncated in extraction): "… of windows covering $t$ so far: $P_{\theta,a}(z_t \mid x) = \sum_{k=\min(K_T)}^{\min(K_T \cup \{a\})} \alpha_{t,k} P_{t,k}$, where $\alpha_{t,k}$ is the normalized weight for the $k$-th window covering frame $t$: $\alpha_{t,k} = w_{\mathrm{rel}(t,k)} \big/ \sum_{k' \in K_T} w_{\mathrm{rel}(t,k')}$, where $\mathrm{rel}(t,k) = t - B_k$ is the relative index of frame $t$ within window $k$. We obtain the first $P(y_t \mid x_{\le a})$ when $a = \min($…"
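The weighted averaging in this caption can be sketched in a few lines of plain Python. This is an offline simplification that fuses every window covering a frame at once, not the partial sum over windows seen "so far" that the excerpt describes. The names `hann` and `fuse_posteriors` are hypothetical, the input is assumed to be per-frame probability distributions $P_{t,k}$ keyed by window start frame $B_k$, and the half-frame offset in the Hann weights is an assumption made here so that no frame receives zero total weight.

```python
import math
from typing import Dict, List


def hann(n: int) -> List[float]:
    """Hann weights for n frames, offset by half a frame so every weight is positive (assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 0.5) / n) for i in range(n)]


def fuse_posteriors(windows: Dict[int, List[List[float]]], num_frames: int) -> List[List[float]]:
    """Fuse overlapping window posteriors frame by frame.

    windows maps a window's start frame B_k to its list of per-frame distributions.
    """
    vocab = len(next(iter(windows.values()))[0])
    fused = [[0.0] * vocab for _ in range(num_frames)]
    total = [0.0] * num_frames
    for start, posteriors in windows.items():
        weights = hann(len(posteriors))
        for rel, dist in enumerate(posteriors):  # rel plays the role of rel(t, k) = t - B_k
            t = start + rel
            if not 0 <= t < num_frames:
                continue
            total[t] += weights[rel]
            for v, p in enumerate(dist):
                fused[t][v] += weights[rel] * p
    # Normalize by the summed weights; frames covered by no window stay all-zero.
    return [[x / total[t] for x in row] if total[t] > 0 else row
            for t, row in enumerate(fused)]
```

With a tapered window, frames near a window's edge contribute little, so each fused frame is dominated by the windows in which it sits near the center.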

**Figure 2.** Offline MedASR (no LM) WER over different strides. [Plot: WER (%) versus stride (s); series: EyeGaze, RAD, IM, GENINT, FM.] Accompanying text (truncated in extraction): "3.2. MedASR as an Offline Recognizer. We compare MedASR (105M parameters) against two state-of-the-art foundational models: OpenAI Whisper (Large-v3) and Google Gemini 2.5 Pro. For MedASR, we obtain the fused CTC lattices from sliding windows of length 20s & stride…"
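For readers unfamiliar with CTC output, the simplest way to turn fused per-frame posteriors into a token sequence is greedy decoding: take the best label per frame, merge consecutive repeats, and drop blanks. The sketch below illustrates only that baseline rule. It is not the lattice-based decoding the excerpt mentions, and the function name and the choice of index 0 for the blank label are assumptions.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(posteriors: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks."""
    tokens: List[int] = []
    prev: Optional[int] = None
    for frame in posteriors:
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the label changes and is not the blank symbol.
        if best != prev and best != blank:
            tokens.append(best)
        prev = best
    return tokens
```

A blank frame between two identical labels is what lets CTC emit the same token twice in a row.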

**Figure 4.** Streaming MedASR (no LM) WER over stride sizes. [Plot: WER (%) versus stride (s); series: EyeGaze, RAD, IM, GENINT, FM.] Accompanying text (truncated in extraction): "3.4. MedASR as a Streaming Recognizer. For interactive use cases requiring low latency, MedASR can be used as a streaming recognizer with two simple changes to inference: 1. Choose a small stride S (e.g. 320ms) based on the latency and compute budget; 2. Pad the start of the audio wi…"
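The two inference changes quoted here (a small stride, and W seconds of zero padding at the start so the first window ends where the audio begins) can be sketched as a window generator. The function name, the list-of-floats audio representation, and all numeric values in the usage note are hypothetical. How the model consumes each window is outside this sketch.

```python
from typing import Iterator, List, Tuple


def streaming_windows(audio: List[float], sample_rate: int,
                      window_s: float, stride_s: float) -> Iterator[Tuple[float, List[float]]]:
    """Yield (end time in seconds relative to the original audio, window samples)."""
    win = round(window_s * sample_rate)
    hop = round(stride_s * sample_rate)
    # Left-pad with one full window of zeros so the first window ends at time 0.
    padded = [0.0] * win + audio
    for start in range(0, len(padded) - win + 1, hop):
        # A window starting at `start` in the padded signal ends at `start` in the original.
        yield start / sample_rate, padded[start:start + win]
```

For example, `list(streaming_windows([1.0] * 1000, 1000, 0.5, 0.25))` produces five windows whose end times advance by one stride each, starting from an all-zero window that ends at time 0. A new fused output can then be emitted once per stride instead of waiting for a full window of real audio.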

Original abstract

We present MedASR, an open-source 105M-parameter model engineered for high-accuracy medical dictation. Prioritizing a "small, fast, and accurate" design, MedASR addresses 3 core pillars (1) Data: overcoming clinical corpora scarcity and class imbalance; (2) Modeling: efficient long-form training; and (3) Inference: accurate transcription via a pseudo-streaming sliding-window approach. Our evaluation shows that MedASR achieves a 58% relative WER reduction on Eye Gaze compared to Whisper Large-v3. By open-sourcing MedASR, we provide a transparent, high-performance backbone for specialized health-care applications, breaking down the barriers to clinical documentation often obscured by proprietary systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MedASR is a practical open-source 105M medical ASR model with a headline 58% relative WER claim that still needs absolute numbers and matched baseline details to hold up.

The main thing to know about MedASR is that it delivers an open-source 105M-parameter model aimed at medical dictation, with a reported 58% relative WER reduction versus Whisper Large-v3 on the Eye Gaze set. The authors emphasize three practical pillars: handling scarce and imbalanced clinical data, efficient long-form training, and a pseudo-streaming sliding-window inference method. That framing is clear and directly addresses real constraints in healthcare transcription rather than chasing new architectures.

Referee Report

1 major / 1 minor

Summary. The manuscript presents MedASR, an open-source 105M-parameter ASR model for medical dictation. It identifies three pillars—data curation to address clinical corpus scarcity and imbalance, modeling for efficient long-form training, and pseudo-streaming sliding-window inference—and reports a 58% relative WER reduction on the Eye Gaze dataset relative to Whisper Large-v3.

Significance. If the performance claims can be substantiated with complete evaluation details, the work would supply a compact, open-source backbone that lowers barriers to specialized clinical documentation. The emphasis on open-sourcing and the focus on practical inference constraints are positive contributions, but the current evidence base is too thin to evaluate whether the reported gains generalize or arise from the claimed architectural and data choices.

major comments (1)

[Abstract] Abstract: The central claim of a 58% relative WER reduction on Eye Gaze is presented without absolute WER numbers for MedASR or Whisper Large-v3, without test-set size or speaker statistics, and without explicit confirmation that the same audio files, segmentation, and medical-vocabulary handling were used for both systems. These omissions make it impossible to determine whether the reported gain reflects the three pillars or an evaluation mismatch.
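The concern about missing absolute numbers follows directly from how a relative reduction is defined. A short derivation, with symbols chosen here for illustration rather than taken from the paper:

```latex
\[
R \;=\; \frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{base}}}
\qquad\Longrightarrow\qquad
\mathrm{WER}_{\mathrm{new}} \;=\; (1 - R)\,\mathrm{WER}_{\mathrm{base}}
\;=\; 0.42\,\mathrm{WER}_{\mathrm{base}}
\quad\text{for } R = 0.58 .
\]
```

The headline figure therefore fixes only the ratio of the two error rates. A purely hypothetical baseline of 10% WER would imply 4.2% for MedASR, while a hypothetical baseline of 30% would imply 12.6%; the abstract alone does not distinguish between such cases.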

minor comments (1)

The manuscript should supply a reference or brief description of the Eye Gaze dataset and state whether it is publicly available.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on ensuring that performance claims are presented with sufficient context and transparency. We address the major comment below and will revise the manuscript accordingly to strengthen the presentation of our evaluation results.

Referee: [Abstract] Abstract: The central claim of a 58% relative WER reduction on Eye Gaze is presented without absolute WER numbers for MedASR or Whisper Large-v3, without test-set size or speaker statistics, and without explicit confirmation that the same audio files, segmentation, and medical-vocabulary handling were used for both systems. These omissions make it impossible to determine whether the reported gain reflects the three pillars or an evaluation mismatch.

Authors: We agree that absolute WER values and supporting evaluation details are necessary for readers to properly interpret the relative improvement. In the revised manuscript, we will update the abstract to report the absolute WER for MedASR and for Whisper Large-v3 on the Eye Gaze dataset. We will also include concise information on test-set size (number of utterances and total duration), speaker statistics, and an explicit statement that both systems were evaluated on identical audio files with the same segmentation and medical-vocabulary handling. These details already appear in the Experiments section; we will summarize them in the abstract to eliminate any ambiguity about the fairness of the comparison. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model presentation

The paper describes an empirical ASR model (MedASR) built on three pillars of data handling, modeling, and inference, with a central performance claim consisting of a relative WER reduction measured against an external baseline (Whisper Large-v3) on the Eye Gaze dataset. No equations, derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The result is an external empirical comparison rather than any chain that reduces to the paper's own inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; all details on data handling, training, and inference are absent.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: MedASR utilizes a 105M-parameter Conformer-L backbone … CTC objective … Temporal Posterior Fusion … sliding window of fixed length W and stride S … Hann window as w

IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We demonstrate that MedASR achieves superior performance … 58% WER reduction on Eye Gaze

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1]

monolithic

Introduction. The administrative burden of clinical documentation is a primary driver of physician burnout, creating an urgent need for robust Automated Speech Recognition (ASR) systems [1, 2]. While general-purpose foundation models have achieved remarkable versatility [3, 4], they often lack the domain-specific grounding and structural awareness requ...
[2]

The MedASR Foundation. MedASR is built on a 105M-parameter Conformer architecture
[3]

MedASR: An Open-Source Model for High-Accuracy Medical Dictation

and trained using a JAX-based [7] framework. To address the development hurdles outlined previously, we implemented the following strategies: 2.1. Data Scarcity, Acoustic Imbalance, and Formatting. The primary bottleneck in medical ASR is the acquisition of large-scale, high-fidelity audio that is both clinically relevant and acoustically diverse. We ident...
[4]

• Conformer Encoder: 17 layers, 512 activations, and 8 attention heads

reduce the encoder frame rate to 25Hz. • Conformer Encoder: 17 layers, 512 activations, and 8 attention heads. Refinements: We deviate from the original implementation by using Rotary Positional Embeddings (RoPE) [17], and removing biases in all layer normalization and dense layers to improve stability [18]. Consistency Regularization: To enhance robust...
[5]

Bootstrapping: A seed model was trained exclusively on a subset of the data containing naturally occurring short sequences (up to 36s)

[6]

Percentiles (P90, P95, P99) for duration and token counts (512-vocab) illustrate the long-form challenge across specialties

Forced Alignment: This seed model was utilized to perform forced alignment on a fused CTC lattice (Section 2.3). Table 1: Statistics of the proprietary medical corpus. Percentiles (P90, P95, P99) for duration and token counts (512-vocab) illustrate the long-form challenge across specialties. Data Scale Duration (seconds) Token Count (512-Vocab) Specialty...
[7]

Lattice-Based Segmentation: At every 500 encoder frame mark, we extract the corresponding audio chunk and aligned text as a segmented example. While this boundary-agnostic segmentation can result in subword units being split at the edges of a segment, the CTC objective remains mathematically sound as it optimizes for token-level sequences rather than wo...
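Combined with the 25Hz encoder frame rate quoted in the excerpt under [4], a cut every 500 encoder frames corresponds to 500 / 25 = 20 seconds of audio per segment. A minimal sketch of such fixed-interval segmentation follows. The `(token, frame index)` alignment format and the function name are assumptions made for illustration, since the excerpt does not describe how the forced alignment is represented.

```python
from typing import List, Tuple

FRAME_RATE_HZ = 25        # encoder frame rate quoted in the paper excerpt
FRAMES_PER_SEGMENT = 500  # segmentation interval quoted in the paper excerpt


def segment_alignment(aligned: List[Tuple[str, int]],
                      num_frames: int) -> List[Tuple[float, float, List[str]]]:
    """Cut an aligned token stream every FRAMES_PER_SEGMENT encoder frames.

    aligned holds (token, encoder frame index) pairs, a hypothetical format.
    Returns (start seconds, end seconds, tokens) for each segment.
    """
    segments = []
    for start in range(0, num_frames, FRAMES_PER_SEGMENT):
        end = min(start + FRAMES_PER_SEGMENT, num_frames)
        tokens = [tok for tok, frame in aligned if start <= frame < end]
        segments.append((start / FRAME_RATE_HZ, end / FRAME_RATE_HZ, tokens))
    return segments


# Toy alignment: the word pieces "▁find" + "ings" straddle the 500-frame mark.
example = segment_alignment([("▁no", 10), ("▁find", 499), ("ings", 500)], 1200)
```

In the toy alignment the two pieces of one word land in different segments, which is the edge effect the excerpt describes as acceptable under a token-level CTC objective.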
[8]

hallucination loops

with 0.001 learning rate. Stability: 0.1 dropout in the Conformer encoder during both pre-training and fine-tuning. Exponential Moving Average with a 0.9999 decay rate during fine-tuning. 2.3. Pseudo-Streaming Inference for Long-Form Stability. While models like Whisper [3] have advanced ASR significantly, they often exhibit instability when processing lon...
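The exponential moving average mentioned in this excerpt is conventionally applied to model weights after each optimizer step. The sketch below shows the standard update rule with the quoted decay of 0.9999. The dictionary-of-floats parameter representation and the function name are illustrative assumptions, and the excerpt does not say whether the authors add refinements such as bias correction or warm-up.

```python
from typing import Dict


def ema_update(ema: Dict[str, float], params: Dict[str, float],
               decay: float = 0.9999) -> Dict[str, float]:
    """One EMA step: keep `decay` of the running average, blend in the current weights."""
    return {name: decay * ema[name] + (1.0 - decay) * params[name] for name in params}
```

A decay of 0.9999 gives an averaging horizon of roughly 1 / (1 - 0.9999) = 10,000 steps, so the averaged weights respond slowly to individual updates.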
[9]

uh”, “oh

Experiments. 3.1. Test Sets. We evaluate the performance of MedASR using both a publicly available test set (EyeGaze [23, 24]), and held-out sets from our proprietary data (Section 2.1). The proprietary evaluation sets are carefully curated to be speaker-independent, featuring approximately 5% of the total unique speakers in the corpus with zero overlap bet...
[10]

320ms) based on the latency and compute budget

Choose a small stride S (e.g. 320ms) based on the latency and compute budget
[11]

Pad the start of the audio with W seconds of zero-valued samples, so that the first sliding window ends at the very start of the audio (rather than time W). Figure 4 shows this approach exhibits no significant increase in WER in most test sets (footnote 3) for MedASR (no LM), demonstrating that MedASR can be used with streaming inference without significantly decreasing WER
[12]

Conclusion. We presented MedASR, an open-source ASR model optimized for long-form medical dictation. By utilizing Temporal Posterior Fusion within a pseudo-streaming framework, we successfully eliminated the “drift” and hallucination issues common in [footnote 3: Except Eye Gaze, which saw a 0.3% absolute increase in WER apparently due to the padding at start.] lar...
[13]

Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction

T. D. Shanafelt, L. N. Dyrbye, C. Sinsky, O. Hasan, D. Satele, J. Sloan, and C. P. West, “Relationship between clerical burden and characteristics of the electronic environment with physician burnout and professional satisfaction,” in Mayo clinic proceedings, vol. 91, no. 7. Elsevier, 2016, pp. 836–848
[14]

Tethered to the ehr: primary care physician workload assessment using ehr event log data and time-motion observations

B. G. Arndt, J. W. Beasley, M. D. Watkinson, J. L. Temte, W.-J. Tuan, C. A. Sinsky, and V. J. Gilchrist, “Tethered to the ehr: primary care physician workload assessment using ehr event log data and time-motion observations,” The Annals of Family Medicine, vol. 15, no. 5, pp. 419–426, 2017
[15]

Robust speech recognition via large-scale weak supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518
[16]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023
[17]

Conformer: Convolution-augmented transformer for speech recognition

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020
[18]

WhisperX: Time-accurate speech transcription of long-form audio

M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023
[19]

JAX: composable transformations of Python+NumPy programs

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/jax-ml/jax
[20]

Librispeech: an asr corpus based on public domain audio books

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210
[21]

Libriheavy: a 50,000 hours asr corpus with punctuation casing and context

W. Kang, X. Yang, Z. Yao, F. Kuang, Y. Yang, L. Guo, L. Lin, and D. Povey, “Libriheavy: a 50,000 hours asr corpus with punctuation casing and context,” 2023
[22]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations, 2018, pp. 66–71
[23]

End-to-end training of acoustic models for large vocabulary continuous speech recognition with tensorflow

E. Variani, T. Bagby, E. McDermott, and M. Bacchiani, “End-to-end training of acoustic models for large vocabulary continuous speech recognition with tensorflow,” in Interspeech, 2017, pp. 1641–1645
[24]

Connectionist temporal classification

A. Graves, “Connectionist temporal classification,” in Supervised sequence labelling with recurrent neural networks. Springer, 2012, pp. 61–93
[25]

Sequence Transduction with Recurrent Neural Networks

——, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012

[26]

Hybrid autoregressive transducer (hat)

E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autoregressive transducer (hat),” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6139–6143
[27]

Listen, Attend and Spell

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015
[28]

Global normalization for streaming speech recognition in a modular framework

E. Variani, K. Wu, M. D. Riley, D. Rybach, M. Shannon, and C. Allauzen, “Global normalization for streaming speech recognition in a modular framework,” Advances in Neural Information Processing Systems, vol. 35, pp. 4257–4269, 2022
[29]

Roformer: Enhanced transformer with rotary position embedding

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024
[30]

Palm: Scaling language modeling with pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of machine learning research, vol. 24, no. 240, pp. 1–113, 2023
[31]

Cr-ctc: Consistency regularization on ctc for improved speech recognition

Z. Yao, W. Kang, X. Yang, F. Kuang, L. Guo, H. Zhu, Z. Jin, Z. Li, L. Lin, and D. Povey, “Cr-ctc: Consistency regularization on ctc for improved speech recognition,” arXiv preprint arXiv:2410.05101, 2024
[32]

Specaugment: A simple data augmentation method for automatic speech recognition

D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019
[33]

Adafactor: Adaptive learning rates with sublinear memory cost

N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in International conference on machine learning. PMLR, 2018, pp. 4596–4604
[34]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
[35]

Eye Gaze Data for Chest X-rays

A. Karargyris, S. Kashyap, I. Lourentzou, J. Wu, M. Tong, A. Sharma, S. Abedin, D. Beymer, V. Mukherjee, E. Krupinski, and M. Moradi, “Eye Gaze Data for Chest X-rays,” PhysioNet, Sep. 2020, version 1.0.0. [Online]. Available: https://doi.org/10.13026/qfdz-zr67
[36]

Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals

A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000
