Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Ambre Marie (LaTIM); Guillaume Dardenne (LaTIM); Gwenolé Quellec (LaTIM); Thomas Bertin (DySoLab)

arxiv: 2603.00086 · v2 · pith:7YDD7VVSnew · submitted 2026-02-16 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

Pith reviewed 2026-05-21 12:17 UTC · model grok-4.3

keywords French clinical transcription · speaker diarization · LLM post-processing · word diarization error rate · medical speech recognition · multi-pass architecture

The pith

A multi-pass LLM architecture that alternates speaker and word recognition passes lowers WDER on French suicide-prevention calls and leaves it stable on awake neurosurgery consultations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether iterative LLM post-processing can fix high error rates in automatic transcription of spontaneous French medical speech. It alternates dedicated passes for identifying speakers and correcting words, then measures the effect on two real clinical datasets. On suicide-prevention telephone calls the method produces statistically significant drops in word diarization error rate (WDER); on awake neurosurgery consultations the error rate stays stable. The system never fails to produce output and runs at a real-time factor (RTF) of 0.32, which the authors present as evidence that the approach could be practical for offline clinical use once tested on larger collections.

Core claim

An iterative multi-pass LLM post-processing pipeline that alternates Speaker Recognition and Word Recognition passes reduces word diarization error rate (WDER) on French suicide-prevention counseling transcripts while leaving performance unchanged on awake neurosurgery consultation transcripts, with no output failures and an RTF of 0.32.

What carries the argument

Multi-pass LLM post-processing architecture that alternates Speaker Recognition and Word Recognition passes.

If this is right

Statistically significant WDER reduction on suicide-prevention telephone data (p<0.05, n=18)
Unchanged WDER on awake neurosurgery consultation data (n=10)
Zero output failures across all tested dialogues
Real-time factor of 0.32, compatible with offline clinical workflows
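
Both metrics in this list can be computed in a few lines. The sketch below assumes the usual definitions: WDER as the share of aligned words (correctly recognized or substituted) that carry the wrong speaker label, following the joint ASR and diarization work the paper cites, and RTF as processing time divided by audio duration. The `AlignedWord` type, the function names, and the sample values are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AlignedWord:
    """A reference word aligned to a hypothesis word.

    Insertions and deletions are assumed to be excluded upstream,
    so every item is either a correct word or a substitution.
    """
    ref_word: str
    hyp_word: str
    ref_speaker: str
    hyp_speaker: str


def wder(aligned: Sequence[AlignedWord]) -> float:
    """Fraction of aligned words whose speaker label is wrong."""
    if not aligned:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for w in aligned if w.ref_speaker != w.hyp_speaker)
    return wrong / len(aligned)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical toy alignment: one of four words has the wrong speaker.
example = [
    AlignedWord("bonjour", "bonjour", "Patient", "Patient"),
    AlignedWord("docteur", "docteur", "Patient", "Patient"),
    AlignedWord("asseyez", "essayez", "Neurosurgeon", "Patient"),
    AlignedWord("vous", "vous", "Neurosurgeon", "Neurosurgeon"),
]
print(f"WDER = {wder(example):.2f}")
print(f"RTF  = {rtf(96.0, 300.0):.2f}")
```

Under this definition a substituted word still counts toward WDER only through its speaker label, so WDER and WER measure different things and are normally reported side by side.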

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alternating-pass structure could be applied to other languages or medical specialties once larger training sets become available.
Combining the post-processor with different base ASR models might yield further gains without retraining the underlying recognizer.
Explicit checks for introduced hallucinations or factual drift would strengthen the case for routine clinical use.

Load-bearing premise

Two small clinical datasets and the particular prompting and ordering choices tested are enough to support claims of feasibility for offline clinical deployment.

What would settle it

A larger, more diverse collection of French clinical recordings on which the same multi-pass procedure produces no WDER reduction or introduces new factual errors in the corrected text.

Figures

Figures reproduced from arXiv: 2603.00086 by Ambre Marie (LaTIM), Guillaume Dardenne (LaTIM), Gwenolé Quellec (LaTIM), Thomas Bertin (DySoLab).

**Figure 1.** Representative examples of qualitative improvements. […] this establishes a stable foundation for iterative refinement strategies explored in subsequent sections. The comparable performance between GPT4omini and Qwen80B suggests that open-source models have reached sufficient capability for medical transcription post-processing tasks, provided that sufficient clinical context is given with structured …

Original abstract

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.
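
The significance claim rests on a paired comparison of per-conversation WDER before and after post-processing. The following is a minimal sketch of an exact two-sided Wilcoxon signed-rank test in plain Python, small enough to read end to end. Whether the authors used an exact or approximate p-value, or a one- or two-sided alternative, is not stated here, so those choices are assumptions, and the sample values are hypothetical rather than the paper's data.

```python
def wilcoxon_signed_rank(before, after):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped; tied absolute differences share
    their average rank. Returns (w_plus, p_value).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # ranks are stored doubled so tie averages stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * average of 1-based ranks i+1..j+1
        i = j + 1

    total2 = sum(ranks2)
    w2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)

    # counts[s] = number of sign patterns whose doubled positive-rank sum is s
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    denom = 2 ** n
    p_low = sum(counts[: w2 + 1]) / denom
    p_high = sum(counts[w2:]) / denom
    return w2 / 2, min(1.0, 2 * min(p_low, p_high))


# Hypothetical per-conversation WDER values (%), not the paper's data.
baseline = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
post_llm = [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
w_plus, p = wilcoxon_signed_rank(baseline, post_llm)
print(f"W+ = {w_plus}, p = {p:.5f}")
```

In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead. The point of the sketch is that with n=18 or n=10 the test has few distinct outcomes, which is one reason the sample-size concern raised later in this review matters.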

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The alternating LLM passes cut WDER on the suicide-prevention calls with a significant test, but the n=18/10 samples and missing checks on LLM content changes keep the deployment claim thin.

The main point is that this paper gives a practical recipe for running an 80B LLM in alternating speaker-recognition and word-recognition passes on French clinical audio, and it produces a statistically detectable WDER drop on one of the two small test sets. The ablations on model choice, prompting, ordering, and iteration count are the part that actually adds something usable for someone trying to replicate or extend the approach. They also report a clean 0.32 RTF and zero output failures, which is the kind of detail that matters when you are thinking about offline use. That combination of concrete design choices and runtime numbers is what makes the work worth reading rather than just another post-processing note.

The datasets are real clinical material, not simulated, and the Wilcoxon result on the n=18 set is reported plainly. The authors are straightforward that larger validation is still needed, which keeps the claims in proportion to the evidence they actually show.

The soft spots are exactly where the stress-test note flags them. With only 18 and 10 conversations the gains could still be tied to the particular prompts or the narrow domains tested. More critically, there is no audit of whether the word-recognition pass is recovering ground truth or quietly rewriting medically relevant content. In a clinical setting that distinction is load-bearing, and the current results do not address it.

The paper is aimed at people who already work on clinical ASR or LLM refinement pipelines and need a starting template for French or similar low-resource medical speech. A reader who wants to try the alternating-pass idea on their own data will get clear guidance on the four variables they varied. It is worth sending to peer review because the method is described at a level that allows reproduction and the statistical test is there to evaluate. A referee would likely ask for a larger held-out set and some manual or automatic check on semantic fidelity, but those are normal requests rather than reasons to desk-reject. I would send it out.

Referee Report

2 major / 2 minor

Summary. The paper proposes a multi-pass LLM post-processing architecture that alternates between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker diarization for French clinical interviews. Ablation studies examine four design choices (model selection, prompting strategy, pass ordering, and iteration depth) on two small French clinical datasets: suicide prevention telephone counseling (n=18) and preoperative awake neurosurgery consultations (n=10). Using Qwen3-Next-80B, the work reports statistically significant WDER reductions on the first dataset (p<0.05), stability on the second, zero output failures, and RTF of 0.32, concluding that the approach suggests feasibility for offline clinical deployment pending larger validation.

Significance. If the empirical gains hold under larger-scale testing, the method offers a practical, low-failure-rate way to post-process ASR output in French medical conversations where baseline WER often exceeds 30%. The ablation results, Wilcoxon tests, and reported computational cost provide concrete evidence of a workable pipeline that could be deployed offline without retraining ASR models.

major comments (2)

[Abstract] Abstract and Results: The central claim of feasibility for offline clinical deployment rests on Wilcoxon-significant WDER reductions with n=18 (suicide prevention) and n=10 (neurosurgery). These sample sizes are too small to support general deployment recommendations; the observed gains could reflect dataset-specific prompting artifacts rather than robust recovery of ground-truth content.
[Methods] Methods and Ablations: No details are provided on data splits, exact prompt templates, or manual audits for LLM-induced hallucinations or content alterations in the medical corrections. In a clinical setting, silent rewriting of patient statements would undermine the utility of any WDER improvement.
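
One low-cost way to support the audit requested in the second comment is to list every span the Word Recognition pass changed, so a reviewer reads only the edits rather than the whole transcript. The sketch below uses Python's standard `difflib` on whitespace-separated tokens. It illustrates the idea only and is not the authors' audit protocol; the function name and the example sentences are hypothetical.

```python
import difflib
from typing import List, Tuple


def changed_spans(asr_text: str, corrected_text: str) -> List[Tuple[str, str, str]]:
    """Return (operation, original words, corrected words) for every edit.

    Operations are difflib's 'replace', 'delete' and 'insert'; unchanged
    runs are skipped so the list contains only what a reviewer must check.
    """
    a = asr_text.split()
    b = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


# Hypothetical example: a dose word changed by the correction pass.
raw = "le patient prend dix milligrammes"
fixed = "le patient prend six milligrammes"
for op, before, after in changed_spans(raw, fixed):
    print(f"{op}: '{before}' -> '{after}'")
```

A diff of this kind only shows where the text moved. Deciding whether a change recovered what was actually said still requires the reference transcript or the audio.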

minor comments (2)

WDER and RTF are used without an initial definition or reference to standard formulas; adding these on first use would improve readability.
The description of the alternating pass architecture would benefit from a clearer diagram or pseudocode to show the exact iteration loop and termination criteria.
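
The loop requested in the last comment could look roughly like the sketch below. It follows the pass order quoted in the paper excerpts (SR Improvement, WR Improvement, SR Refinement, WR Refinement). Everything else is an assumption made for illustration: the `llm` callable, the JSON exchange format, the fallback to the previous transcript on unparseable output, and the early stop when a full round changes nothing are not taken from the authors' code.

```python
import json
from typing import Callable, Dict, List

Segment = Dict[str, str]  # {"speaker": ..., "text": ...}
PASS_ORDER = ["SR", "WR"]  # speaker recognition, then word recognition


def run_pass(kind: str, segments: List[Segment], llm: Callable[[str], str]) -> List[Segment]:
    """Run one pass; keep the previous transcript if the reply is unusable."""
    prompt = json.dumps({"task": kind, "segments": segments}, ensure_ascii=False)
    try:
        out = json.loads(llm(prompt))
    except json.JSONDecodeError:
        return segments
    valid = (
        isinstance(out, list)
        and len(out) == len(segments)
        and all(
            isinstance(s, dict)
            and isinstance(s.get("speaker"), str)
            and isinstance(s.get("text"), str)
            for s in out
        )
    )
    if not valid:
        return segments
    if kind == "SR":
        # An SR pass may only change speaker labels.
        return [{"speaker": o["speaker"], "text": s["text"]} for s, o in zip(segments, out)]
    # A WR pass may only change the words.
    return [{"speaker": s["speaker"], "text": o["text"]} for s, o in zip(segments, out)]


def refine(segments: List[Segment], llm: Callable[[str], str], depth: int = 2) -> List[Segment]:
    """Alternate SR and WR passes for `depth` rounds, stopping early at a fixed point."""
    current = segments
    for _ in range(depth):
        before = current
        for kind in PASS_ORDER:
            current = run_pass(kind, current, llm)
        if current == before:
            break
    return current


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only to exercise the loop."""
    request = json.loads(prompt)
    roles = {"SPEAKER_00": "Patient", "SPEAKER_01": "Neurosurgeon"}
    reply = []
    for seg in request["segments"]:
        if request["task"] == "SR":
            reply.append({"speaker": roles.get(seg["speaker"], seg["speaker"]), "text": seg["text"]})
        else:
            reply.append({"speaker": seg["speaker"], "text": seg["text"].replace("mots de tête", "maux de tête")})
    return json.dumps(reply, ensure_ascii=False)


transcript = [
    {"speaker": "SPEAKER_00", "text": "j'ai des mots de tête"},
    {"speaker": "SPEAKER_01", "text": "depuis quand"},
]
print(refine(transcript, fake_llm, depth=2))
```

With `depth=2` the loop runs the four passes named in the excerpts. Restricting each pass to one field (labels for SR, words for WR) is a design choice in this sketch that would also make silent rewrites easier to trace.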

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the feedback on sample sizes and methodological details. We respond to each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract] Abstract and Results: The central claim of feasibility for offline clinical deployment rests on Wilcoxon-significant WDER reductions with n=18 (suicide prevention) and n=10 (neurosurgery). These sample sizes are too small to support general deployment recommendations; the observed gains could reflect dataset-specific prompting artifacts rather than robust recovery of ground-truth content.

Authors: We agree that the small sample sizes limit the generalizability of our findings and that larger validation is essential before recommending deployment. The abstract already qualifies the conclusion with 'suggesting feasibility for offline clinical deployment, pending validation on larger corpora.' To further address this, we will revise the abstract and discussion to more explicitly highlight the preliminary nature of the results and the potential for dataset-specific effects. We maintain that the significant WDER reduction (p<0.05) on the suicide prevention set and stability on the neurosurgery set, using Wilcoxon tests, provide valuable initial evidence for the multi-pass approach in this domain. We will also add more context on why these datasets were chosen and their representativeness within French clinical speech.

revision: partial

Referee: [Methods] Methods and Ablations: No details are provided on data splits, exact prompt templates, or manual audits for LLM-induced hallucinations or content alterations in the medical corrections. In a clinical setting, silent rewriting of patient statements would undermine the utility of any WDER improvement.

Authors: We thank the referee for pointing this out. The original manuscript omitted these details for space reasons, but we will include them in the revision. Specifically: (1) Data splits: The 18 suicide prevention conversations and 10 neurosurgery consultations were used in their entirety for evaluation without train/test splits as this is a post-processing method on fixed ASR outputs; we will clarify this. (2) Exact prompt templates: We will add the full prompts for Speaker Recognition and Word Recognition passes to an appendix. (3) Manual audits: We performed manual reviews of a subset of outputs to check for hallucinations or content changes, finding none that altered medical meaning, and will describe the audit protocol and results. This ensures transparency regarding potential alterations. We agree that preventing silent rewriting is critical in clinical applications and will emphasize how the iterative passes are constrained to recognition tasks rather than generation.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on external datasets

The paper reports an empirical evaluation of an LLM post-processing pipeline on two external clinical datasets (n=18 and n=10), using ablation studies over four design choices and Wilcoxon signed-rank tests to measure WDER changes. No equations, fitted parameters, or self-citations are presented that reduce the reported improvements to inputs by construction. The central feasibility claim rests on direct measurement against ground-truth transcriptions rather than any self-definitional or self-citation chain, making the derivation self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical testing of an LLM refinement pipeline rather than new theory; design choices like iteration depth and prompting are explored via ablation but treated as tunable without independent theoretical justification.

free parameters (2)

iteration depth: tested as one of four design choices in ablation studies to optimize performance
prompting strategy: varied across experiments as a key variable affecting pass effectiveness

axioms (2)

domain assumption: Large language models can reliably alternate between speaker attribution and word correction tasks in clinical French speech without introducing new errors. Invoked as the basis for the multi-pass architecture in the abstract.
domain assumption: The two tested clinical conversation types are representative enough to indicate deployment feasibility. Used to support the conclusion of acceptable RTF and zero failures.

pith-pipeline@v0.9.0 · 5699 in / 1550 out tokens · 105229 ms · 2026-05-21T12:17:27.390790+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes... ablation studies... model selection, prompting strategy, pass ordering, and iteration depth... Qwen3-Next-80B... WDER reductions... RTF 0.32

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

In clinical interviews, ASR performance remains far below that observed in controlled dictation settings

Introduction Automatic speech recognition (ASR) still presents major limitations for concrete medical applications in French, where transcription errors can directly affect clinical analysis and downstream uses [1]. In clinical interviews, ASR performance remains far below that observed in controlled dictation settings. While dictation tasks often...

[2]

Material and Methods The complete source code, including the transcription pipeline, LLM post-processing implementation, and domain-adapted prompts, is publicly available at https://github.com/amarie-research/iterative-llm-clinical-transcription. 2.1. Datasets Experiments are conducted on two French medical speech datasets drawn from distinct clinical...

[3]

SR Improvement: Maps generic speaker labels (SPEAKER_00, SPEAKER_01) to clinical roles (Patient, Neurosurgeon, etc.) based on conversational patterns and medical terminology

[4]

WR Improvement: Corrects ASR errors, resolves homophones using the identified clinical context, and anonymizes personal names while preserving spontaneous speech markers

[5]

SR Refinement: Re-evaluates speaker attributions using the corrected transcript to resolve ambiguities that were obscured by initial ASR errors

[6]

Expert in neurosurgical consultation analysis

WR Refinement: Re-evaluates lexical corrections using the refined speaker attributions to resolve remaining transcription ambiguities that were obscured by initial speaker errors. This architecture is designed based on three key principles validated through systematic ablation studies presented in Section 3. All experiments use Qwen3-Next-80B, a recen...

[7]

Results 3.1. Model selection To identify the optimal model for our application, three LLM candidates are first compared in a single-pass configuration where the model performs joint SR and WR in a single inference. Table 2 presents the comparative performance on both clinical datasets. Table 2: Model comparison in single-pass configuration WDER Model AN...

[8]

Model selection Table 2 presents the comparative performance of three LLM candidates in both clinical datasets

Discussion 4.1. Model selection Table 2 presents the comparative performance of three LLM candidates in both clinical datasets. QwenVL shows systematic degradation relative to baseline on AN (WDER +2.43 points) and produces three unparseable outputs. This was expected, as QwenVL is optimized for vision-language tasks and the 8B parameter scale is insuff...

[9]

Perspectives Several research directions could extend this work. Recent work suggests that Chain-of-Thought (CoT) prompting, where the LLM is instructed to explicitly verbalize its reasoning process before making corrections, can improve transcription accuracy on English medical consultations [7]. Integrating CoT reasoning into the SR pass could help th...

[10]

Conclusion This work proposes a N-pass LLM post-processing architecture for French clinical interview transcription, validated through systematic ablation studies on two distinct clinical datasets: suicide prevention telephone counseling and preoperative awake neurosurgery consultations. Large-scale open-source models (Qwen3-Next-80B, 80B parameters) ma...

[11]

Allocations de Recherche Doctorale

Acknowledgments This research was approved by an Ethic Committee and the institutional review boards of participating hospitals. All participants provided informed consent for their conversations to be used for research purposes. Audio recordings were pseudonymized prior to processing, and all personally identifiable information was replaced with gene...

[12]

Transcription automatique des interactions verbales. limites observées et perspectives envisagées à partir d’un corpus de consultations médicales

T. Bertin and G. Quellec, “Transcription automatique des interactions verbales. limites observées et perspectives envisagées à partir d’un corpus de consultations médicales,” Corpus, no. 26, 2025

[13]

The digital scribe in clinical practice: a scoping review and research agenda

M. V. van Buchem, H. Boosman, M. P. Bauer, I. Kant, S. Cammel, and E. Steyerberg, “The digital scribe in clinical practice: a scoping review and research agenda,” NPJ Digital Medicine, vol. 4, 2021

[14]

Development of an asr system for medical conversations

A. Renato, D. Luna, and S. Benitez, “Development of an asr system for medical conversations,” Studies in health technology and informatics, vol. 310, pp. 664–668, 2024

[15]

Context is all you need? low-resource conversational asr profits from context, coming from the same or from the other speaker

J. Linke, J. Winkler, and B. Schuppler, “Context is all you need? low-resource conversational asr profits from context, coming from the same or from the other speaker,” in Interspeech 2025, 2025

[16]

Evaluation of asr systems for conversational speech: A linguistic perspective

H. B. Pasandi and H. B. Pasandi, “Evaluation of asr systems for conversational speech: A linguistic perspective,” Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 2022

[17]

Opportunities and challenges of automatic speech recognition systems for low-resource language speakers

T. Reitmaier, E. Wallington, D. Raju, O. Klejch, J. Pearson, M. Jones, P. Bell, and S. Robinson, “Opportunities and challenges of automatic speech recognition systems for low-resource language speakers,” Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022

[18]

The sound of healthcare: Improving medical transcription asr accuracy with large language models

A. Adedeji, S. Joshi, and B. Doohan, “The sound of healthcare: Improving medical transcription asr accuracy with large language models,” arXiv preprint arXiv:2402.07658, 2024

[19]

Convention icor

I. Groupe, “Convention icor,” Lyon: université de Lyon. URL: http://icar.cnrs.fr/projets/corinte/documents/2013_Conv_ICOR_250313.pdf, 2013

[20]

Whisperx: Time-accurate speech transcription of long-form audio

M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” INTERSPEECH 2023, 2023

[21]

Pyannote.audio: neural building blocks for speaker diarization

H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7124–7128

[22]

Qwen3 Technical Report

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

[23]

A review of depression and suicide risk assessment using speech analysis

N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, 2015

[24]

Acoustic and machine learning methods for speech-based suicide risk assessment: A systematic review

A. Marie, M. Garnier, T. Bertin, L. Machart, G. Dardenne, G. Quellec, and S. Berrouiguet, “Acoustic and machine learning methods for speech-based suicide risk assessment: A systematic review,” arXiv preprint arXiv:2505.18195, 2025

[25]

Joint speech recognition and speaker diarization via sequence transduction

L. E. Shafey, H. Soltau, and I. Shafran, “Joint speech recognition and speaker diarization via sequence transduction,” pp. 396–400, 2019

[26]

Automatic speech recognition performance for digital scribes: a performance comparison between general-purpose and specialized models tuned for patient-clinician conversations

B. D. Tran, M. Tai-Seale, R. Mangu, J. Lafata, and K. Zheng, “Automatic speech recognition performance for digital scribes: a performance comparison between general-purpose and specialized models tuned for patient-clinician conversations,” AMIA ... Annual Symposium proceedings. AMIA Symposium, vol. 2022, pp. 1072–1080, 2022

[27]

Lexical speaker error correction: Leveraging language models for speaker diarization error correction

R. Paturi, S. Srinivasan, and X. Li, “Lexical speaker error correction: Leveraging language models for speaker diarization error correction,” pp. 3567–3571, 2023

[28]

Ag-lsec: Audio grounded lexical speaker error correction

R. Paturi, X. Li, and S. Srinivasan, “Ag-lsec: Audio grounded lexical speaker error correction,” ArXiv, vol. abs/2406.17266, 2024

[29]

Wilcoxon signed-rank test

R. F. Woolson, “Wilcoxon signed-rank test,” Wiley encyclopedia of clinical trials, pp. 1–3, 2007

[30]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” ArXiv, vol. abs/2308.12966, 2023

[31]

Large language models encode clinical knowledge

K. Singhal, S. Azizi, T. Tu, S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. G. Seneviratne, P. Gamble, C. Kelly, N. Scharli, A. Chowdhery, P. A. Mansfield, B. A. Y. Arcas, D. Webster, G. S. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomašev, Y. Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalinga...

[32]

Calibrate before use: Improving few-shot performance of language models

T. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” pp. 12697–12706, 2021

[33]

Llm-based speaker diarization correction: A generalizable approach

G. Efstathiadis, V. Yadav, and A. Abbas, “Llm-based speaker diarization correction: A generalizable approach,” ArXiv, vol. abs/2406.04927, 2024

[1] [1]

In clinical interviews, ASR performance remains far below that observed in controlled dictation set- tings

Introduction Automatic speech recognition (ASR) still presents major limi- tations for concrete medical applications in French, where tran- scription errors can directly affect clinical analysis and down- stream uses [1]. In clinical interviews, ASR performance remains far below that observed in controlled dictation set- tings. While dictation tasks often...

work page

[2] [2]

Material and Methods The complete source code, including the transcrip- tion pipeline, LLM post-processing implementation, and domain-adapted prompts, is publicly available at https://github.com/amarie-research/ iterative-llm-clinical-transcription . 2.1. Datasets Experiments are conducted on two French medical speech datasets drawn from distinct clinical...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

SR Improvement : Maps generic speaker labels (SPEAKER 00, SPEAKER 01) to clinical roles (Pa- tient, Neurosurgeon, etc.) based on conversational patterns and medical terminology

work page

[4] [4]

WR Improvement: Corrects ASR errors, resolves homo- phones using the identified clinical context, and anonymizes personal names while preserving spontaneous speech mark- ers

work page

[5] [5]

SR Refinement: Re-evaluates speaker attributions using the corrected transcript to resolve ambiguities that were obscured by initial ASR errors

work page

[6] [6]

Expert in neurosurgical con- sultation analysis

WR Refinement: Re-evaluates lexical corrections using the refined speaker attributions to resolve remaining transcription ambiguities that were obscured by initial speaker errors. This architecture is designed based on three key principles val- idated through systematic ablation studies presented in Sec- tion 3. All experiments use Qwen3-Next-80B, a recen...

work page

[7] [7]

Results 3.1. Model selection To identify the optimal model for our application, three LLM candidates are first compared in a single-pass configuration where the model performs joint SR and WR in a single infer- ence. Table 2 presents the comparative performance on both clinical datasets. Table 2: Model comparison in single-pass configuration WDER Model AN...

work page

[8] [8]

Model selection Table 2 presents the comparative performance of three LLM candidates in both clinical datasets

Discussion 4.1. Model selection Table 2 presents the comparative performance of three LLM candidates in both clinical datasets. QwenVL shows systematic degradation relative to baseline on AN (WDER +2.43 points) and produces three unparseable outputs. This was expected, as QwenVL is optimized for vision-language tasks and the 8B pa- rameter scale is insuff...

work page

[9] [9]

Perspectives Several research directions could extend this work. Recent work suggests that Chain-of-Thought (CoT) prompting, where the LLM is instructed to explicitly verbalize its reasoning process before making corrections, can improve transcription accuracy on English medical consultations [7]. Integrating CoT reason- ing into the SR pass could help th...

work page

[10] [10]

Conclusion This work proposes a N-pass LLM post-processing architecture for French clinical interview transcription, validated through systematic ablation studies on two distinct clinical datasets: sui- cide prevention telephone counseling and preoperative awake neurosurgery consultations. Large-scale open-source models (Qwen3-Next-80B, 80B parameters) ma...

work page

[11] [11]

Allocations de Recherche Doc- torale

Acknowledgments This research was approved by an Ethic Committee and the institutional review boards of participating hospitals. All par- ticipants provided informed consent for their conversations to be used for research purposes. Audio recordings were pseudonymized prior to processing, and all personally identi- fiable information was replaced with gene...

work page

[12] [12]

Transcription automatique des inter- actions verbales. limites observ ´ees et perspectives envisag ´ees `a partir d’un corpus de consultations m ´edicales,

T. Bertin and G. Quellec, “Transcription automatique des inter- actions verbales. limites observ ´ees et perspectives envisag ´ees `a partir d’un corpus de consultations m ´edicales,” Corpus, no. 26, 2025

work page 2025

[13] [13]

The digital scribe in clinical practice: a scop- ing review and research agenda,

M. V . van Buchem, H. Boosman, M. P. Bauer, I. Kant, S. Cammel, and E. Steyerberg, “The digital scribe in clinical practice: a scop- ing review and research agenda,” NPJ Digital Medicine , vol. 4, 2021

work page 2021

[14] [14]

Development of an asr sys- tem for medical conversations,

A. Renato, D. Luna, and S. Benitez, “Development of an asr sys- tem for medical conversations,” Studies in health technology and informatics, vol. 310, pp. 664–668, 2024

work page 2024

[15] [15]

Context is all you need? low-resource conversational asr profits from context, coming from the same or from the other speaker,

J. Linke, J. Winkler, and B. Schuppler, “Context is all you need? low-resource conversational asr profits from context, coming from the same or from the other speaker,” inInterspeech 2025, 2025

work page 2025

[16] H. B. Pasandi and H. B. Pasandi, “Evaluation of asr systems for conversational speech: A linguistic perspective,” Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 2022

[17] T. Reitmaier, E. Wallington, D. Raju, O. Klejch, J. Pearson, M. Jones, P. Bell, and S. Robinson, “Opportunities and challenges of automatic speech recognition systems for low-resource language speakers,” Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022

[18] A. Adedeji, S. Joshi, and B. Doohan, “The sound of healthcare: Improving medical transcription asr accuracy with large language models,” arXiv preprint arXiv:2402.07658, 2024

[19] I. Groupe, “Convention icor,” Lyon: université de Lyon. URL: http://icar.cnrs.fr/projets/corinte/documents/2013_Conv_ICOR_250313.pdf, 2013

[20] M. Bain, J. Huh, T. Han, and A. Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” INTERSPEECH 2023, 2023

[21] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7124–7128

[22] Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

[23] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, 2015

[24] A. Marie, M. Garnier, T. Bertin, L. Machart, G. Dardenne, G. Quellec, and S. Berrouiguet, “Acoustic and machine learning methods for speech-based suicide risk assessment: A systematic review,” arXiv preprint arXiv:2505.18195, 2025

[25] L. E. Shafey, H. Soltau, and I. Shafran, “Joint speech recognition and speaker diarization via sequence transduction,” pp. 396–400, 2019

[26] B. D. Tran, M. Tai-Seale, R. Mangu, J. Lafata, and K. Zheng, “Automatic speech recognition performance for digital scribes: a performance comparison between general-purpose and specialized models tuned for patient-clinician conversations,” AMIA ... Annual Symposium proceedings. AMIA Symposium, vol. 2022, pp. 1072–1080, 2022

[27] R. Paturi, S. Srinivasan, and X. Li, “Lexical speaker error correction: Leveraging language models for speaker diarization error correction,” pp. 3567–3571, 2023

[28] R. Paturi, X. Li, and S. Srinivasan, “Ag-lsec: Audio grounded lexical speaker error correction,” ArXiv, vol. abs/2406.17266, 2024

[29] R. F. Woolson, “Wilcoxon signed-rank test,” Wiley encyclopedia of clinical trials, pp. 1–3, 2007

[30] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,” ArXiv, vol. abs/2308.12966, 2023

[31] K. Singhal, S. Azizi, T. Tu, S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. G. Seneviratne, P. Gamble, C. Kelly, N. Scharli, A. Chowdhery, P. A. Mansfield, B. A. Y. Arcas, D. Webster, G. S. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomašev, Y. Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalinga... “Large language models encode clinical knowledge,” 2022

[32] T. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” pp. 12697–12706, 2021

[33] G. Efstathiadis, V. Yadav, and A. Abbas, “Llm-based speaker diarization correction: A generalizable approach,” ArXiv, vol. abs/2406.04927, 2024