Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
Recognition: 1 theorem link · Lean theorem
Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3
The pith
Synthetic doctor-patient audio conversations with SOAP notes reveal that cascaded systems still outperform end-to-end models for long-form summarization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A pipeline of persona-driven dialogue generation followed by multi-speaker audio synthesis with acoustic modeling and then LLM-based SOAP note creation can produce 8,800 realistic conversations totaling 1.3k hours of audio; when used to evaluate open-weight systems, cascaded approaches still substantially outperform end-to-end models for long-form medical audio summarization.
What carries the argument
The three-stage synthetic data generation pipeline: (1) persona-driven dialogue generation, (2) multi-speaker audio synthesis, including overlap/pause modeling, room acoustics, and sound events, and (3) LLM-based reference SOAP note production.
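The authors' generation code is not reproduced on this page, so the following is a minimal Python sketch of the three-stage shape described above. Every helper name (sample_personas, generate_dialogue, write_soap_note, call_llm), prompt wording, and persona field is a hypothetical stand-in for illustration, not the paper's implementation.

```python
# Minimal sketch of the three-stage pipeline described above.
# All helper names and prompt wording are illustrative assumptions,
# not the authors' released code.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Persona:
    role: str          # "doctor" or "patient"
    age: int
    background: str    # e.g. presenting complaint or specialty

def sample_personas() -> Tuple[Persona, Persona]:
    # Persona and context sampling that conditions stage 1 (fixed examples here).
    return (Persona("doctor", 45, "family medicine, first visit"),
            Persona("patient", 62, "two weeks of hand swelling"))

def generate_dialogue(doctor: Persona, patient: Persona) -> List[Tuple[str, str]]:
    # Stage 1: persona-driven text dialogue via an open-weight LLM.
    prompt = (f"Write a realistic first-visit conversation between a "
              f"{doctor.background} doctor and a {patient.age}-year-old "
              f"patient with {patient.background}. Include natural disfluencies.")
    text = call_llm(prompt)
    return [tuple(line.split(": ", 1)) for line in text.splitlines() if ": " in line]

def synthesize_audio(dialogue: List[Tuple[str, str]]):
    # Stage 2: per-turn multi-speaker TTS, then overlap/pause timing,
    # room acoustics, and background sound events are layered on.
    raise NotImplementedError

def write_soap_note(dialogue: List[Tuple[str, str]]) -> str:
    # Stage 3: LLM-based reference SOAP note from the speaker-attributed transcript.
    transcript = "\n".join(f"{spk}: {utt}" for spk, utt in dialogue)
    return call_llm("Write a SOAP note for this visit:\n" + transcript)

def call_llm(prompt: str) -> str:
    # Placeholder for any open-weight chat model; returns canned text so the
    # sketch runs end-to-end without a model.
    return "doctor: Hello, what brings you in today?\npatient: Um, my hand is swollen."
```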
If this is right
- The released dataset supplies training material for models that must reason over long audio without needing private real recordings.
- The performance gap indicates that end-to-end audio models require further advances to match cascaded pipelines on extended medical dialogues.
- Controlled synthetic audio allows repeatable experiments on the effects of overlap, background noise, and speaker turns in summarization.
- Open-weight models can now bootstrap additional domain-specific long-context audio datasets following the same stages.
Where Pith is reading between the lines
- If the acoustic modeling proves sufficiently realistic, similar pipelines could generate training data for other long-form audio tasks such as meetings or lectures while avoiding privacy barriers.
- The continued superiority of cascaded systems suggests that current end-to-end audio models still struggle with the combination of long duration and precise factual extraction required in medical settings.
- The dataset opens the possibility of studying how specific acoustic degradations affect summarization accuracy in a repeatable way.
Load-bearing premise
The synthetic dialogues and audio are realistic enough to serve as both training data and a controlled evaluation environment for real-world long-form medical audio summarization.
What would settle it
A direct comparison of model rankings and absolute performance on the synthetic dataset versus a held-out collection of real recorded doctor-patient conversations.
Original abstract
Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages: persona-driven dialogue generation; multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and LLM-based reference SOAP note production, built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.
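The second stage named in the abstract (overlap/pause modeling, room acoustics, sound events) can be approximated with off-the-shelf tools the paper cites, such as pyroomacoustics. A minimal sketch follows; the room dimensions, absorption coefficient, source positions, and the 1.5 s overlap delay are illustrative guesses rather than the authors' settings, and random noise stands in for synthesized speech.

```python
# Sketch of the room-acoustics / overlap step, assuming pyroomacoustics.
# All geometry and timing values are illustrative, not the paper's parameters.
import numpy as np
import pyroomacoustics as pra

fs = 16000
room = pra.ShoeBox([4.0, 3.5, 2.7], fs=fs, materials=pra.Material(0.35), max_order=10)

doctor_speech = 0.1 * np.random.randn(2 * fs)    # stand-in for a TTS turn
patient_speech = 0.1 * np.random.randn(2 * fs)   # stand-in for the reply

# Two spatially separated speakers; the delay makes the turns partially overlap.
room.add_source([1.0, 1.0, 1.5], signal=doctor_speech, delay=0.0)
room.add_source([3.0, 2.5, 1.5], signal=patient_speech, delay=1.5)

# One "recorder" microphone between the speakers (e.g., a device on the desk).
mic = pra.MicrophoneArray(np.array([[2.0], [1.8], [1.0]]), fs)
room.add_microphone_array(mic)

room.simulate()
mixture = room.mic_array.signals[0]  # reverberant, overlapped single-channel mix
```

Background sound events could then be layered onto the mixture, for example with Scaper, which the paper also cites.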
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a synthetic data generation pipeline for long-form doctor-patient conversations aimed at training and evaluating audio summarization models, specifically for generating SOAP notes. The pipeline involves persona-driven LLM dialogue generation, multi-speaker audio synthesis incorporating overlaps, pauses, room acoustics, and sound events, and LLM-based reference SOAP note production. All components use open-weight models. The authors release a dataset of 8,800 synthetic conversations comprising 1.3k hours of audio along with reference notes, and evaluate open-weight systems on this data, concluding that cascaded approaches (ASR + LLM) substantially outperform end-to-end models.
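For readers unfamiliar with the two system families compared here, a cascaded baseline is an ASR pass followed by a text-only LLM prompt. The sketch below is not the paper's evaluated configuration: it uses the open-source whisper package as a stand-in ASR and leaves the LLM call as a stub.

```python
# Hedged sketch of a cascaded (ASR + LLM) SOAP-note system.
# The ASR choice, prompt wording, and LLM client are assumptions, not the paper's setup.
import whisper  # pip install openai-whisper

def transcribe(audio_path: str) -> str:
    """Stage 1: speech-to-text with any ASR model (here: Whisper 'base')."""
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"]

def call_llm(prompt: str) -> str:
    """Placeholder for an open-weight instruction-tuned LLM client."""
    raise NotImplementedError("plug in your LLM of choice")

def cascaded_soap_note(audio_path: str) -> str:
    """Stage 2: prompt a text-only LLM with the transcript to draft the SOAP note."""
    transcript = transcribe(audio_path)
    prompt = ("You are a clinical scribe. Write a SOAP note (Subjective, Objective, "
              "Assessment, Plan) for the following first-visit conversation:\n" + transcript)
    return call_llm(prompt)
```

An end-to-end system would instead feed the audio directly to a large audio language model in a single step; the first major comment below asks for the metrics and significance tests behind the claimed gap between the two.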
Significance. If the synthetic data is sufficiently realistic, this work provides a valuable open resource for training and evaluating long-context audio reasoning in the medical domain, where real data is scarce due to privacy constraints. The release of 8,800 conversations with 1.3k hours of audio, built entirely on open-weight models, is a clear strength that promotes reproducibility and community use. The empirical result on cascaded vs. end-to-end performance offers a benchmark that could inform model development, provided the dataset's fidelity supports generalization.
major comments (2)
- [Evaluation of current open-weight systems] The claim that cascaded approaches substantially outperform end-to-end models is presented without details on the exact evaluation metrics for SOAP note quality (e.g., ROUGE, BERTScore, or LLM-as-judge), the specific open-weight models and configurations tested, or statistical significance of the gap. This information is necessary to assess the result's robustness.
- [Pipeline description and data release] No section reports validation of the synthetic dialogues or audio against real doctor-patient recordings, such as clinician ratings of clinical nuance, acoustic similarity metrics (e.g., MOS or spectrogram comparisons), or distributional checks (turn lengths, disfluency rates, medical terminology frequency). This is load-bearing for the assertion that the dataset serves as a controlled evaluation environment, since unvalidated artifacts could artifactually favor cascaded pipelines over end-to-end audio models.
minor comments (2)
- [Abstract] The abstract would be strengthened by explicitly naming the task (SOAP note generation from long-form audio) and the dataset scale to better frame the contribution for readers.
- [Experiments] Clarify the exact number of models evaluated and any hyperparameter details in the experimental setup to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. We have revised the manuscript to improve clarity and completeness where feasible while remaining faithful to the synthetic nature of the work.
Point-by-point responses
- Referee: [Evaluation of current open-weight systems] The claim that cascaded approaches substantially outperform end-to-end models is presented without details on the exact evaluation metrics for SOAP note quality (e.g., ROUGE, BERTScore, or LLM-as-judge), the specific open-weight models and configurations tested, or statistical significance of the gap. This information is necessary to assess the result's robustness.
Authors: We agree that the evaluation section would benefit from greater explicitness. In the revised manuscript we have expanded the relevant section to specify the exact metrics employed for SOAP note quality (ROUGE-1/2/L, BERTScore, and an LLM-as-judge protocol), to enumerate the precise open-weight models and hyperparameter configurations used for both the cascaded (ASR + LLM) and end-to-end pipelines, and to report statistical significance testing (bootstrap confidence intervals and paired tests) for the observed performance differences. These additions directly address the robustness concern. revision: yes
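The metrics and significance test named in this response are easy to reproduce from per-conversation scores. The sketch below is one possible realization, assuming the rouge-score package the paper cites; the choice of ROUGE-L F1, the 10,000 resamples, and the 95% level are illustrative defaults, not the authors' exact protocol.

```python
# Sketch: ROUGE-L F1 per conversation plus a paired bootstrap CI for the
# cascaded-vs-end-to-end gap. Resample count and metric choice are assumptions.
import random
from rouge_score import rouge_scorer

def rouge_l_f1(references, hypotheses):
    """One ROUGE-L F1 score per (reference note, generated note) pair."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return [scorer.score(ref, hyp)["rougeL"].fmeasure
            for ref, hyp in zip(references, hypotheses)]

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for mean(scores_a - scores_b); an interval excluding 0 indicates
    a significant gap between the two systems on the same conversations."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = sorted(sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
                   for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi
```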
- Referee: [Pipeline description and data release] No section reports validation of the synthetic dialogues or audio against real doctor-patient recordings, such as clinician ratings of clinical nuance, acoustic similarity metrics (e.g., MOS or spectrogram comparisons), or distributional checks (turn lengths, disfluency rates, medical terminology frequency). This is load-bearing for the assertion that the dataset serves as a controlled evaluation environment, since unvalidated artifacts could artifactually favor cascaded pipelines over end-to-end audio models.
Authors: We recognize that explicit validation against real recordings would strengthen claims about the dataset serving as a controlled environment. Because of strict privacy regulations, we have no access to real doctor-patient audio for direct comparison. In the revision we have added a new subsection detailing the pipeline's design choices that target realism (persona conditioning for clinical content, explicit modeling of overlaps/pauses/disfluencies, room acoustics, and medical terminology drawn from public medical dialogue resources). We also report internal distributional statistics of the generated data (turn lengths, disfluency rates, terminology frequency) and compare them to publicly available non-private medical dialogue corpora. We cannot supply clinician ratings or MOS scores against real audio. revision: partial
- Not provided in the revision: direct clinician ratings of clinical nuance or acoustic similarity metrics (MOS, spectrogram comparisons) against real doctor-patient recordings, since privacy regulations preclude access to such real data.
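The internal distributional statistics mentioned in the response (turn lengths, disfluency rates, terminology frequency) can be computed directly from speaker-attributed transcripts. The sketch below is one possible implementation; the filler lexicon, the regex tokenizer, and the externally supplied medical-term list are assumptions rather than the paper's method.

```python
# Sketch: distributional checks on speaker-attributed transcripts.
# The filler lexicon, tokenizer, and term list are assumptions, not the paper's.
import re
from typing import Dict, Iterable, List, Tuple

FILLERS = {"um", "uh", "er", "hmm", "mhm"}  # assumed disfluency markers

def dialogue_stats(dialogue: List[Tuple[str, str]],
                   medical_terms: Iterable[str]) -> Dict[str, float]:
    """dialogue: list of (speaker, utterance) pairs; medical_terms: e.g. a MeSH word list."""
    terms = {t.lower() for t in medical_terms}
    turn_lengths, fillers, term_hits, total = [], 0, 0, 0
    for _, utterance in dialogue:
        tokens = re.findall(r"[a-z']+", utterance.lower())
        turn_lengths.append(len(tokens))
        fillers += sum(tok in FILLERS for tok in tokens)
        term_hits += sum(tok in terms for tok in tokens)
        total += len(tokens)
    return {
        "mean_turn_length": sum(turn_lengths) / max(len(turn_lengths), 1),
        "disfluency_rate": fillers / max(total, 1),
        "terminology_rate": term_hits / max(total, 1),
    }
```

Comparing these statistics between the synthetic corpus and public non-private medical dialogue corpora is the kind of distributional check the response describes.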
Circularity Check
No circularity: empirical evaluation on generated data with no derivations or fitted predictions
full rationale
The paper describes a three-stage synthetic data pipeline (persona-driven LLM dialogue generation, multi-speaker audio synthesis, LLM SOAP note production) and reports direct empirical comparisons of cascaded vs. end-to-end models on the resulting 8,800 conversations. No equations, parameter fits, uniqueness theorems, or self-citations are invoked as load-bearing steps in any derivation chain. The central claim is an observation from evaluation on the constructed corpus rather than a reduction of outputs to inputs by construction. This is standard empirical work on synthetic benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic data generated via persona-driven dialogue, multi-speaker audio synthesis, and LLM notes sufficiently approximates real doctor-patient interactions for training and evaluation purposes.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: "We propose a synthetic data generation pipeline... persona-driven dialogue generation, multi-speaker audio synthesis... LLM-based reference SOAP note production"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
However, these benchmarks focus predominantly on short-context tasks
Introduction Toward the broader goal of human-level audio understanding, recent large audio language models (LALMs) [1, 2, 3, 4] have demonstrated impressive progress on benchmarks for audio processing and comprehension [5, 6, 7, 8]. However, these benchmarks focus predominantly on short-context tasks. Our understanding of LALM performance on long-c...
2025
-
[2]
Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
Related Work Recent advances have dramatically expanded the context windows of Large Audio Language Models (LALMs). While early attempts at end-to-end (E2E) speech summarization struggled with the quadratic memory complexity of processing long audio sequences [11], current systems can ingest continuous audio ranging from 40 minutes to over eight hou...
2026
-
[3]
hand swelling,
Transcript and Audio Generation We now describe the three stages of our data generation pipeline, each targeting a specific gap identified above: (1) persona and context sampling, (2) persona-conditioned text dialogue generation, and (3) audio synthesis with acoustic simulation. Throughout, we use Gemma3-27B-IT [23] as the LLM, selected for its perf...
-
[4]
viral illness
SOAP Note Generation and Evaluation Using the speaker-attributed transcript produced by the pipeline, we generate reference SOAP notes and evaluate all systems via two-stage processes, both using Kimi K2 Think... [Footnote 6: Audio samples and example SOAP notes are provided as supplementary material for review. Footnote 7: "wet": the final augmented audio, optionally Opus-compressed.]
-
[5]
reference SOAP note production, built entirely on open-weight models, yielding 8,800 conversations and 1,329 hours of audio
Conclusion We presented a fully synthetic pipeline (persona-conditioned dialogue generation, multi-speaker audio synthesis, and reference SOAP note production) built entirely on open-weight models, yielding 8,800 conversations and 1,329 hours of audio. Cascaded systems subs... [Footnote 8: Medical concept F1: MeSH keyword matching + NER via en_core_sci_md (scispaCy [37]).]
-
[6]
The contribution by Markus Müller was made in his capacity as a workshop leader and does not necessarily reflect the official position of Amazon
Acknowledgement The authors would like to thank Markus Müller (Amazon AGI) for his valuable discussions, leadership, and guidance throughout the duration of the workshop. The contribution by Markus Müller was made in his capacity as a workshop leader and does not necessarily reflect the official position of Amazon. This work was supported by the 2025 ...
2025
-
[7]
Qwen3-Omni technical report,
J. Xu, Z. Guo, H. Hu et al., "Qwen3-Omni technical report,"
-
[8]
[Online]. Available: https://arxiv.org/abs/2509.17765
work page internal anchor Pith review arXiv
-
[9]
Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,
S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S. gil Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, "Audio flamingo 3: Advancing audio intelligence with fully open large audio language models," in The Thirty-ninth Annual Conference on Neural Information Processing Systems,
-
[10]
Available: https://openreview.net/forum?id=FjByDpDVIO
[Online]. Available: https://openreview.net/forum?id=FjByDpDVIO
-
[11]
Amazon nova sonic: Technical report and model card,
Amazon Artificial General Intelligence, "Amazon nova sonic: Technical report and model card," Amazon Technical Reports, 2025. [Online]. Available: https://www.amazon.science/publications/amazon-nova-sonic-technical-report-and-model-card
2025
-
[12]
Baichuan-Omni-1.5 technical report,
Y. Li, J. Liu, T. Zhang et al., "Baichuan-Omni-1.5 technical report," 2025. [Online]. Available: https://arxiv.org/abs/2501.15368
2025
-
[13]
Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,
S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, "Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities," in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=xWu5qpDK6U
2025
-
[14]
Smith, Yulia Tsvetkov, and Sachin Kumar
O. Ahia, M. Bartelds, K. Ahuja et al., "BLAB: Brutally long audio bench," 2025. [Online]. Available: https://arxiv.org/abs/2505.03054
-
[15]
J. Chen, Z. Guo, J. Chun, P. Wang, A. Perrault, and M. Elsner, "Do audio LLMs really LISTEN, or just transcribe? Measuring lexical vs. acoustic emotion cues reliance," in Proceedings of the 19th Conference of the European Chapter of the Association for Computational Lingu... [Footnote 9: Train/Dev released before Interspeech 2026; Test data withheld until December 2026.]
2026
-
[16]
S. Kumar, Šimon Sedláček, V. Lokegaonkar et al., "MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence," 2025. [Online]. Available: https://arxiv.org/abs/2508.13992
-
[17]
How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation,
C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau, "How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016
2016
-
[18]
On faithfulness and factuality in abstractive summarization,
J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, "On faithfulness and factuality in abstractive summarization," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul. 2020
2020
-
[19]
End-to-end speech summarization using restricted self-attention,
R. Sharma, A. Gupta, S. Kumar, and F. Metze, "End-to-end speech summarization using restricted self-attention," in Proc. ICASSP. IEEE, 2022, pp. 8072–8076
2022
-
[20]
Closing the modality reasoning gap for speech large language models,
J. Xiang, S. Zhang, W. Zhou, and Y. Liu, "Closing the modality reasoning gap for speech large language models," in Proc. IEEE ASRU, 2025
2025
-
[21]
The cascade equivalence hypothesis: When do speech LLMs behave like ASR→LLM pipelines?
J. Billa, "The cascade equivalence hypothesis: When do speech LLMs behave like ASR→LLM pipelines?" arXiv preprint arXiv:2602.17598, 2026
-
[22]
The medical scribe: Corpus development and model performance analyses,
I. Shafran, N. Du, L. Tran, A. Perry, L. Keyes, M. Knichel, A. Domin, L. Huang, Y.-h. Chen, G. Li, M. Wang, L. El Shafey, H. Soltau, and J. S. Paul, "The medical scribe: Corpus development and model performance analyses," in Proceedings of the Twelfth Language Resources and Evaluation Conference, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,...
2020
-
[23]
Understanding medical conversations: Rich transcription, confidence scores & information extraction,
H. Soltau, M. Wang, I. Shafran, and L. E. Shafey, "Understanding medical conversations: Rich transcription, confidence scores & information extraction," in Interspeech, 2021
2021
-
[24]
Synthetic patient–physician conversations simulated by large language models: A multi-dimensional evaluation,
S. A. Haider, S. Prabha, C. A. Gomez-Cabello, S. Borna, A. Genovese, M. Trabilsy, B. G. Collaco, N. G. Wood, S. Bagaria, C. Tao, and A. J. Forte, "Synthetic patient–physician conversations simulated by large language models: A multi-dimensional evaluation," Sensors, vol. 25, no. 14, 2025
2025
-
[25]
NoteChat: A dataset of synthetic patient-physician conversations conditioned on clinical notes,
J. Wang, Z. Yao, Z. Yang, H. Zhou, R. Li, X. Wang, Y. Xu, and H. Yu, "NoteChat: A dataset of synthetic patient-physician conversations conditioned on clinical notes," in Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024
2024
-
[26]
PriMock57: A dataset of primary care mock consultations,
A. Papadopoulos Korfiatis, F. Moramarco, R. Sarac, and A. Savkov, "PriMock57: A dataset of primary care mock consultations," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022
2022
-
[27]
ACI-BENCH: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation,
W.-w. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, "ACI-BENCH: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation," Scientific Data, vol. 10, no. 1, p. 586, 2023
2023
-
[28]
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition,
S. Cornell, J. Darefsky, Z. Duan, and S. Watanabe, "Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition," in Synthetic Data's Transformative Role in Foundational Speech Models, 2024
2024
-
[29]
Judging llm-as-a-judge with mt-bench and chatbot arena,
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging llm-as-a-judge with mt-bench and chatbot arena," in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., ...
2023
-
[30]
G-eval: NLG evaluation using gpt-4 with better human alignment,
Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, "G-eval: NLG evaluation using gpt-4 with better human alignment," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., 2023
2023
-
[31]
Gemma 3 technical report,
G. Team, A. Kamath, J. Ferret et al., "Gemma 3 technical report,"
-
[32]
[Online]. Available: https://arxiv.org/abs/2503.19786
-
[33]
S. Burdisso, S. Baroudi, Y. Labrak et al., "Sdialog: A python toolkit for end-to-end agent building, user simulation, dialog generation, and evaluation," 2026. [Online]. Available: https://arxiv.org/abs/2506.10622
2026
-
[34]
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,
H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech," in Interspeech 2019, 2019, pp. 1526–1530
2019
-
[35]
Scaper: A library for soundscape synthesis and augmentation,
J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, "Scaper: A library for soundscape synthesis and augmentation," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 344–348
2017
-
[36]
Pyroomacoustics: A python package for audio room simulation and array processing algorithms,
R. Scheibler, E. Bezzam, and I. Dokmanić, "Pyroomacoustics: A python package for audio room simulation and array processing algorithms," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE Press, 2018, pp. 351–355. [Online]. Available: https://doi.org/10.1109/ICASSP.2018.8461310
-
[37]
Qwen3-ASR technical report,
X. Shi, X. Wang, Z. Guo et al., "Qwen3-ASR technical report,"
-
[38]
[Online]. Available: https://arxiv.org/abs/2601.21337
-
[39]
Uni-VERSA: Versatile Speech Assessment with a Unified Network,
J. Shi, H. jin Shim, and S. Watanabe, "Uni-VERSA: Versatile Speech Assessment with a Unified Network," in Interspeech 2025, 2025, pp. 1798–1802
2025
-
[40]
The t05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,
K. Baba, W. Nakata, Y. Saito, and H. Saruwatari, "The t05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech," in IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 818–824
2024
-
[41]
Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe
D. E, A. Meena, M. Nanivadekar, N. A, V. Azad, A. N. Shenoy, P. R. Chowdhuri, S. Banga, V. Chhabra, C. Bhat, S. babu Kalluri, S. R. Chetupalli, D. Vijayasenan, and S. Ganapathy, "Benchmarking speech systems for frontline health conversations: The displace-m challenge," 2026. [Online]. Available: https://arxiv.org/abs/2603.02813
-
[42]
Robust speech recognition via large-scale weak supervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23– ...
2023
-
[43]
A. Yang, A. Li, B. Yang et al., "Qwen3 technical report," 2025. [Online]. Available: https://arxiv.org/abs/2505.09388
2025
-
[44]
Kimi K2: Open Agentic Intelligence
K. Team, Y. Bai, Y. Bao et al., "Kimi K2: Open agentic intelligence," 2026. [Online]. Available: https://arxiv.org/abs/2507.20534
2026
-
[45]
Revisiting text decomposition methods for NLI-based factuality scoring of summaries,
J. Glover, F. Fancellu, V. Jagannathan, M. R. Gormley, and T. Schaaf, "Revisiting text decomposition methods for NLI-based factuality scoring of summaries," in Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), A. Bosselut, K. Chandu, K. Dhole, V. Gangal, S. Gehrmann, Y. Jernite, J. Novikova, and L. Perez-B...
2022
-
[46]
rouge-score: A python implementation of rouge,
Google Research, "rouge-score: A python implementation of rouge," 2019. [Online]. Available: https://github.com/google-research/google-research/tree/master/rouge
2019
-
[47]
ScispaCy: Fast and robust models for biomedical natural language processing,
M. Neumann, D. King, I. Beltagy, and W. Ammar, "ScispaCy: Fast and robust models for biomedical natural language processing," in Proceedings of the 18th BioNLP Workshop and Shared Task, D. Demner-Fushman, K. B. Cohen, S. Ananiadou, and J. Tsujii, Eds. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 319–327. [Online]. Available: h...
2019
-
[48]
Leveraging pretrained models for automatic summarization of doctor-patient conversations,
L. Zhang, R. Negrinho, A. Ghosh, V. Jagannathan, H. R. Hassanzadeh, T. Schaaf, and M. R. Gormley, "Leveraging pretrained models for automatic summarization of doctor-patient conversations," in Findings of the ACL: EMNLP 2021, 2021, pp. 3693–3712. [Online]. Available: https://aclanthology.org/2021.findings-emnlp.313/
2021
-
[49]
Manuscript preparation. Large language models were used to assist with proofreading, improving conciseness, and formatting LaTeX tables
Generative AI Use Disclosure Generative AI tools were used in two distinct ways in this work. Manuscript preparation. Large language models were used to assist with proofreading, improving conciseness, and formatting LaTeX tables. All such use was directed and reviewed by an author; AI tools produced no significant portions of the manuscript without subseq...