
arxiv: 2605.26747 · v1 · pith:2PGHVFRLnew · submitted 2026-05-26 · 💻 cs.AI

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Pith reviewed 2026-06-29 17:40 UTC · model grok-4.3

keywords medical dialogues · spoken language processing · robot-patient dialogues · doctor-patient dialogues · large language models · sentence selection benchmark · health conditions · MeDial-Speech

The pith

A new dataset of 111+ hours of robot-patient and doctor-patient medical dialogues enables benchmarking of LLMs on consultation sentence selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces MeDial-Speech, a speech dataset gathered from realistic robot-patient and doctor-patient dialogues covering four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. The collection totals over 111 hours without augmentation and is positioned for training and evaluating AI systems that perform medical consultations. The authors define a sentence-selection benchmark with 20 options per turn and test three large language models, showing Claude Sonnet 4 reaches 71.1% accuracy on manual transcriptions and 74.7% on automatic transcriptions while all models remain highly overconfident in their predictions. The work targets the open problem of applying LLMs effectively to spoken medical interactions by supplying a dedicated, publicly available resource.

Core claim

The paper establishes MeDial-Speech as a resource of 111+ hours of speech from robot-patient and doctor-patient dialogues across four health conditions, paired with a sentence-selection benchmark that identifies Claude Sonnet 4 as the strongest of three tested LLMs at 71.1% accuracy on manual transcriptions and 74.7% on automatic transcriptions, while documenting that all evaluated models exhibit high overconfidence irrespective of whether their selected sentence is correct.

What carries the argument

The MeDial-Speech dataset of spoken medical dialogues together with its sentence-selection benchmark using 20 options per turn.
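
To make the task concrete, here is a minimal sketch of how top-1 accuracy on a 20-option sentence-selection benchmark could be scored. The function names and toy data are illustrative assumptions, not the paper's evaluation code; the only detail taken from the paper is the number of options per turn, and the chance baseline assumes exactly one correct option per turn.

```python
from typing import Sequence

NUM_OPTIONS = 20  # options per turn, as stated in the paper


def selection_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of turns where the chosen option index equals the gold index."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if len(gold) == 0:
        raise ValueError("need at least one turn")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def chance_accuracy(num_options: int = NUM_OPTIONS) -> float:
    """Expected accuracy of uniform random guessing among the options."""
    return 1.0 / num_options


# Hypothetical toy data: 4 turns, 3 of them answered correctly.
toy_gold = [3, 17, 0, 9]
toy_pred = [3, 17, 5, 9]
toy_acc = selection_accuracy(toy_pred, toy_gold)
```

Under that one-correct-option assumption, uniform guessing scores 5%, which is the baseline against which the reported 71.1% and 74.7% can be read.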

If this is right

  • The dataset supplies training material specifically for spoken medical consultation tasks.
  • Automatic speech recognition transcriptions can serve as a viable alternative to manual ones for model evaluation in this domain.
  • Large language models require additional calibration techniques before deployment in medical dialogue settings due to consistent overconfidence.
  • Both robot-patient and doctor-patient interaction data become available for spoken language processing research.
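
The overconfidence point can be illustrated with a small sketch of two common calibration summaries: the gap between mean confidence and accuracy, and the expected calibration error (ECE). These are standard measures swapped in for illustration; the abstract does not say which measure the authors used, and all names and toy values below are assumptions.

```python
from typing import List, Sequence, Tuple


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean confidence minus accuracy; positive values indicate overconfidence."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(confidences) / n - sum(1 for ok in correct if ok) / n


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - acc)
    return ece


# Hypothetical toy data: the model is always 90% sure but right half the time.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, False, False]
toy_gap = overconfidence_gap(toy_conf, toy_correct)
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

A well-calibrated model would have both quantities near zero; a model that stays confident whether it is right or wrong, as described above, would show a large positive gap.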

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resource could support development of AI assistants that handle spoken medical exchanges more reliably than current general models.
  • The documented overconfidence pattern suggests a broader need for uncertainty-aware methods when applying LLMs to high-stakes conversational domains.
  • Robot-patient dialogues within the collection may enable separate study of human-robot medical communication patterns.

Load-bearing premise

The collected dialogues are representative enough of real medical consultations to be useful for training and evaluating medical AI systems.

What would settle it

A controlled test in which models trained or fine-tuned on MeDial-Speech perform no better than general-purpose models when evaluated on actual patient consultations with human doctors would indicate the dataset does not deliver the claimed training or evaluation value.

Original abstract

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MeDial-Speech, a speech dataset of 111+ hours collected from robot-patient and doctor-patient dialogues in realistic environments, covering four conditions (Lewy body dementia, heart failure, shoulder pain, angina). It proposes a sentence-selection benchmark (20 options) to evaluate LLMs on medical dialogues, reporting that Claude Sonnet 4 achieves the highest accuracy (71.1% on manual transcriptions, 74.7% on automatic transcriptions) while all tested models (including GPT-5 mini and DeepSeek-V3) exhibit high overconfidence in their predictions. The dataset is released freely for non-commercial use via Hugging Face.

Significance. A large-scale spoken medical dialogue dataset would address a clear resource gap for training and evaluating spoken Med-AIs. The sentence-selection benchmark supplies a reproducible task and initial LLM comparison, and the public release itself is a concrete contribution. However, the significance is conditional on the dialogues being sufficiently representative of actual clinical interactions; without supporting details this remains an open question.

major comments (3)
  1. [Abstract] Abstract: The claim that the dataset supports training and evaluating Med-AIs that 'can carry out consultations with patients' rests on the unverified assumption that the collected dialogues are representative of real medical consultations. No information is supplied on participant type (real patients vs. actors), elicitation method (scripted vs. free-form), presence of time pressure or physical exams, or any quantitative validation against existing real consultation corpora.
  2. [Abstract] Abstract (benchmark results): The reported accuracies of 71.1% and 74.7% and the statement that 'all LLMs are highly overconfident' are given without error bars, confidence intervals, or statistical significance tests comparing models or conditions. This prevents assessment of whether the observed differences and overconfidence pattern are reliable.
  3. [Abstract] Abstract: The benchmark relies on both manual and automatic transcriptions, yet no details are provided on transcription accuracy, inter-annotator agreement, or word-error-rate of the automatic system. These factors directly affect the validity of the 71.1% and 74.7% figures.
minor comments (1)
  1. [Abstract] Abstract: The model identifier 'GPT-5 mini' is non-standard; please specify the exact model versions and checkpoints used.
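
As an illustration of the uncertainty reporting requested in major comment 2, the sketch below computes a Wilson score interval for a reported accuracy. The number of evaluated turns is not given in the abstract, so the count of 1000 used here is purely hypothetical, and the function name is an assumption rather than anything from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical: 711 correct selections out of 1000 turns (71.1% accuracy).
low, high = wilson_interval(711, 1000)
```

With this hypothetical sample size the interval spans roughly 68% to 74%, wide enough that a 3.6-point difference between two conditions would need a paired significance test before being called reliable.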

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to provide the requested details.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the dataset supports training and evaluating Med-AIs that 'can carry out consultations with patients' rests on the unverified assumption that the collected dialogues are representative of real medical consultations. No information is supplied on participant type (real patients vs. actors), elicitation method (scripted vs. free-form), presence of time pressure or physical exams, or any quantitative validation against existing real consultation corpora.

    Authors: The abstract summarizes the collection process but omits granular methodological details due to space limits. The full manuscript states that dialogues were collected in realistic environments from robot-patient and doctor-patient interactions, but we agree that explicit information on participant types, elicitation procedures, time pressure, physical exams, and validation against real corpora would better support claims of representativeness. We will expand the Methods section in the revision to include these details. revision: yes

  2. Referee: [Abstract] Abstract (benchmark results): The reported accuracies of 71.1% and 74.7% and the statement that 'all LLMs are highly overconfident' are given without error bars, confidence intervals, or statistical significance tests comparing models or conditions. This prevents assessment of whether the observed differences and overconfidence pattern are reliable.

    Authors: The abstract reports point accuracies and the overconfidence observation without accompanying statistical measures. We agree that error bars, confidence intervals, and significance tests are needed to assess reliability. We will add these analyses to the results section and update the abstract in the revised manuscript. revision: yes

  3. Referee: [Abstract] Abstract: The benchmark relies on both manual and automatic transcriptions, yet no details are provided on transcription accuracy, inter-annotator agreement, or word-error-rate of the automatic system. These factors directly affect the validity of the 71.1% and 74.7% figures.

    Authors: Transcription details were not included in the abstract. We will add information on manual transcription inter-annotator agreement and the word-error-rate of the automatic system to the revised manuscript to support the reported benchmark figures. revision: yes
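
For readers unfamiliar with the word-error-rate figure requested in point 3, the following is a minimal sketch of how it is conventionally computed: word-level edit distance between a reference and an automatic transcript, divided by the reference length. The function name and example sentences are hypothetical, and no text normalisation (casing, punctuation) is applied, which a real evaluation would need to specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word and one substituted word.
toy_wer = word_error_rate("do you have any chest pain", "do you have chest pains")
```

Here two errors against six reference words give a WER of about 33%.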

Circularity Check

0 steps flagged

No circularity: dataset release with empirical benchmark only

Full rationale

This is a data-collection and benchmarking paper with no equations, derivations, fitted parameters, or predictions. The central claims rest on the existence of the collected MeDial-Speech corpus and the reported LLM accuracies on a sentence-selection task; neither reduces to a self-referential fit or self-citation chain. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset release with no mathematical derivations, fitted parameters, or postulated entities; the only background assumptions are standard ones about data collection ethics and transcription validity, none of which are novel to the paper.

pith-pipeline@v0.9.1-grok · 5761 in / 1147 out tokens · 27851 ms · 2026-06-29T17:40:51.944051+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    MOTIVATION Effective communication between doctors and patients is an essential aspect of the UK’s NHS core value, ‘commitment to quality of care’ [1]. Exposure to medical consultations is crucial for trainees, whether human or machine, in the medical field to enhance their competence in communication, clinical reasoning, and professionalism. This paper ...

  2. [2]

    PREVIOUS WORKS Speech-based medical datasets publicly available are scarce. Some exceptions include [2], who used four participants for a total of 9 consultations with audio and video recorded, [3], who collected and annotated 272 human-human consultations, [4], who reported a dataset of 57 primary care consultations, [5], who collected and annotated 32...

  3. [3]

    let’s move on

    THE DATA COLLECTION SYSTEM The Pepper robot [19] was equipped with a teleoperated system using in-person and remote embodied telepresence. In this setup, the teleoperator was embodied in the body of the robot by hearing and seeing what the robot perceives and by saying what the robot conveys to the person in front of it. This Wizard-of-Oz (WOZ) setting...

  4. [4]

    THE SPEECH DATASET 4.1. Data Collection We collected data from 325 recruited and unpaid participants, mostly university students and from different schools (but mainly the Schools of Medicine and Computer Science at UoL) with age categories 18-24 (87.1%), 25-34 (8.9%), and 35+ (4.0%). All participants were fluent speakers of English, including native & n...

  5. [5]

    We focus on making data available with reference transcriptions and labels, and leave data splits of training, validation, & test open

    DATA USES: BENCHMARKS AND TOOLS The MeDial-Speech dataset can be used for benchmarking the following Spoken Language Processing tasks. We focus on making data available with reference transcriptions and labels, and leave data splits of training, validation, & test open. 1. Voice Activity Detection (VAD). This task can be performed, for example, by ran...

  6. [6]

    The former included in-person and remote consultations, whereas the latter were carried out face-to-face

    CONCLUSIONS AND FUTURE WORK We present a novel dataset of spoken medical consultations, collected in realistic environments from robot-patient and doctor-patient dialogues, covering four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. The former included in-person and remote consultations, whereas the latter were carried o...

  7. [7]

    ACKNOWLEDGEMENTS We would like to acknowledge the Lincoln Medical School for reviewing and approving the ethics applications for collecting this dataset, our 14 BMedSci students for conducting the consultations, our 325 participants who took part as unpaid actor patients, our robot, and our 20 data annotators

  8. [8]

    The NHS values,

    National Health Service (NHS), “The NHS values,” 2025, Accessed: 2025-09-17

  9. [9]

    A multimodal corpus of simulated consultations between a patient and multiple healthcare professionals,

    Mark Snaith, Nicholas Conway, Tessa Beinema, et al., “A multimodal corpus of simulated consultations between a patient and multiple healthcare professionals,” Lang. Resour. Evaluation, vol. 55, no. 4, 2021

  10. [10]

    A dataset of simulated patient-physician medical interviews with a focus on respiratory cases,

    Faiha Fareez, Tishya Parikh, Christopher Wavell, et al., “A dataset of simulated patient-physician medical interviews with a focus on respiratory cases,” Scientific Data, vol. 9, 2022

  11. [11]

    Primock57: A dataset of primary care mock consultations,

    Alex Papadopoulos-Korfiatis, Francesco Moramarco, Radmila Sarac, and Aleksandar Savkov, “Primock57: A dataset of primary care mock consultations,” in Annual Meeting of the ACL, 2022

  12. [12]

    One in a million: a study of primary care consultations,

    R. Barnes et al., “One in a million: a study of primary care consultations,” 2017

  13. [13]

    Medical dialogues audio dataset,

    Defined.ai, “Medical dialogues audio dataset,” 2023

  14. [14]

    Speech recognition for medical conversations,

    Chung-Cheng Chiu, Anshuman Tripathi, Katherine Chou, et al., “Speech recognition for medical conversations,” in Interspeech, 2018

  15. [15]

    The sound of healthcare: Improving medical transcription ASR accuracy with large language models,

    Ayo Adedeji, Sarita Joshi, and Brendan Doohan, “The sound of healthcare: Improving medical transcription ASR accuracy with large language models,” arXiv preprint arXiv:2402.07658, 2024

  16. [16]

    MultiMed: Multilingual medical speech recognition via attention encoder decoder,

    Khai Le-Duc, Phuc Phan, Tan-Hanh Pham, et al., “MultiMed: Multilingual medical speech recognition via attention encoder decoder,” arXiv preprint arXiv:2409.14074, 2024

  17. [17]

    MIMIC-III, a freely accessible critical care database,

    A.E.W. Johnson, T.J. Pollard, L. Shen, L.H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L.A. Celi, and R.G. Mark, “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, 2016

  18. [18]

    MIMIC-IV, a freely accessible electronic health record dataset,

    A.E.W. Johnson, L. Bulgarelli, T.J. Pollard, et al., “MIMIC-IV, a freely accessible electronic health record dataset,” Scientific Data, vol. 9, 2022

  19. [19]

    MedINST: Meta dataset of biomedical instructions,

    Wenhan Han, Meng Fang, Zihan Zhang, et al., “MedINST: Meta dataset of biomedical instructions,” Nov. 2024, Association for Computational Linguistics

  20. [20]

    A survey of large language models in medicine: Progress, application, and challenge,

    Hongjian Zhou, Fenglin Liu, Boyang Gu, et al., “A survey of large language models in medicine: Progress, application, and challenge,” 2024

  21. [21]

    Large language models for medicine: a survey,

    Yanxin Zheng, Wensheng Gan, Zefeng Chen, Zhenlian Qi, Qian Liang, and Philip S. Yu, “Large language models for medicine: a survey,” Int. J. Mach. Learn. Cybern., vol. 16, no. 2, pp. 1015–1040, 2025

  22. [22]

    The AI doctor is in: A survey of task-oriented dialogue systems for healthcare applications,

    Mina Valizadeh and Natalie Parde, “The AI doctor is in: A survey of task-oriented dialogue systems for healthcare applications,” in Annual Meeting of the ACL

  23. [23]

    A systematic review on healthcare artificial intelligent conversational agents for chronic conditions,

    Abdullah Bin Sawad, Bhuva Narayan, Ahlam Alnefaie, et al., “A systematic review on healthcare artificial intelligent conversational agents for chronic conditions,” Sensors, vol. 22, no. 7, 2022

  24. [24]

    A survey of robots in healthcare,

    Maria Kyrarini, Fotios Lygerakis, Akilesh Rajavenkatanarayanan, et al., “A survey of robots in healthcare,” Technologies, vol. 9, no. 1, 2021

  25. [25]

    The use and promise of conversational agents in digital health,

    Tilman Dingler, Dominika Kwasnicka, Jing Wei, et al., “The use and promise of conversational agents in digital health,” IMIA Yearbook of Medical Informatics, 2021

  26. [26]

    A mass-produced sociable humanoid robot: Pepper: The first machine of its kind,

    Amit Kumar Pandey and Rodolphe Gelin, “A mass-produced sociable humanoid robot: Pepper: The first machine of its kind,” IEEE Robotics Autom. Mag., 2018

  27. [27]

    Woz4u: An open-source wizard-of-oz interface for easy, efficient and robust HRI experiments,

    Finn Rietz, Alexander Sutherland, Suna Bensch, et al., “Woz4u: An open-source wizard-of-oz interface for easy, efficient and robust HRI experiments,” Frontiers in Robotics and AI, 2021

  28. [28]

    Laying Down the Yellow Brick Road: Development of a Wizard-of-Oz Interface for Collecting Human-Robot Dialogue

    Claire Bonial, Matthew Marge, Ron Artstein, et al., “Laying down the yellow brick road: Development of a wizard-of-oz interface for collecting human-robot dialogue,” vol. abs/1710.06406, 2017

  29. [29]

    GPT5-mini,

    OpenAI, “GPT5-mini,” 2025, Accessed: 2025-09-17

  30. [30]

    DeepSeek V3,

    DeepSeek, “DeepSeek V3,” 2025, Accessed: 2025-09-17

  31. [31]

    Claude Sonnet 4,

    Anthropic, “Claude Sonnet 4,” 2025, Accessed: 2025-09-17

  32. [32]

    The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities,

    Henry F. Inman and Edwin L. Bradley, “The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities,” Communications in Statistics - Theory and Methods, vol. 18, no. 10, 1989

  33. [33]

    Predicting good probabilities with supervised learning,

    Alexandru Niculescu-Mizil and Rich Caruana, “Predicting good probabilities with supervised learning,” in International Conference on Machine Learning (ICML), 2005