Synthesis and Evaluation of Long-term History-aware Medical Dialogue

Ah-Hwee Tan; Hebin Hu; Renke Dai; Yilin Kang

arXiv: 2605.19766 · v1 · pith:F4VE4SQFnew · submitted 2026-05-19 · cs.CL · cs.AI

Pith reviewed 2026-05-20 06:38 UTC · model grok-4.3

classification cs.CL · cs.AI

keywords medical dialogue · long-term history · synthetic data generation · healthcare agents · LLM benchmarks · longitudinal reasoning · dialogue synthesis

The pith

A new framework creates long-term medical dialogue datasets that expose gaps in LLM history reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The absence of realistic long-term medical dialogue datasets has prevented proper testing of healthcare agents that must recall patient histories across visits. This paper presents a knowledge-guided synthesis method that first constructs diverse patient profiles with disease trajectories, then produces multi-turn dialogues for each visit, and finally combines them into coherent longitudinal datasets known as MediLongChat. Three benchmark tasks are defined to measure in-dialogue reasoning, cross-dialogue reasoning, and synthesis reasoning over the history. A quality evaluation uses both automatic metrics like faithfulness and coherence and LLM-based judgments of correctness and realism. Tests indicate that even leading LLMs have difficulty succeeding on these benchmarks.

Core claim

Through a three-stage synthesis process, the authors generate the MediLongChat dataset of high-quality long-term medical dialogues and establish benchmarks that demonstrate the struggles of state-of-the-art LLMs with recalling and reasoning over extended patient medical histories.

What carries the argument

Knowledge-guided decomposition into three stages for creating synthetic patient profiles, encounter-specific dialogues, and integrated longitudinal histories, along with the In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning tasks.
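The three stages can be pictured as a small data pipeline. The sketch below is illustrative only: the class names, field names, and the templated dialogue text are assumptions made for this page, and the place where the paper would call an LLM with knowledge guidance is replaced by a fixed string template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientProfile:
    """Stage 1 output (hypothetical shape): a patient with an ordered disease trajectory."""
    patient_id: str
    trajectory: List[str]


@dataclass
class Encounter:
    """Stage 2 output (hypothetical shape): one visit and its multi-turn dialogue."""
    visit_index: int
    condition: str
    turns: List[str] = field(default_factory=list)


def build_profile(patient_id: str, trajectory: List[str]) -> PatientProfile:
    # Stage 1 stand-in: the paper builds profiles with knowledge guidance;
    # here the trajectory is simply passed in by hand.
    return PatientProfile(patient_id, list(trajectory))


def generate_encounter(profile: PatientProfile, visit_index: int) -> Encounter:
    # Stage 2 stand-in: a real pipeline would prompt an LLM at this point.
    condition = profile.trajectory[visit_index]
    turns = [
        "Doctor: What brings you in today?",
        f"Patient: I am here about my {condition}.",
    ]
    return Encounter(visit_index, condition, turns)


def integrate_history(profile: PatientProfile) -> List[Encounter]:
    # Stage 3 stand-in: order the per-visit dialogues into one longitudinal history.
    return [generate_encounter(profile, i) for i in range(len(profile.trajectory))]


profile = build_profile(
    "p001", ["hypertension", "type 2 diabetes", "diabetic neuropathy"]
)
history = integrate_history(profile)
```

Keeping the profile separate from the dialogues means every later question about the history can be checked against the structured trajectory rather than against free text.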

If this is right

The benchmark will encourage creation of AI healthcare agents capable of managing longitudinal patient information.
Specialized techniques for long-term reasoning will be essential to improve medical dialogue systems.
The combined vector and LLM-as-judge evaluation method provides a template for assessing synthetic data quality in sensitive domains.
Advancing healthcare agents requires addressing memory limitations shown by current models on MediLongChat.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This synthesis technique may transfer to generating long-context datasets in other regulated fields such as legal consultations.
Confirmation that the synthetic data matches real clinical patterns would support broader use without privacy risks.
Models augmented with dedicated long-term memory systems could be tested as a direct response to the observed performance issues.

Load-bearing premise

The synthetic dialogues generated by LLMs are realistic and consistent enough to represent actual long-term clinical patient histories.

What would settle it

Evaluating the same reasoning tasks using real multi-visit clinical dialogue records, if available under privacy protections, to see if model performance matches the results on the synthetic MediLongChat.

Figures

Figures reproduced from arXiv: 2605.19766 by Ah-Hwee Tan, Hebin Hu, Renke Dai, Yilin Kang.

**Figure 2.** Overview of our dataset generation pipeline. Stage 1 builds records with knowledge guidance; Stage 2 generates per ...

Original abstract

An effective healthcare agent must be able to recall and reason over a patient's longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define automatic measures (Faithfulness, Coherence, and Diversity) together with two LLM-based evaluations: Correctness and Realism. Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark's applicability and underscore the need for tailored methods to advance healthcare agents.
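The abstract names the vector-based measures but this page does not give their formulas. As a rough sketch of what such measures can look like, the code below scores Coherence as the mean similarity of adjacent turns, Diversity as one minus the mean pairwise similarity between dialogues, and Faithfulness as the similarity between a profile description and its dialogue. All three definitions are assumptions for illustration, not the paper's, and the bag-of-words `embed` function is a stand-in for the sentence embeddings a real evaluation would presumably use.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def embed(text: str) -> Counter:
    # Stand-in embedding (assumption): lowercase bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coherence(turns: List[str]) -> float:
    # Assumed definition: mean similarity between each turn and the next one.
    vectors = [embed(t) for t in turns]
    pairs = list(zip(vectors, vectors[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def diversity(dialogues: List[List[str]]) -> float:
    # Assumed definition: 1 minus mean pairwise similarity across dialogues.
    vectors = [embed(" ".join(d)) for d in dialogues]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)


def faithfulness(profile_text: str, dialogue: List[str]) -> float:
    # Assumed definition: similarity between the profile and the full dialogue.
    return cosine(embed(profile_text), embed(" ".join(dialogue)))
```

Scores from a lexical stand-in like this are only useful for checking the plumbing; swapping `embed` for a learned sentence encoder changes the numbers but not the structure of the computation.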

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper creates a synthetic longitudinal medical dialogue dataset and three reasoning tasks that highlight LLM weaknesses on cross-session memory, but the lack of real clinical grounding leaves the results preliminary.

The main takeaway is that the authors built MediLongChat through a three-stage LLM pipeline that generates patient profiles, then per-visit dialogues, then stitched-together histories, and defined In-dialogue, Cross-dialogue, and Synthesis Reasoning tasks to test long-term memory in healthcare agents. Current single-session benchmarks miss this, so the gap they target is real and practical for medical AI work. Their quality checks pair vector metrics (faithfulness, coherence, diversity) with LLM judgments (correctness, realism), and the experiments show even strong models falter on the cross-session parts. That gives a concrete signal worth following up on. The approach is straightforward and directly addresses privacy barriers to using actual records.

The soft spot is the data itself. Everything rests on LLM-generated content evaluated only by automatic scores and more LLM judgments, with no clinician review of full timelines or direct comparison to de-identified real EHR sequences. If the synthetic histories underplay things like contradictory updates or irregular disease courses that show up in actual care, then the reported struggles could partly be an artifact of the generation process rather than a pure test of model limits.

This is for researchers building or evaluating long-context medical dialogue systems who need a starting benchmark for longitudinal reasoning. Readers working on agent memory or healthcare LLMs would find the tasks and results useful as a first probe, though they should treat the numbers as directional.

The paper shows clear engagement with the problem and prior single-session work. It deserves a serious referee to get feedback on adding human validation or real-data comparisons. I would send it to peer review.

Referee Report

1 major / 1 minor

Summary. The paper introduces a framework for synthesizing long-term medical dialogues via LLM-based knowledge-guided decomposition into patient profile construction with diverse disease trajectories, per-encounter multi-turn dialogue generation, and longitudinal integration to produce the MediLongChat dataset. It defines three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to probe healthcare agents' memory and reasoning over extended histories. Data quality is assessed through automatic vector metrics (Faithfulness, Coherence, Diversity) plus LLM-as-judge evaluations for Correctness and Realism. Experiments demonstrate that state-of-the-art LLMs struggle on these tasks, motivating the need for tailored methods.

Significance. If the synthetic longitudinal dialogues prove representative of real clinical timelines, the benchmark could usefully expose gaps in current LLMs' cross-session memory and synthesis capabilities for healthcare applications, providing a concrete testbed that existing isolated-interaction benchmarks lack.

major comments (1)

[Data Quality Evaluation] The assessment of synthetic dialogues uses only automatic metrics (Faithfulness, Coherence, Diversity) and LLM-as-judge ratings (Correctness, Realism), with no reported clinician review of complete patient timelines, no comparison against de-identified real EHR dialogue sequences, and no measurement of properties such as incomplete information, contradictory updates, or non-stationary progression. The central claim, that LLM struggles on MediLongChat indicate a genuine capability gap, rests on the dataset's realism, so this omission is load-bearing; without such grounding the reported failures could be artifacts of the generation process.

minor comments (1)

[Abstract] The abstract states that benchmark experiments show LLMs struggle but does not report any quantitative scores or specific failure modes; adding a brief summary of key metrics (e.g., accuracy on each task) would improve clarity.
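The per-task summary the referee asks for could be produced with something as simple as the sketch below. It assumes each benchmark item is a `(task, predicted, gold)` triple scored by case-insensitive exact match; the task labels, the record format, and the scoring rule are assumptions for illustration, and the toy records are invented placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical task labels for the three benchmark tasks.
TASKS = ("in_dialogue", "cross_dialogue", "synthesis")


def per_task_accuracy(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Exact-match accuracy per task from (task, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        # Assumed scoring rule: case-insensitive match after trimming whitespace.
        if predicted.strip().lower() == gold.strip().lower():
            correct[task] += 1
    # Report only tasks that have at least one item.
    return {task: correct[task] / total[task] for task in TASKS if total[task]}


# Toy records for demonstration only; these are not benchmark outputs.
toy_records = [
    ("in_dialogue", "Metformin", "metformin"),
    ("in_dialogue", "aspirin", "ibuprofen"),
    ("cross_dialogue", " yes ", "yes"),
    ("synthesis", "stable", "worsening"),
]
scores = per_task_accuracy(toy_records)
```

Reporting the three numbers side by side would make it visible whether the difficulty is concentrated in the cross-session tasks, which is the claim the abstract leaves unquantified.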

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comment on data quality evaluation below, providing our honest assessment and planned revisions.

Referee: [Data Quality Evaluation] The assessment of synthetic dialogues uses only automatic metrics (Faithfulness, Coherence, Diversity) and LLM-as-judge ratings (Correctness, Realism), with no reported clinician review of complete patient timelines, no comparison against de-identified real EHR dialogue sequences, and no measurement of properties such as incomplete information, contradictory updates, or non-stationary progression. The central claim, that LLM struggles on MediLongChat indicate a genuine capability gap, rests on the dataset's realism, so this omission is load-bearing; without such grounding the reported failures could be artifacts of the generation process.

Authors: We appreciate the referee's point that stronger grounding in clinical realism would bolster the central claims. Our multi-dimensional evaluation was intentionally designed around scalable, reproducible metrics (vector-based Faithfulness, Coherence, and Diversity together with LLM-as-judge Correctness and Realism), following established practices in synthetic medical data generation. We acknowledge, however, that these proxies do not replace direct clinician review of full longitudinal timelines or explicit quantification of properties such as incomplete information, contradictory updates, or non-stationary progression. In the revised manuscript we will add a dedicated Limitations subsection in the Data Quality Evaluation section that explicitly discusses these gaps, reports any feasible post-hoc checks for the mentioned properties within the synthetic framework, and outlines concrete plans for future clinician validation studies. Direct comparison against de-identified real EHR dialogue sequences remains infeasible under current privacy regulations; this constraint is precisely what motivates the synthetic approach described in the paper. We believe these additions will clarify the evidential basis without overstating current validation.

revision: partial
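One post-hoc check of the kind the rebuttal mentions could count how often a recorded fact changes between visits in a synthetic history. The sketch below assumes each visit has already been reduced to a dictionary of attribute values; that representation and the function name are hypothetical, and a flagged change is only a candidate for a contradictory update, since a legitimate clinical change looks identical at this level.

```python
from typing import Dict, List, Tuple


def find_value_changes(
    visits: List[Dict[str, str]],
) -> List[Tuple[int, str, str, str]]:
    """Return (visit_index, attribute, old_value, new_value) for every change.

    `visits` is a hypothetical structured view of a longitudinal history:
    one dict per visit, mapping an attribute name to its recorded value.
    """
    last_seen: Dict[str, str] = {}
    changes: List[Tuple[int, str, str, str]] = []
    for index, facts in enumerate(visits):
        for attribute, value in facts.items():
            # A change is flagged when an attribute seen earlier takes a new value.
            if attribute in last_seen and last_seen[attribute] != value:
                changes.append((index, attribute, last_seen[attribute], value))
            last_seen[attribute] = value
    return changes


# Toy history for demonstration only.
toy_visits = [
    {"allergy": "penicillin"},
    {"allergy": "none", "smoker": "no"},
    {"smoker": "no"},
]
flagged = find_value_changes(toy_visits)
```

A dataset in which this count is near zero across all patients would support the referee's worry that the synthetic histories are smoother than real ones.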

Circularity Check

0 steps flagged

No significant circularity; new benchmark creation is self-contained

The paper introduces a novel synthesis framework and MediLongChat dataset via LLM-based decomposition into patient profiles, per-encounter dialogues, and longitudinal integration, then defines three explicit reasoning tasks and evaluates data quality with vector metrics (Faithfulness, Coherence, Diversity) plus LLM-as-judge (Correctness, Realism). Benchmark performance claims rest on direct task evaluation rather than any fitted parameter, self-referential definition, or reduction to prior self-citations. No equations, uniqueness theorems, or ansatzes are invoked that loop back to inputs by construction; the work creates independent artifacts and measures external LLM behavior on them, remaining self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim that LLMs struggle rests on the unverified quality of the synthetic data and the assumption that the chosen metrics and LLM judges accurately reflect real clinical reasoning demands.

axioms (1)

domain assumption: LLMs can be prompted to produce high-quality, faithful, and realistic long-term medical dialogues that generalize beyond the generation model itself.
Invoked in the knowledge-guided decomposition stage to justify using LLMs for profile construction and dialogue generation.

pith-pipeline@v0.9.0 · 5739 in / 1285 out tokens · 52077 ms · 2026-05-20T06:38:21.472562+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: knowledge-guided decomposition into three stages: constructing synthetic patient profiles... generating multi-turn dialogues per encounter, and integrating them into... MediLongChat

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: three benchmark tasks: In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

