MedExpMem: Adapting Experience Memory for Differential Diagnosis

Qianhan Feng; Qi Dou; Winnie Chiu Wing Chu; Xiaofan Zhang; Yakun Zhu; Yannian Gu; Zhongzhen Huang

arXiv: 2605.22872 · v1 · pith:6JLXFR2Enew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CV

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Yannian Gu, Winnie Chiu Wing Chu, Xiaofan Zhang, Qi Dou

Pith reviewed 2026-05-25 06:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords differential diagnosis · experience memory · vision-language models · medical AI · radiology benchmark · diagnostic agents · pairwise notes


The pith

MedExpMem lets diagnostic vision-language models learn from their own mistakes by storing pairwise notes on how to tell similar conditions apart.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MedExpMem to let medical vision-language models develop expertise in differential diagnosis by learning from their mistakes. Instead of relying on static knowledge, it creates memory entries from failed diagnoses, formatted as notes that compare pairs of similar conditions and give rules for distinguishing them. These notes are retrieved during new diagnoses to improve decisions. This matters because it gives a diagnostic agent a way to adapt through experience: on a radiology benchmark spanning 11 subspecialties, the paper reports accuracy gains without retraining the model.

Core claim

MedExpMem is an experience memory framework that enables VLM-based diagnostic agents to accumulate differential diagnosis expertise. It memorizes discriminative experience from the agent's own diagnostic failures, organized as pairwise differential notes that encode key discriminators, actionable decision rules, and reasoning error patterns. When facing new cases, the agent retrieves relevant notes to guide reasoning. Evaluation on a radiology benchmark across 11 subspecialties shows consistent accuracy improvements, with a maximum of 7.0% across diverse models and scales.

What carries the argument

Pairwise differential notes that capture distinctions between confusable conditions derived from past diagnostic failures.
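A pairwise differential note can be pictured as a small record keyed by an unordered pair of conditions. The sketch below is only an illustration of that idea: the class names `PairwiseNote` and `NoteStore`, the field names, and the placeholder conditions are all assumptions, not the schema used in the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class PairwiseNote:
    """One note about a pair of confusable conditions (hypothetical schema)."""
    condition_a: str
    condition_b: str
    discriminators: List[str] = field(default_factory=list)  # features separating the pair
    decision_rule: str = ""                                   # actionable guidance
    error_patterns: List[str] = field(default_factory=list)  # past reasoning mistakes

    def key(self) -> FrozenSet[str]:
        # The pair is unordered, so (A, B) and (B, A) map to the same note.
        return frozenset({self.condition_a.lower(), self.condition_b.lower()})


class NoteStore:
    """In-memory store indexed by unordered condition pair (illustrative only)."""

    def __init__(self) -> None:
        self._notes: Dict[FrozenSet[str], PairwiseNote] = {}

    def add(self, note: PairwiseNote) -> None:
        self._notes[note.key()] = note

    def lookup(self, a: str, b: str) -> Optional[PairwiseNote]:
        return self._notes.get(frozenset({a.lower(), b.lower()}))


store = NoteStore()
store.add(PairwiseNote(
    condition_a="Condition A",
    condition_b="Condition B",
    discriminators=["placeholder feature seen in A but not in B"],
    decision_rule="If the placeholder feature is present, favor A.",
    error_patterns=["anchoring on the more common condition"],
))
note = store.lookup("condition b", "Condition A")  # order and case do not matter
```

Keying on a `frozenset` is one simple way to make the lookup symmetric; a real system would also need a policy for merging several failures on the same pair.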

If this is right

Diagnostic agents achieve higher accuracy on radiology tasks without changing model parameters.
The method works across different VLM scales and architectures.
Experience is built in two phases: initial diagnosis to find gaps, then reflective re-diagnosis to refine notes.
It outperforms standard retrieval-augmented generation that uses static disease descriptions.
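The two-phase construction described above can be sketched as a pair of passes over labeled cases. This is a minimal sketch under stated assumptions: `build_memory`, `toy_diagnose`, and `toy_write_note` are hypothetical names, the toy agent stands in for a VLM call, and the rule that a note is rewritten whenever a case still fails in the second pass is a guess at what "refinement" means, not the paper's procedure.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Memory = Dict[FrozenSet[str], str]  # unordered condition pair -> note text


def build_memory(
    cases: List[Tuple[str, str]],
    diagnose: Callable[[str, Memory], str],
    write_note: Callable[[str, str, str, Optional[str]], str],
) -> Memory:
    memory: Memory = {}
    # Phase I: diagnose without memory; every failure seeds a note for the confused pair.
    for case_id, truth in cases:
        predicted = diagnose(case_id, {})
        if predicted != truth:
            pair = frozenset({predicted, truth})
            memory[pair] = write_note(case_id, predicted, truth, memory.get(pair))
    # Phase II: re-diagnose with memory access; remaining failures refine the note.
    for case_id, truth in cases:
        predicted = diagnose(case_id, memory)
        if predicted != truth:
            pair = frozenset({predicted, truth})
            memory[pair] = write_note(case_id, predicted, truth, memory.get(pair))
    return memory


def toy_diagnose(case_id: str, memory: Memory) -> str:
    # Stand-in for a VLM: always answers "A", except that "b-like" cases are
    # answered "B" once a note on the A/B pair is available.
    if "b-like" in case_id and frozenset({"A", "B"}) in memory:
        return "B"
    return "A"


def toy_write_note(case_id: str, predicted: str, truth: str, previous: Optional[str]) -> str:
    # Stand-in for note generation: appends a line describing the confusion.
    text = f"Confused {predicted} with {truth} on {case_id}."
    return text if previous is None else previous + " " + text


cases = [("case-1", "A"), ("case-2-b-like", "B"), ("case-3-c-like", "C")]
memory = build_memory(cases, toy_diagnose, toy_write_note)
```

In this toy run the A/B note fixes its case in the second pass, while the A/C case fails twice and its note is extended, which shows the two roles of the phases: exposing gaps, then refining what was written.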

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar memory structures might help in other high-stakes classification tasks where distinguishing confusable items is key.
Testing the framework on non-radiology medical data or different modalities would check its broader applicability.
The method implies that agent performance can improve iteratively without parameter updates if failure data is structured effectively.

Load-bearing premise

That experience from the agent's diagnostic failures can be reliably organized into pairwise notes that encode transferable discriminators and decision rules.

What would settle it

The claim would be undercut if accuracy on the radiology benchmark showed no improvement with experience memory retrieval enabled, compared with a no-memory baseline.
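The comparison that would settle the question is a plain accuracy difference between two runs over the same test cases. The sketch below uses made-up placeholder labels and predictions, not data from the paper, and the function name `accuracy` is an assumption.

```python
from typing import List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of cases where the predicted diagnosis matches the label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Placeholder data for illustration only.
labels = ["A", "B", "B", "C", "A"]
baseline_preds = ["A", "A", "B", "A", "A"]  # no-memory run
memory_preds = ["A", "B", "B", "A", "A"]    # run with memory retrieval enabled

baseline_acc = accuracy(baseline_preds, labels)
memory_acc = accuracy(memory_preds, labels)
gain = memory_acc - baseline_acc

# The premise is challenged when the gain is zero or negative.
memory_helps = gain > 0
```

Both runs must use the same cases, prompts, and decoding settings for the difference to be attributable to retrieval.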

Figures

Figures reproduced from arXiv: 2605.22872 by Qianhan Feng, Qi Dou, Winnie Chiu Wing Chu, Xiaofan Zhang, Yakun Zhu, Yannian Gu, Zhongzhen Huang.

**Figure 1.** Overview of the MedExpMem framework. (a) Phase I: Zero-Shot BlindSpot Discovery. The agent conducts zero-shot diagnosis. (b) Phase II: Reflective Refinement. The agent re-diagnoses cases with experience memory access. (c) Test-Time Inference. The agent performs experience-memory-augmented reasoning with hybrid retrieval.

**Figure 2.** Case study comparing diagnosis with and without experience memory. The retrieved pairwise note provides actionable discriminators that guide correct diagnosis.
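The Figure 1 caption mentions hybrid retrieval at test time but this page does not say which signals are combined. The sketch below is one plausible reading, stated as an assumption: a structured signal (both conditions of a note are among the agent's candidate diagnoses) blended with a lexical signal (word overlap between the case findings and the note text). The function `hybrid_retrieve`, the weight `alpha`, and the placeholder notes are all hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hybrid_retrieve(
    candidates: List[str],
    findings: str,
    notes: Dict[Tuple[str, str], str],
    top_k: int = 2,
    alpha: float = 0.5,
) -> List[Tuple[str, str]]:
    cand = {c.lower() for c in candidates}
    query = tokenize(findings)
    scored = []
    for (a, b), text in notes.items():
        # Structured signal: the note's pair is fully covered by the candidates.
        pair_hit = 1.0 if {a.lower(), b.lower()} <= cand else 0.0
        # Lexical signal: Jaccard overlap between findings and note text.
        note_tokens = tokenize(text)
        union = query | note_tokens
        lexical = len(query & note_tokens) / len(union) if union else 0.0
        score = alpha * pair_hit + (1.0 - alpha) * lexical
        if score > 0:
            scored.append((score, (a, b)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_k]]


# Placeholder notes for illustration only.
notes = {
    ("A", "B"): "ring enhancement favors A over B",
    ("A", "C"): "calcification favors C",
    ("D", "E"): "unrelated text",
}
retrieved = hybrid_retrieve(["A", "B"], "lesion with ring enhancement", notes)
```

A production system would more likely replace the Jaccard term with dense embeddings; the point here is only how two signals can be blended into one ranking.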

Original abstract

Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes them as pairwise differential notes encoding key discriminators, actionable decision rules and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements, maximum 7.0%, across diverse models and scales. Analytical experiments validate experience quality and robustness, demonstrating MedExpMem as a competitive method addresses medical adaptation needs beyond the reach of parameteric learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MedExpMem adds failure-derived pairwise notes to VLMs for differential diagnosis and reports up to 7% gains on a radiology benchmark, but the abstract supplies almost no experimental controls or baselines.


MedExpMem lets a VLM build an experience memory by turning its own diagnostic mistakes into short pairwise notes that highlight what distinguishes similar conditions and where the reasoning went wrong. The two-phase setup—initial run to expose gaps, then reflective update—directly targets the static-parameter problem the authors flag in current medical VLMs, and the distinction from standard RAG is clear in the framing. On the radiology benchmark across 11 subspecialties the method shows consistent lifts, maxing at 7%, across several model scales. That is the concrete result worth noting. The paper does not appear to introduce new math or large-scale data; the contribution sits in the memory construction and retrieval loop. The main soft spot is that the abstract gives no information on baselines, statistical tests, train-test splits, or controls for prompt quality and retrieval noise, so the 7% figure cannot be assessed for robustness yet. The central assumption—that the pairwise notes reliably encode usable discriminators—remains untested in the summary provided. This work is aimed at groups already running VLM diagnostic agents who want a lightweight adaptation layer beyond fine-tuning. It is coherent on its own terms and addresses a practical gap, so it deserves a serious referee even if the current evidence is preliminary and will likely need expansion on the experimental side.

Referee Report

2 major / 1 minor

Summary. The paper claims that MedExpMem enables VLM-based diagnostic agents to accumulate differential diagnosis expertise by memorizing discriminative experience from diagnostic failures as pairwise differential notes, yielding consistent accuracy improvements with a maximum of 7.0% across diverse models and scales on a radiology benchmark spanning 11 subspecialties.

Significance. If the results hold, the framework offers a non-parametric approach to adapting medical VLMs via failure-derived experience memory, addressing a gap in static knowledge encoding that could support more robust differential reasoning in clinical AI.

major comments (2)

[Abstract] The accuracy improvement claim (maximum 7.0%) supplies no information on baselines, statistical testing, dataset splits, or controls for confounding factors such as prompt engineering or retrieval quality, so the data cannot be verified to support the claim as stated.
[Method] The central assumption that pairwise differential notes derived from the agent's own failures encode usable key discriminators, actionable decision rules, and reasoning error patterns is load-bearing for the claimed gains, yet the two-phase construction process provides no explicit validation or ablation showing these notes measurably improve reasoning on held-out cases.

minor comments (1)

[Abstract] Typo: 'parameteric learning' should read 'parametric learning'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires additional context on experimental details and that the validation of the pairwise differential notes can be strengthened with more targeted ablations. We will revise the manuscript accordingly while preserving the core contributions.


Referee: [Abstract] The accuracy improvement claim (maximum 7.0%) supplies no information on baselines, statistical testing, dataset splits, or controls for confounding factors such as prompt engineering or retrieval quality, so the data cannot be verified to support the claim as stated.

Authors: We agree that the abstract should be self-contained. In the revised version we will expand the abstract to specify the baselines (standard RAG, zero-shot, and fine-tuned VLMs), the statistical testing performed (paired t-tests with p-values), the dataset construction and splits on the 11-subspecialty radiology benchmark, and controls for prompt engineering and retrieval quality. These details already appear in Sections 4 and 5 of the full manuscript; the revision will simply surface them in the abstract. (Revision: yes)

Referee: [Method] The central assumption that pairwise differential notes derived from the agent's own failures encode usable key discriminators, actionable decision rules, and reasoning error patterns is load-bearing for the claimed gains, yet the two-phase construction process provides no explicit validation or ablation showing these notes measurably improve reasoning on held-out cases.

Authors: The manuscript already reports analytical experiments that validate experience quality and robustness, including retrieval ablations that compare performance with and without the learned pairwise notes. Nevertheless, we acknowledge that a more direct, isolated ablation focused on held-out differential reasoning would make the load-bearing assumption clearer. We will add this explicit ablation study in the revision, quantifying accuracy gains attributable to the notes alone on held-out cases. (Revision: yes)
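The responses above mention paired t-tests and a with-notes versus without-notes ablation. As a rough illustration of how such a paired comparison is computed, the sketch below derives the t statistic from matched accuracy values using only the standard library. The choice of matched unit (for example, one value per subspecialty), the function name `paired_t_statistic`, and the numbers are assumptions for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_t_statistic(with_notes: List[float], without_notes: List[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for matched samples."""
    if len(with_notes) != len(without_notes) or len(with_notes) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [w - wo for w, wo in zip(with_notes, without_notes)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder accuracies for illustration only.
without_notes = [0.50, 0.55, 0.60, 0.65]
with_notes = [0.53, 0.57, 0.65, 0.67]
t_stat, dof = paired_t_statistic(with_notes, without_notes)
```

Turning the statistic into a p-value requires the t distribution, for instance via `scipy.stats.ttest_rel(with_notes, without_notes)`, which performs the same paired test in one call.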

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper describes a proposed framework (MedExpMem) consisting of a two-phase construction process that builds pairwise differential notes from an agent's diagnostic failures, followed by retrieval for new cases. No equations, fitted parameters, predictions of derived quantities, or self-citations appear in the abstract or described method. The central claim rests on empirical accuracy gains (up to 7%) on a held-out radiology benchmark rather than any mathematical reduction or self-referential definition. The construction process is presented as an explicit engineering choice mirroring physician learning, with no load-bearing step that reduces to its own inputs by construction. This is the expected honest non-finding for a methods paper without quantitative derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the premise that failure-derived notes can be structured to improve future reasoning; this premise is introduced without independent evidence or formal justification in the abstract.

axioms (1)

domain assumption: The two-phase construction process (initial practice followed by reflective re-diagnosis) mirrors physician learning and produces useful experience memory.
Explicitly stated in the abstract as the adopted framework.

invented entities (2)

experience memory · no independent evidence
purpose: Accumulate differential diagnosis expertise from the agent's own diagnostic failures.
Core new component of the proposed framework.

pairwise differential notes · no independent evidence
purpose: Encode key discriminators, actionable decision rules, and reasoning error patterns for retrieval during new cases.
Specific data structure introduced to organize memorized experience.

pith-pipeline@v0.9.0 · 5748 in / 1407 out tokens · 48914 ms · 2026-05-25T06:15:53.881578+00:00 · methodology


