MedExpMem: Adapting Experience Memory for Differential Diagnosis
Pith reviewed 2026-05-25 06:15 UTC · model grok-4.3
The pith
MedExpMem lets diagnostic vision-language models learn from their own mistakes by storing pairwise notes on how to tell similar conditions apart.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedExpMem is an experience memory framework that enables VLM-based diagnostic agents to accumulate differential diagnosis expertise. It memorizes discriminative experience from the agent's own diagnostic failures, organized as pairwise differential notes that encode key discriminators, actionable decision rules, and reasoning error patterns. When facing new cases, the agent retrieves relevant notes to guide its reasoning. Evaluation on a radiology benchmark spanning 11 subspecialties shows consistent accuracy improvements of up to 7.0% across diverse models and scales.
What carries the argument
Pairwise differential notes that capture distinctions between confusable conditions derived from past diagnostic failures.
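To make the idea of a pairwise differential note concrete, here is a minimal sketch of one possible data layout. The three content fields follow the categories named in the abstract; the class names, the order-independent pair key, and the exact-match retrieval rule are illustrative assumptions, not the paper's actual schema or retrieval method.

```python
from dataclasses import dataclass, field


@dataclass
class DifferentialNote:
    """One note about two confusable conditions (hypothetical schema)."""
    condition_a: str
    condition_b: str
    key_discriminators: list[str] = field(default_factory=list)
    decision_rules: list[str] = field(default_factory=list)
    error_patterns: list[str] = field(default_factory=list)


def pair_key(a: str, b: str) -> frozenset[str]:
    """Order-independent key, so (A, B) and (B, A) address the same note."""
    return frozenset({a.strip().lower(), b.strip().lower()})


class ExperienceMemory:
    """Assumed store: at most one note per unordered pair of conditions."""

    def __init__(self) -> None:
        self._notes: dict[frozenset[str], DifferentialNote] = {}

    def add(self, note: DifferentialNote) -> None:
        self._notes[pair_key(note.condition_a, note.condition_b)] = note

    def retrieve(self, candidates: list[str]) -> list[DifferentialNote]:
        """Return notes whose two conditions both appear among the candidates."""
        wanted = {c.strip().lower() for c in candidates}
        return [note for key, note in self._notes.items() if key <= wanted]
```

Keying on an unordered pair reflects that a distinction between two conditions is symmetric. A real system would more plausibly retrieve by embedding similarity than by exact name match; that choice is left open here.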
If this is right
- Diagnostic agents achieve higher accuracy on radiology tasks without changing model parameters.
- The method works across different VLM scales and architectures.
- Experience is built in two phases: initial diagnosis to find gaps, then reflective re-diagnosis to refine notes.
- It outperforms standard retrieval-augmented generation that uses static disease descriptions.
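The two-phase construction mentioned in the list above can be sketched as a loop over labeled cases. This is a reading of the abstract under stated assumptions: `diagnose` and `reflect` are hypothetical callables standing in for the VLM agent, cases are plain dicts with a `"label"` key, and notes are retrieved in phase 2 by the first-pass wrong prediction. The paper's actual procedure may differ.

```python
def build_memory(cases, diagnose, reflect):
    """Build pairwise notes from failures in two passes (illustrative only).

    diagnose(case, notes) -> predicted label   (hypothetical agent call)
    reflect(case, wrong_prediction) -> note text (hypothetical agent call)
    """
    memory = {}    # frozenset({predicted, truth}) -> list of note strings
    failures = []  # (case, first-pass prediction) for every phase-1 miss

    # Phase 1: initial practice without memory exposes knowledge gaps.
    for case in cases:
        pred = diagnose(case, [])
        if pred != case["label"]:
            key = frozenset({pred, case["label"]})
            memory.setdefault(key, []).append(reflect(case, pred))
            failures.append((case, pred))

    # Phase 2: reflective re-diagnosis of failed cases with notes in context.
    for case, first_pred in failures:
        notes = [n for key, ns in memory.items() if first_pred in key for n in ns]
        pred = diagnose(case, notes)
        if pred != case["label"]:
            # Still wrong: record a further note for the newly confused pair.
            key = frozenset({pred, case["label"]})
            memory.setdefault(key, []).append(reflect(case, pred))

    return memory
```

No model parameters are touched anywhere in the loop; everything the agent learns lives in the `memory` dictionary.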
Where Pith is reading between the lines
- Similar memory structures might help in other high-stakes classification tasks where distinguishing confusable items is key.
- Testing the framework on non-radiology medical data or different modalities would check its broader applicability.
- The method implies that agent performance can improve iteratively without parameter updates if failure data is structured effectively.
Load-bearing premise
That experience from the agent's diagnostic failures can be reliably organized into pairwise notes that encode transferable discriminators and decision rules.
What would settle it
The claim would be undermined if accuracy on the radiology benchmark showed no improvement with experience memory retrieval enabled, compared with a no-memory baseline on the same cases.
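One way such a check could be run is a paired comparison on per-case correctness. The sketch below computes the accuracy difference and an exact two-sided McNemar test on the cases where the two runs disagree. The function name and input format are assumptions, and the paper is not stated to use this particular test.

```python
from math import comb


def compare_paired(baseline_correct, memory_correct):
    """Compare two runs on the same cases.

    Inputs are equal-length lists of booleans (True = case diagnosed correctly).
    Returns (accuracy difference, exact two-sided McNemar p-value).
    """
    if len(baseline_correct) != len(memory_correct) or not baseline_correct:
        raise ValueError("need two non-empty lists of equal length")

    n = len(baseline_correct)
    pairs = list(zip(baseline_correct, memory_correct))
    gained = sum(1 for b, m in pairs if m and not b)  # fixed by memory
    lost = sum(1 for b, m in pairs if b and not m)    # broken by memory
    delta = (sum(memory_correct) - sum(baseline_correct)) / n

    # Under the null hypothesis, discordant cases split 50/50 between
    # "gained" and "lost", so the smaller count follows Binomial(d, 0.5).
    d = gained + lost
    if d == 0:
        return delta, 1.0
    tail = sum(comb(d, k) for k in range(min(gained, lost) + 1)) / 2 ** d
    return delta, min(1.0, 2 * tail)
```

A difference near zero with a large p-value would count against the core claim; a positive difference alone is not enough without the paired test.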
Original abstract
Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes them as pairwise differential notes encoding key discriminators, actionable decision rules and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements, maximum 7.0%, across diverse models and scales. Analytical experiments validate experience quality and robustness, demonstrating MedExpMem as a competitive method addresses medical adaptation needs beyond the reach of parameteric learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that MedExpMem enables VLM-based diagnostic agents to accumulate differential diagnosis expertise by memorizing discriminative experience from diagnostic failures as pairwise differential notes, yielding consistent accuracy improvements with a maximum of 7.0% across diverse models and scales on a radiology benchmark spanning 11 subspecialties.
Significance. If the results hold, the framework offers a non-parametric approach to adapting medical VLMs via failure-derived experience memory, addressing a gap in static knowledge encoding that could support more robust differential reasoning in clinical AI.
major comments (2)
- [Abstract] Abstract: the accuracy improvement claim (maximum 7.0%) supplies no information on baselines, statistical testing, dataset splits, or controls for confounding factors such as prompt engineering or retrieval quality, so the data cannot be verified to support the claim as stated.
- [Method] Method description: the central assumption that pairwise differential notes derived from the agent's own failures encode usable key discriminators, actionable decision rules, and reasoning error patterns is load-bearing for the claimed gains, yet the two-phase construction process provides no explicit validation or ablation showing these notes measurably improve reasoning on held-out cases.
minor comments (1)
- [Abstract] Typo: 'parameteric learning' should read 'parametric learning'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract requires additional context on experimental details and that the validation of the pairwise differential notes can be strengthened with more targeted ablations. We will revise the manuscript accordingly while preserving the core contributions.
- Referee: [Abstract] Abstract: the accuracy improvement claim (maximum 7.0%) supplies no information on baselines, statistical testing, dataset splits, or controls for confounding factors such as prompt engineering or retrieval quality, so the data cannot be verified to support the claim as stated.
  Authors: We agree that the abstract should be self-contained. In the revised version we will expand the abstract to specify the baselines (standard RAG, zero-shot, and fine-tuned VLMs), the statistical testing performed (paired t-tests with p-values), the dataset construction and splits on the 11-subspecialty radiology benchmark, and controls for prompt engineering and retrieval quality. These details already appear in Sections 4 and 5 of the full manuscript; the revision will simply surface them in the abstract.
  Revision: yes
- Referee: [Method] Method description: the central assumption that pairwise differential notes derived from the agent's own failures encode usable key discriminators, actionable decision rules, and reasoning error patterns is load-bearing for the claimed gains, yet the two-phase construction process provides no explicit validation or ablation showing these notes measurably improve reasoning on held-out cases.
  Authors: The manuscript already reports analytical experiments that validate experience quality and robustness, including retrieval ablations that compare performance with and without the learned pairwise notes. Nevertheless, we acknowledge that a more direct, isolated ablation focused on held-out differential reasoning would make the load-bearing assumption clearer. We will add this explicit ablation study in the revision, quantifying accuracy gains attributable to the notes alone on held-out cases.
  Revision: yes
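The rebuttal mentions paired t-tests. For readers unfamiliar with the test, a minimal sketch of the statistic follows, assuming one accuracy value per model (or per subspecialty) for the baseline and for the method. The function name and inputs are hypothetical, and the p-value lookup is omitted because it needs a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline_acc, method_acc):
    """Return (t statistic, degrees of freedom) for paired accuracy values."""
    if len(baseline_acc) != len(method_acc) or len(baseline_acc) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")

    # The test works on per-pair differences, not on the two raw samples.
    diffs = [m - b for b, m in zip(baseline_acc, method_acc)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")

    n = len(diffs)
    return mean(diffs) / (spread / sqrt(n)), n - 1
```

Pairing matters because the same models are evaluated with and without memory; an unpaired test would ignore that structure and lose power.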
Circularity Check
No significant circularity in derivation chain
The paper describes a proposed framework (MedExpMem) consisting of a two-phase construction process that builds pairwise differential notes from an agent's diagnostic failures, followed by retrieval for new cases. No equations, fitted parameters, predictions of derived quantities, or self-citations appear in the abstract or described method. The central claim rests on empirical accuracy gains (up to 7%) on a held-out radiology benchmark rather than any mathematical reduction or self-referential definition. The construction process is presented as an explicit engineering choice mirroring physician learning, with no load-bearing step that reduces to its own inputs by construction. This is the expected honest non-finding for a methods paper without quantitative derivations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The two-phase construction process (initial practice followed by reflective re-diagnosis) mirrors physician learning and produces useful experience memory.
invented entities (2)
- experience memory: no independent evidence
- pairwise differential notes: no independent evidence