
arxiv: 2605.22872 · v1 · pith:6JLXFR2Enew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CV

MedExpMem: Adapting Experience Memory for Differential Diagnosis

Pith reviewed 2026-05-25 06:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords differential diagnosis · experience memory · vision-language models · medical AI · radiology benchmark · diagnostic agents · pairwise notes

The pith

MedExpMem lets diagnostic vision-language models learn from their own mistakes by storing pairwise notes on how to tell similar conditions apart.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MedExpMem, a way for medical vision-language models to build expertise in differential diagnosis by learning from their own mistakes. Instead of relying on static knowledge, it creates memory entries from failed diagnoses, written as notes that compare pairs of similar conditions and give rules for telling them apart. These notes are retrieved during new diagnoses to guide the decision. This matters because it lets a model adapt through experience without retraining, with gains reported on a radiology benchmark spanning 11 subspecialties.

Core claim

MedExpMem is an experience memory framework that enables VLM-based diagnostic agents to accumulate differential diagnosis expertise. It memorizes discriminative experience from the agent's own diagnostic failures, organized as pairwise differential notes that encode key discriminators, actionable decision rules, and reasoning error patterns. When facing new cases, the agent retrieves relevant notes to guide reasoning. Evaluation on a radiology benchmark across 11 subspecialties shows consistent accuracy improvements, up to a maximum of 7.0%, across diverse models and scales.
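To make the three note components concrete, here is a minimal sketch of how one pairwise differential note and a store for such notes might be represented. The class names, field names, and merge policy are all hypothetical; the abstract names the components (key discriminators, decision rules, reasoning error patterns) but does not give a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class PairwiseNote:
    """Hypothetical schema for one pairwise differential note."""
    condition_a: str
    condition_b: str
    discriminators: List[str] = field(default_factory=list)  # key discriminators
    decision_rule: str = ""                                   # actionable rule
    error_patterns: List[str] = field(default_factory=list)  # reasoning errors seen

    @property
    def key(self) -> FrozenSet[str]:
        # Unordered pair: (A, B) and (B, A) map to the same note.
        return frozenset({self.condition_a.strip().lower(),
                          self.condition_b.strip().lower()})


def _extend_unique(target: List[str], items: List[str]) -> None:
    # Append items not already present, preserving insertion order.
    for item in items:
        if item not in target:
            target.append(item)


class ExperienceMemory:
    """Toy in-memory store keyed by unordered condition pairs (an assumption)."""

    def __init__(self) -> None:
        self._notes: Dict[FrozenSet[str], PairwiseNote] = {}

    def add(self, note: PairwiseNote) -> None:
        existing = self._notes.get(note.key)
        if existing is None:
            self._notes[note.key] = note
            return
        # Assumed merge policy: union the lists, let the newest rule win.
        _extend_unique(existing.discriminators, note.discriminators)
        _extend_unique(existing.error_patterns, note.error_patterns)
        if note.decision_rule:
            existing.decision_rule = note.decision_rule

    def lookup(self, a: str, b: str) -> Optional[PairwiseNote]:
        return self._notes.get(frozenset({a.strip().lower(), b.strip().lower()}))

    def __len__(self) -> int:
        return len(self._notes)
```

Keying on an unordered pair means that a failure where A was mistaken for B and one where B was mistaken for A both refine the same note, which is one plausible reading of "pairwise".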

What carries the argument

Pairwise differential notes that capture distinctions between confusable conditions derived from past diagnostic failures.
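One way to see why the pairwise organization carries the argument is to sketch retrieval over it: given the agent's candidate diagnoses for a new case, look up a note for each candidate pair. This is only an illustrative exact-match lookup, not the hybrid retrieval the paper's Figure 1 mentions; the function name, the `max_notes` parameter, and the dictionary layout are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def retrieve_notes(candidates: List[str],
                   memory: Dict[FrozenSet[str], str],
                   max_notes: int = 3) -> List[str]:
    """Return stored notes for each pair of candidate diagnoses, in pair order."""
    # Normalize and de-duplicate candidates while keeping their ranking order.
    normalized: List[str] = []
    for name in candidates:
        name = name.strip().lower()
        if name and name not in normalized:
            normalized.append(name)

    hits: List[str] = []
    for a, b in combinations(normalized, 2):
        if len(hits) >= max_notes:
            break
        note = memory.get(frozenset({a, b}))
        if note is not None:
            hits.append(note)
    return hits
```

A note is returned only when both members of a confusable pair appear among the candidates, so the retrieved text is always about a distinction the agent is currently facing.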

If this is right

  • Diagnostic agents achieve higher accuracy on radiology tasks without changing model parameters.
  • The method works across different VLM scales and architectures.
  • Experience is built in two phases: initial diagnosis to find gaps, then reflective re-diagnosis to refine notes.
  • It outperforms standard retrieval-augmented generation that uses static disease descriptions.
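The two-phase construction in the list above can be sketched as a simple loop. This is a hedged illustration under strong assumptions: `diagnose` and `write_note` are hypothetical callables standing in for the VLM agent and its note-writing step, and cases are reduced to an identifier plus a ground-truth label.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Case = Tuple[str, str]                    # (case_id, ground_truth_label)
Notes = Dict[FrozenSet[str], List[str]]   # confused pair -> note texts


def build_memory(cases: List[Case],
                 diagnose: Callable[[str, Notes], str],
                 write_note: Callable[[str, str, str], str]) -> Notes:
    """Two passes over the construction cases; failures become pairwise notes."""
    notes: Notes = {}

    # Phase I: zero-shot diagnosis with an empty memory exposes blind spots.
    for case_id, truth in cases:
        predicted = diagnose(case_id, {})
        if predicted != truth:
            pair = frozenset({predicted, truth})
            notes.setdefault(pair, []).append(write_note(case_id, predicted, truth))

    # Phase II: re-diagnose with memory access; remaining failures refine notes.
    for case_id, truth in cases:
        predicted = diagnose(case_id, notes)
        if predicted != truth:
            pair = frozenset({predicted, truth})
            notes.setdefault(pair, []).append(write_note(case_id, predicted, truth))

    return notes
```

In this sketch the ground truth is used only to detect a failure and to name the confused pair, never as input to `diagnose`, which keeps the second pass from leaking labels into the agent.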

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar memory structures might help in other high-stakes classification tasks where distinguishing confusable items is key.
  • Testing the framework on non-radiology medical data or different modalities would check its broader applicability.
  • The method implies that agent performance can improve iteratively without parameter updates if failure data is structured effectively.

Load-bearing premise

That experience from the agent's diagnostic failures can be reliably organized into pairwise notes that encode transferable discriminators and decision rules.

What would settle it

If accuracy on the radiology benchmark shows no improvement when experience memory retrieval is enabled compared to a no-memory baseline.
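A minimal sketch of how such a comparison could be scored, assuming per-case correctness is recorded for the same cases with and without memory. It reports the accuracy difference and an exact two-sided sign test on the discordant cases (the exact form of McNemar's test). The function name is hypothetical, and nothing here is taken from the paper's own evaluation protocol.

```python
from math import comb
from typing import List, Tuple


def paired_comparison(baseline_correct: List[bool],
                      memory_correct: List[bool]) -> Tuple[float, float]:
    """Return (accuracy delta, exact two-sided p-value) for paired outcomes."""
    if len(baseline_correct) != len(memory_correct) or not baseline_correct:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(baseline_correct, memory_correct))
    gained = sum(1 for base, mem in pairs if not base and mem)  # fixed by memory
    lost = sum(1 for base, mem in pairs if base and not mem)    # broken by memory
    delta = (gained - lost) / len(pairs)

    discordant = gained + lost
    if discordant == 0:
        return delta, 1.0
    # Under the null, each discordant case is a fair coin flip.
    tail = sum(comb(discordant, i) for i in range(min(gained, lost) + 1))
    p_value = min(1.0, 2 * tail / 2 ** discordant)
    return delta, p_value
```

A delta near zero with a large p-value on the held-out benchmark would be the "no improvement" outcome described above; concordant cases carry no information about the difference, so only the discordant ones enter the test.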

Figures

Figures reproduced from arXiv: 2605.22872 by Qianhan Feng, Qi Dou, Winnie Chiu Wing Chu, Xiaofan Zhang, Yakun Zhu, Yannian Gu, Zhongzhen Huang.

Figure 1: Overview of the MedExpMem framework. (a) Phase I: Zero-Shot Blind-Spot Discovery. The agent conducts zero-shot diagnosis. (b) Phase II: Reflective Refinement. The agent re-diagnoses cases with experience memory access. (c) Test-Time Inference. Agent performs experience-memory-augmented reasoning with hybrid-retrieval. discriminators capturing distinguishing features, decision rule providing actionable co…
Figure 2: Case study comparing diagnosis with and without experience memory. The retrieved pairwise note provides actionable discriminators that guide correct diagnosis. due to fewer prior errors, whereas smaller models sometimes fail to identify optimal retrieval paths. Cases with retrieved notes are typically more challenging, yet experience memory elevates their accuracy toward baseline levels. Although memory c…
Original abstract

Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes them as pairwise differential notes encoding key discriminators, actionable decision rules and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements, maximum 7.0%, across diverse models and scales. Analytical experiments validate experience quality and robustness, demonstrating MedExpMem as a competitive method addresses medical adaptation needs beyond the reach of parameteric learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that MedExpMem enables VLM-based diagnostic agents to accumulate differential diagnosis expertise by memorizing discriminative experience from diagnostic failures as pairwise differential notes, yielding consistent accuracy improvements with a maximum of 7.0% across diverse models and scales on a radiology benchmark spanning 11 subspecialties.

Significance. If the results hold, the framework offers a non-parametric approach to adapting medical VLMs via failure-derived experience memory, addressing a gap in static knowledge encoding that could support more robust differential reasoning in clinical AI.

major comments (2)
  1. [Abstract] Abstract: the accuracy improvement claim (maximum 7.0%) supplies no information on baselines, statistical testing, dataset splits, or controls for confounding factors such as prompt engineering or retrieval quality, so the data cannot be verified to support the claim as stated.
  2. [Method] Method description: the central assumption that pairwise differential notes derived from the agent's own failures encode usable key discriminators, actionable decision rules, and reasoning error patterns is load-bearing for the claimed gains, yet the two-phase construction process provides no explicit validation or ablation showing these notes measurably improve reasoning on held-out cases.
minor comments (1)
  1. [Abstract] Typo: 'parameteric learning' should read 'parametric learning'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires additional context on experimental details and that the validation of the pairwise differential notes can be strengthened with more targeted ablations. We will revise the manuscript accordingly while preserving the core contributions.

  1. Referee: [Abstract] Abstract: the accuracy improvement claim (maximum 7.0%) supplies no information on baselines, statistical testing, dataset splits, or controls for confounding factors such as prompt engineering or retrieval quality, so the data cannot be verified to support the claim as stated.

    Authors: We agree that the abstract should be self-contained. In the revised version we will expand the abstract to specify the baselines (standard RAG, zero-shot, and fine-tuned VLMs), the statistical testing performed (paired t-tests with p-values), the dataset construction and splits on the 11-subspecialty radiology benchmark, and controls for prompt engineering and retrieval quality. These details already appear in Sections 4 and 5 of the full manuscript; the revision will simply surface them in the abstract. revision: yes

  2. Referee: [Method] Method description: the central assumption that pairwise differential notes derived from the agent's own failures encode usable key discriminators, actionable decision rules, and reasoning error patterns is load-bearing for the claimed gains, yet the two-phase construction process provides no explicit validation or ablation showing these notes measurably improve reasoning on held-out cases.

    Authors: The manuscript already reports analytical experiments that validate experience quality and robustness, including retrieval ablations that compare performance with and without the learned pairwise notes. Nevertheless, we acknowledge that a more direct, isolated ablation focused on held-out differential reasoning would make the load-bearing assumption clearer. We will add this explicit ablation study in the revision, quantifying accuracy gains attributable to the notes alone on held-out cases. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper describes a proposed framework (MedExpMem) consisting of a two-phase construction process that builds pairwise differential notes from an agent's diagnostic failures, followed by retrieval for new cases. No equations, fitted parameters, predictions of derived quantities, or self-citations appear in the abstract or described method. The central claim rests on empirical accuracy gains (up to 7%) on a held-out radiology benchmark rather than any mathematical reduction or self-referential definition. The construction process is presented as an explicit engineering choice mirroring physician learning, with no load-bearing step that reduces to its own inputs by construction. This is the expected honest non-finding for a methods paper without quantitative derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the premise that failure-derived notes can be structured to improve future reasoning; this premise is introduced without independent evidence or formal justification in the abstract.

axioms (1)
  • domain assumption: The two-phase construction process (initial practice followed by reflective re-diagnosis) mirrors physician learning and produces useful experience memory.
    Explicitly stated in the abstract as the adopted framework.
invented entities (2)
  • experience memory (no independent evidence)
    purpose: Accumulate differential diagnosis expertise from the agent's own diagnostic failures.
    Core new component of the proposed framework.
  • pairwise differential notes (no independent evidence)
    purpose: Encode key discriminators, actionable decision rules, and reasoning error patterns for retrieval during new cases.
    Specific data structure introduced to organize memorized experience.


