SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

James Cheng; Ruiying Chen; Yongfeng Huang

arxiv: 2605.17101 · v2 · pith:UJ74U3QCnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

Yongfeng Huang , Ruiying Chen , James Cheng This is my paper

Pith reviewed 2026-05-20 15:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords retrieval-augmented generationmulti-agent frameworkmedical question answeringclinical reasoningself-evolving retrievalevidence chainsinterpreter agentexplorer agent

0 comments

The pith

Decoupling medical reasoning into three specialist agents improves RAG performance by 6.46 accuracy points on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that standard single-round retrieval-augmented generation misaligns with the multi-stage nature of clinical reasoning, leading to weak evidence chains. It proposes that these issues arise because one model must simultaneously interpret the question, gather evidence iteratively, and adjudicate the results. By creating separate agents for each role, the system can perform dynamic, sufficiency-driven retrieval that builds more reliable support for answers. If this holds, AI-assisted medical question answering would become more accurate and trustworthy across different language models and datasets. Readers interested in practical AI tools for healthcare would care because better evidence synthesis directly affects the quality of generated medical responses.

Core claim

The authors establish that the structural deficiencies in single-round RAG for medical reasoning stem from overloading a single reasoning chain with the heterogeneous tasks of interpretation, exploration, and adjudication. Reconstructing the workflow through task decoupling and dynamic multi-round exploration via three specialist agents—the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection—produces more reliable evidence chains. This yields consistent improvements, raising accuracy by an average of 6.46 points over the strongest baseline across five 5 5

What carries the argument

The Self-Evolving Multi-Agent RAG framework that assigns interpretation, exploration, and adjudication roles to three distinct specialist agents to enable iterative evidence gathering and final judgment.

Load-bearing premise

The main problems with single-round RAG in medicine come from cramming interpretation, exploration, and judgment into one reasoning chain, and splitting these into three agents will reliably create better evidence chains.

What would settle it

An experiment showing that a carefully prompted single-agent system achieves comparable accuracy to the three-agent version on the same medical benchmarks would challenge the central claim.

Figures

Figures reproduced from arXiv: 2605.17101 by James Cheng, Ruiying Chen, Yongfeng Huang.

**Figure 2.** Figure 2: Overview of SEMA-RAG: (i) I-Agent structures the input question Q into a clinical schema tuple Q′ for retrieval; (ii) E-Agent conducts sufficiency-driven self-evolving multi-round retrieval to obtain a converged evidence set C ∗ ; (iii) A-Agent adjudicates evidence into a traceable report R and selects the final answer grounded in R. 3.1 I-Agent as a Question Interpreter I-Agent does not merely rephrase th… view at source ↗

**Figure 3.** Figure 3: Impact of max iterations Tmax (fix m = 3) on MedQA-US (deepseek-v3.1). 4.4 Further Analysis 4.4.1 Synergy of the Multi-Agent Architecture [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Prompt template for the I-Agent clinical schema interpreter. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt template for the E-Agent self-evolving explorer. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt template for the A-Agent evidence adjudicator. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt template for the A-Agent evidence adjudicator. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt template for the A-Agent evidence-grounded answerer. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SEMA-RAG splits medical RAG into Interpreter, self-evolving Explorer, and Arbiter agents and reports a 6.46-point average accuracy lift across five benchmarks and five backbones.

read the letter

The main thing to know is that this paper takes the standard single-round RAG pipeline for medical QA and breaks it into three specialist agents: one that handles clinical schema interpretation, one that runs iterative self-evolving retrieval based on sufficiency checks, and one that adjudicates the evidence and picks the answer. They show an average gain of 6.46 accuracy points over the strongest baseline, measured separately for each of five LLM backbones on five benchmarks. That is the concrete result on offer. The workflow itself is the clearest new piece. Prior multi-agent RAG work exists, but the explicit mapping to interpretation, sufficiency-driven exploration, and adjudication plus the self-evolving loop gives a practical recipe that matches how clinicians actually build evidence chains. The multi-backbone testing is also useful because it reduces the chance that gains are an artifact of one model family. The motivation section does a clean job of naming the two deficiencies (weak semantic translation and lack of iterative feedback) and tracing them to task overload in one chain. On the soft spots, the abstract gives no numbers on run variance, statistical tests, or how much prompt engineering went into the baselines. If the full paper shows those controls and includes ablations that isolate the contribution of each agent, the claim strengthens; without them the lift could partly reflect extra retrieval rounds rather than the agent split. The central assumption that decoupling reliably improves evidence quality is reasonable on its face but would benefit from error analysis showing where single-agent versions still fail. This paper is for people building or evaluating RAG systems in medicine or other high-stakes domains. Anyone already running multi-agent setups will get a concrete workflow to compare against. It deserves a serious referee because the empirical scope is broad enough for reviewers to check the controls and ablations directly, and the problem it targets is real even if the evaluation needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering. It identifies two structural deficiencies in single-round static RAG—insufficient clinical semantic interpretation in query translation and lack of iterative sufficiency feedback for evidence chains—and attributes both to overloading a single reasoning chain with heterogeneous tasks. The proposed remedy decouples these into three specialist agents (Interpreter for clinical schema interpretation, Explorer for sufficiency-driven self-evolving retrieval, and Arbiter for evidence adjudication). Across five benchmarks and five LLM backbones, the framework is reported to improve the strongest baseline by an average of +6.46 accuracy points measured per backbone.

Significance. If the reported gains hold under rigorous controls, the task-decoupling approach could advance RAG systems for multi-stage clinical reasoning by enabling specialized, dynamic workflows. The multi-backbone, multi-benchmark evaluation provides a reasonable test of generalizability. Credit is given for the concrete empirical claim and for explicitly linking the proposed architecture to the identified workflow misalignments.

major comments (2)

The central empirical claim of a +6.46 average accuracy lift is load-bearing for the paper's contribution, yet the provided abstract (and by extension the evaluation summary) supplies no information on baseline definitions, statistical significance, run-to-run variance, or controls for prompt-engineering effects. This directly affects whether the data support the assertion that the multi-agent structure, rather than other factors, drives the improvement.
The weakest assumption—that structural deficiencies arise primarily from single-chain overload and that assigning interpretation/exploration/adjudication to three distinct agents will reliably yield better evidence chains—is not isolated via ablation studies that compare the full three-agent system against controlled variants (e.g., two-agent or single-agent with equivalent retrieval rounds). Without such targeted tests, the causal link between the proposed remedy and the observed gains remains under-supported.

minor comments (2)

Notation for agent roles and the self-evolving retrieval loop should be introduced with explicit definitions or pseudocode in the framework section to improve clarity for readers unfamiliar with multi-agent RAG variants.
Figure captions and table headers would benefit from explicit statements of what is being averaged (per-backbone vs. per-benchmark) to avoid ambiguity when interpreting the +6.46 figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the empirical presentation and causal support for our claims.

read point-by-point responses

Referee: The central empirical claim of a +6.46 average accuracy lift is load-bearing for the paper's contribution, yet the provided abstract (and by extension the evaluation summary) supplies no information on baseline definitions, statistical significance, run-to-run variance, or controls for prompt-engineering effects. This directly affects whether the data support the assertion that the multi-agent structure, rather than other factors, drives the improvement.

Authors: We acknowledge that the abstract and high-level evaluation summary omit these details, which limits immediate assessment of the claim. The full experimental section (Section 4) defines baselines as the strongest single-round RAG and iterative variants using identical LLM backbones and prompt templates across all methods; results are averaged over three runs with different random seeds, and paired t-tests are used for significance. To directly address the concern, we will revise the abstract to briefly note these controls and add a short subsection on statistical analysis and prompt-engineering controls in the evaluation summary. This will clarify that the reported gains are measured under matched conditions. revision: yes
Referee: The weakest assumption—that structural deficiencies arise primarily from single-chain overload and that assigning interpretation/exploration/adjudication to three distinct agents will reliably yield better evidence chains—is not isolated via ablation studies that compare the full three-agent system against controlled variants (e.g., two-agent or single-agent with equivalent retrieval rounds). Without such targeted tests, the causal link between the proposed remedy and the observed gains remains under-supported.

Authors: This observation is correct; our current experiments compare SEMA-RAG only against external baselines rather than internal variants that hold retrieval rounds constant while varying the number of agents. We will add targeted ablation studies in the revised manuscript, including a two-agent configuration (Interpreter+Explorer merged, with Arbiter removed) and a single-agent multi-round retrieval setup with equivalent total retrieval steps and compute budget. These results will be reported alongside the main experiments to better isolate the contribution of the three-agent decoupling. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a multi-agent RAG framework motivated by identified structural deficiencies in single-round retrieval for medical reasoning and validates it through empirical accuracy gains across five benchmarks and five LLM backbones. No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The central improvement claim rests on external baseline comparisons rather than any reduction of results to quantities defined internally by the method itself, rendering the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no equations, parameters, or detailed methods are available to audit.

axioms (1)

domain assumption Overloading a single reasoning chain with interpretation, exploration, and adjudication causes the observed deficiencies in medical RAG.
Explicitly stated as the deeper cause in the abstract.

pith-pipeline@v0.9.0 · 5728 in / 1272 out tokens · 60279 ms · 2026-05-20T15:14:43.957373+00:00 · methodology

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

Core claim

What carries the argument

Load-bearing premise

What would settle it

discussion (0)