MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3
The pith
A multi-agent system modeled on radiology roles generates CT reports with fewer clinical errors than single vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARCH emulates the professional hierarchy of radiology departments by assigning distinct agents to roles: a Resident Agent performs initial drafting with multi-scale CT feature extraction, multiple Fellow Agents conduct retrieval-augmented revision, and an Attending Agent orchestrates an iterative stance-based consensus process to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, this multi-agent approach significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy.
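The division of labor in this claim can be sketched as a minimal pipeline. Every function and data structure below is a hypothetical stand-in; the paper's actual prompts, models, and retrieval machinery are not reproduced here.

```python
# Hypothetical sketch of the MARCH role pipeline: Resident drafts,
# Fellows revise with retrieval, Attending checks for consensus.
from dataclasses import dataclass, field

@dataclass
class Draft:
    report: str
    revisions: list = field(default_factory=list)

def resident_draft(ct_features: str) -> Draft:
    # Stand-in for the Resident Agent's multi-scale drafting step.
    return Draft(report=f"Findings based on {ct_features}.")

def fellow_revise(draft: Draft, retrieved: list) -> Draft:
    # Stand-in for a Fellow Agent's retrieval-augmented revision.
    draft.revisions.append(f"checked against {len(retrieved)} retrieved reports")
    return draft

def attending_consensus(draft: Draft, stances: list):
    # Stand-in for the Attending Agent: consensus is reached when every
    # Fellow responds "agree" to the current synthesized report.
    reached = all(s == "agree" for s in stances)
    return draft, reached

draft = resident_draft("multi-scale CT features")
draft = fellow_revise(draft, ["prior report A", "prior report B"])
draft, reached = attending_consensus(draft, ["agree", "agree"])
```

In the full system the Fellow and Attending steps iterate until consensus; this sketch shows a single round only.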
What carries the argument
The Multi-Agent Radiology Clinical Hierarchy, which assigns role-specific agents and uses iterative stance-based consensus to resolve discrepancies between draft reports.
If this is right
- Report generation systems could move from monolithic models to role-structured agents for higher reliability in medical imaging.
- Iterative consensus mechanisms may become standard for handling diagnostic disagreements in automated clinical workflows.
- Hierarchical agent designs could scale to other high-stakes imaging tasks where verification steps matter.
- Clinical deployment might benefit from explicit modeling of human organizational checks rather than end-to-end prediction alone.
Where Pith is reading between the lines
- Similar role hierarchies might improve performance in pathology or MRI report generation if the same consensus process is adapted.
- The approach could be tested for robustness by swapping the retrieval sources or changing the number of fellow agents.
- Integration with human radiologists could occur at the attending stage, where the AI consensus serves as a draft for final review.
Load-bearing premise
That dividing labor among role-specialized agents and forcing them through repeated consensus rounds will reliably cut clinical hallucinations compared with a single model.
What would settle it
A controlled test on the same dataset in which MARCH produced factual clinical error rates equal to or higher than those of the strongest single vision-language model baseline would refute it.
read the original abstract
Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MARCH, a multi-agent framework for 3D CT radiology report generation that emulates a clinical hierarchy: a Resident Agent performs initial drafting with multi-scale feature extraction, multiple Fellow Agents conduct retrieval-augmented revisions, and an Attending Agent orchestrates iterative stance-based consensus to resolve discrepancies. The central empirical claim is that MARCH significantly outperforms state-of-the-art vision-language model baselines on the RadGenome-ChestCT dataset in both clinical fidelity and linguistic accuracy.
Significance. If the gains are shown to stem specifically from the role specialization and consensus mechanism rather than retrieval or prompting alone, the work would provide concrete evidence that modeling professional organizational structures can improve reliability in medical AI, particularly for reducing hallucinations in high-stakes report generation.
major comments (2)
- [Experimental results / ablation studies] Experimental evaluation: No ablation is presented that holds the underlying VLM, multi-scale feature extraction, retrieval corpus, and total inference budget fixed while removing only the Resident-Fellow-Attending role assignments and the stance-based iterative consensus process. This comparison is required to establish that the hierarchy itself is load-bearing for the reported improvements in clinical fidelity on RadGenome-ChestCT.
- [Results] Results section: The abstract and main claims assert 'significant' outperformance in clinical fidelity and linguistic accuracy, yet the provided text supplies no quantitative metrics, confidence intervals, statistical tests, or error analysis tables to allow verification of the magnitude or reliability of the gains against the baselines.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., a clinical metric delta or p-value) to support the 'significantly outperforms' phrasing.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important ways to strengthen the empirical validation of the multi-agent hierarchy in MARCH. We address each major comment point-by-point below and outline the revisions we will make.
read point-by-point responses
- Referee: Experimental evaluation: No ablation is presented that holds the underlying VLM, multi-scale feature extraction, retrieval corpus, and total inference budget fixed while removing only the Resident-Fellow-Attending role assignments and the stance-based iterative consensus process. This comparison is required to establish that the hierarchy itself is load-bearing for the reported improvements in clinical fidelity on RadGenome-ChestCT.
Authors: We agree that this ablation is necessary to isolate the contribution of the role specialization and stance-based consensus. In the revised manuscript, we will add a controlled ablation study that keeps the VLM backbone, multi-scale feature extraction, retrieval corpus, and total inference budget (measured by token usage and API calls) identical. We will compare the full MARCH system against a flat multi-agent variant in which all agents perform identical tasks without distinct Resident-Fellow-Attending roles or iterative stance-based consensus. This will directly test whether the clinical hierarchy drives the reported gains.
revision: yes
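The budget-matched comparison promised in this response can be sketched as follows. `run_agent` and the particular call counts are illustrative assumptions, not the authors' implementation; the point is that both variants draw on an identical inference budget and differ only in role structure.

```python
# Hedged sketch of a budget-matched ablation: hierarchical roles vs. a
# flat multi-agent variant, both limited to the same number of model calls.
def run_agent(role: str, budget: dict) -> None:
    # Stand-in for one LLM call; enforces the shared inference budget.
    assert budget["calls_left"] > 0, "inference budget exhausted"
    budget["calls_left"] -= 1

def run_hierarchical(budget: dict) -> None:
    run_agent("resident", budget)        # initial draft
    for _ in range(2):
        run_agent("fellow", budget)      # retrieval-augmented revisions
    run_agent("attending", budget)       # stance-based consensus round

def run_flat(budget: dict) -> None:
    for _ in range(4):
        run_agent("generic", budget)     # same budget, no role structure

b_hier, b_flat = {"calls_left": 4}, {"calls_left": 4}
run_hierarchical(b_hier)
run_flat(b_flat)
assert b_hier["calls_left"] == b_flat["calls_left"] == 0  # budgets matched
```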
- Referee: The abstract and main claims assert 'significant' outperformance in clinical fidelity and linguistic accuracy, yet the provided text supplies no quantitative metrics, confidence intervals, statistical tests, or error analysis tables to allow verification of the magnitude or reliability of the gains against the baselines.
Authors: We acknowledge that the quantitative support for the claims should be more explicitly presented. The full results section contains tables reporting specific metrics (BLEU, ROUGE, clinical entity F1, hallucination rates), 95% confidence intervals, and statistical significance tests. In the revision, we will add a concise summary table in the main text, reference these results directly in the abstract and introduction claims, and expand the error analysis subsection with a dedicated table of common failure modes and their frequencies.
revision: yes
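A percentile-bootstrap confidence interval of the kind promised in this response can be sketched as below. The per-report F1 scores are invented for illustration and are not results from the paper.

```python
# Sketch of a 95% percentile-bootstrap CI over per-report clinical scores.
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    # Resample per-report scores with replacement and take the empirical
    # alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

per_report_f1 = [0.71, 0.64, 0.80, 0.58, 0.75, 0.69, 0.62, 0.77]  # made up
lo, hi = bootstrap_ci(per_report_f1)
assert lo <= sum(per_report_f1) / len(per_report_f1) <= hi
```

Paired bootstrap over the same test reports (resampling MARCH and baseline scores jointly) would additionally support the significance tests the authors mention.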
Circularity Check
No circularity: empirical system proposal with external evaluation
full rationale
The paper describes an architectural proposal (MARCH multi-agent hierarchy with Resident/Fellow/Attending roles and stance-based consensus) and reports empirical gains on the RadGenome-ChestCT dataset against external baselines. No equations, parameter-fitting steps, first-principles derivations, or self-referential predictions appear. The central claim reduces to an experimental comparison whose validity depends on the (unexamined) ablation design rather than any definitional or self-citation loop. This is the normal non-circular case for an applied systems paper.
Axiom & Free-Parameter Ledger
Prompt excerpts (fragments recovered from the paper's appendix; truncated passages are left elided):
- Fellow Agent revision prompt: read both the initial medical report and the retrieved relevant medical reports; identify any discrepancies, conflicts, or inconsistencies between them; make the modifications needed for a clinically accurate and coherent final report; if no changes are necessary, output the initial report as is. Output is JSON of the form {"report": "The region 0 is abdomen: ... The region 1 is bone: ...", "reasons": ["...", "..."]}.
- Fellow Agent stance prompt: respond "agree" or "disagree" with the attending radiologist's opinion; give a confidence score from 1 to 3 (3 = high: expert-level knowledge of the medical domain and high confidence in the assessment); state the reason for the opinion, explicitly flagging any change from the previous round and pointing out which conclusions were wrong or which important parts were missed; and quote supporting evidence from the provided medical knowledge in detail rather than merely naming a doctor. Example output: "answer": "agree", "confidence": 3, "reason": "...", "evidences": ["Evidence 1 ...", "Evidence 2 ..."].
- Attending Agent consensus prompt: end the discussion if all doctors agree with the previous synthesized report, or if dissenters are unconfident and offer no convincing evidence; continue if some doctors strongly oppose the report with evidence worth discussing, or if most doctors disagree. When continuing, weigh each doctor's reasoning (reconsider the diagnosis under strong, well-argued opposition; consider but optionally stand firm against hesitant opposition; leave the report unchanged on agreement), produce a revised synthesized report, give detailed revision reasons naming which doctor's opinion or literature was referred to and which original opinions were modified, and issue specific, actionable instructions for each doctor to follow in the next round. Output is JSON of the form {"action": "No", "report": "...", "reasons": ["reason1...", "reason2..."], "instructions": ["instruction1...", "instruction2..."]}.
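The stopping rules quoted in the Attending Agent prompt can be sketched as a small decision function. The field names mirror the JSON stance format in the excerpts; the thresholds and structure are an interpretation, not the authors' code.

```python
# Sketch of the Attending Agent's stance-based stopping rules:
# stop on unanimous agreement or weak, unsupported dissent; continue
# when dissent is a majority or is confident and evidence-backed.
def should_continue(stances: list) -> bool:
    dissent = [s for s in stances if s["answer"] == "disagree"]
    if not dissent:
        return False                  # all doctors agree: stop
    if len(dissent) > len(stances) / 2:
        return True                   # most doctors disagree: continue
    # otherwise continue only if some dissenter is highly confident
    # and cites supporting evidence
    return any(s["confidence"] == 3 and s["evidences"] for s in dissent)

stances = [
    {"answer": "agree", "confidence": 3, "evidences": []},
    {"answer": "disagree", "confidence": 1, "evidences": []},
]
decision = should_continue(stances)  # weak, unsupported dissent: stop
```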
discussion (0)