IMACT-CXR: An Interactive Multi-Agent Conversational Tutoring System for Chest X-Ray Interpretation

Akash Awasthi; Anh Mai Vu; David Yang; Hien Van Nguyen; Tuan-Anh Le

arxiv: 2511.15825 · v2 · submitted 2025-11-19 · 💻 cs.AI

Pith reviewed 2026-05-17 20:10 UTC · model grok-4.3

keywords chest x-ray interpretation · multi-agent conversational system · medical education · gaze analysis · Bayesian knowledge tracing · radiology tutoring · interactive AI tutor · REFLACX dataset

The pith

A multi-agent conversational tutor improves chest X-ray localization and diagnostic reasoning by integrating gaze analysis, knowledge retrieval, and adaptive coaching in one workflow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IMACT-CXR as a system that lets medical trainees practice reading chest X-rays through a conversational interface built from several cooperating AI agents. The tutor takes in the learner's drawn boxes, eye-tracking data, and written notes, then responds with targeted questions, evidence from medical literature, and similar example cases while avoiding direct answers. Bayesian tracking estimates which skills the trainee has mastered to select useful follow-up cases and reinforcement. A lung segmentation tool gives anatomically specific feedback on where the learner looked. Early tests on cases from the REFLACX dataset indicate clearer localization of findings and stronger reasoning than simpler baseline tools.

Core claim

IMACT-CXR unifies spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning inside a single AutoGen-based multi-agent workflow. It ingests learner bounding boxes, gaze samples, and free-text observations, then deploys specialized agents to evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar REFLACX cases, and trigger vision-language reasoning when needed. Bayesian Knowledge Tracing maintains skill-specific mastery estimates that drive reinforcement and case selection. A TensorFlow U-Net lung-lobe segmentation module supplies anatomically aware gaze feedback, and safety prompts limit premature disclosure of ground-truth labels.
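To make the mastery-tracking step concrete, here is a minimal sketch of the standard four-parameter Bayesian Knowledge Tracing update. It is not the authors' implementation: the class name and all parameter values (prior, learn, slip, guess) are illustrative assumptions, since the paper does not report its BKT settings.

```python
from dataclasses import dataclass


@dataclass
class BKTSkill:
    """Standard BKT state for one skill; all default values are assumed."""
    p_mastery: float = 0.3   # P(L0): prior probability the skill is mastered
    p_learn: float = 0.2     # P(T): chance of acquiring the skill after a step
    p_slip: float = 0.1      # P(S): mastered, but the response is incorrect
    p_guess: float = 0.25    # P(G): not mastered, but the response is correct

    def update(self, correct: bool) -> float:
        """Condition on the observation (Bayes' rule), then apply learning."""
        p = self.p_mastery
        if correct:
            num = p * (1 - self.p_slip)
            den = num + (1 - p) * self.p_guess
        else:
            num = p * self.p_slip
            den = num + (1 - p) * (1 - self.p_guess)
        posterior = num / den
        # Learning transition: an unmastered skill may become mastered.
        self.p_mastery = posterior + (1 - posterior) * self.p_learn
        return self.p_mastery
```

A correct response raises the estimate and an incorrect one lowers it; a tutor of this kind could compare the estimate against a threshold to decide when to reinforce a skill or pick the next case.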

What carries the argument

The AutoGen multi-agent workflow coordinated with Bayesian Knowledge Tracing for mastery estimation and U-Net lung segmentation for gaze feedback.

If this is right

Trainees receive real-time coaching that reacts to both their eye movements and their annotations without revealing the correct diagnosis immediately.
PubMed evidence and similar-case retrieval from REFLACX are pulled only when the Bayesian tracker indicates low mastery in a skill.
Vision-language reasoning activates selectively when the learner asks or when mastery estimates remain low, preserving active problem-solving.
The architecture supports bounded response latency and controlled information leakage, enabling safe use with live DICOM images.
The modular agent design allows extension toward residency-program deployment without rebuilding the core tutoring loop.
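The conditional triggering described in the points above can be sketched as a small gating function. This is an illustration under stated assumptions: the agent names and the 0.4 threshold are hypothetical, and the paper does not publish its actual trigger rule or cutoff.

```python
from typing import List


def select_optional_agents(
    mastery: float,
    learner_asked: bool,
    low_mastery: float = 0.4,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Return the optional agents to run this turn (names are hypothetical)."""
    agents: List[str] = []
    if mastery < low_mastery:
        # Retrieval is gated on low mastery for the skill in question.
        agents += ["pubmed_retrieval", "similar_case_retrieval"]
    if learner_asked or mastery < low_mastery:
        # Vision-language reasoning runs on request or when mastery stays low.
        agents.append("vision_language_reasoning")
    return agents
```

Keeping the gate as a pure function of the mastery estimate and the learner's request makes the leakage policy easy to audit, because every call to the reasoning model can be traced back to one of the two conditions.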

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approach proves robust, residency programs could use it to scale supervised practice without increasing faculty workload.
Adding live integration with hospital PACS viewers might let the tutor coach during actual patient cases rather than only on archived studies.
Long-term follow-up studies could check whether repeated sessions with the system produce measurable retention of diagnostic skills months later.

Load-bearing premise

That gains observed in a small preliminary test on existing REFLACX cases will translate to lasting improvements in how trainees actually learn and perform on new chest X-rays in real training programs.

What would settle it

A randomized trial that measures diagnostic accuracy and reasoning quality on a fresh set of chest X-ray cases for trainees who use the full multi-agent tutor versus trainees who use only single-agent or non-interactive tools; no measurable difference would falsify the central claim.

Original abstract

IMACT-CXR is an interactive multi-agent conversational tutor that helps trainees interpret chest X-rays by unifying spatial annotation, gaze analysis, knowledge retrieval, and image-grounded reasoning in a single AutoGen-based workflow. The tutor simultaneously ingests learner bounding boxes, gaze samples, and free-text observations. Specialized agents evaluate localization quality, generate Socratic coaching, retrieve PubMed evidence, suggest similar cases from REFLACX, and trigger NV-Reason-CXR-3B for vision-language reasoning when mastery remains low or the learner explicitly asks. Bayesian Knowledge Tracing (BKT) maintains skill-specific mastery estimates that drive both knowledge reinforcement and case similarity retrieval. A lung-lobe segmentation module derived from a TensorFlow U-Net enables anatomically aware gaze feedback, and safety prompts prevent premature disclosure of ground-truth labels. We describe the system architecture, implementation highlights, and integration with the REFLACX dataset for real DICOM cases. IMACT-CXR demonstrates responsive tutoring flows with bounded latency, precise control over answer leakage, and extensibility toward live residency deployment. Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a practical multi-agent CXR tutor by gluing together AutoGen, BKT, gaze feedback, and a vision model, but the evaluation is too thin to back the improvement claims.

The core of this work is a unified AutoGen workflow that lets a trainee draw bounding boxes, share gaze data, and get Socratic coaching plus PubMed retrieval and on-demand vision-language reasoning when mastery is low. BKT tracks skill progress and pulls similar REFLACX cases, while a U-Net gives anatomically grounded gaze feedback and safety prompts limit answer leakage. That combination in one live tutoring loop is the new piece; prior work has used pieces of it separately but not this integrated setup for chest X-ray training.

Referee Report

2 major / 2 minor

Summary. The paper presents IMACT-CXR, an AutoGen-based interactive multi-agent conversational tutoring system for chest X-ray interpretation. It integrates learner-provided bounding boxes and gaze data, BKT for skill mastery tracking, a TensorFlow U-Net for lung-lobe segmentation and gaze feedback, PubMed retrieval, REFLACX case similarity, and conditional triggering of NV-Reason-CXR-3B for vision-language reasoning, with safety prompts to avoid premature ground-truth disclosure. The manuscript describes the architecture, implementation on real DICOM cases, and reports that a preliminary evaluation demonstrates improved localization and diagnostic reasoning relative to baselines.

Significance. A rigorously evaluated version of this system could advance AI-supported medical education by showing how multi-agent orchestration, gaze-aware feedback, and mastery tracking can deliver Socratic, anatomically grounded tutoring for visual diagnostic skills. The bounded-latency design and leakage controls are practical strengths for potential residency deployment; however, the current lack of quantitative evaluation details limits assessment of whether the multi-agent integration yields benefits beyond simpler tutoring interfaces.

major comments (2)

The central claim that 'preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines' (abstract) is unsupported because the manuscript provides no participant count, exact metrics (e.g., IoU/Dice for localization or rubric scores for reasoning), baseline definitions, statistical tests, or ablation results. This absence makes it impossible to determine whether observed gains are attributable to the BKT + multi-agent design or to generic interaction effects.
§ on evaluation (or equivalent): the integration of REFLACX cases, U-Net gaze feedback, and NV-Reason-CXR-3B triggering is described at the architectural level, but no quantitative results or experimental protocol are reported to substantiate the improvement claim, rendering the empirical contribution unverifiable.
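For reference, the localization metric the report asks for is simple to compute. The sketch below shows intersection-over-union for two axis-aligned boxes; the (x_min, y_min, x_max, y_max) convention is an assumption, and nothing here comes from the manuscript's own evaluation code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Reporting the mean IoU between learner boxes and the REFLACX expert boxes, with and without the tutor, would be one way to substantiate the localization claim.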

minor comments (2)

A workflow diagram or pseudocode for the AutoGen agent orchestration would clarify the conditional triggering logic and data flow between localization evaluation, BKT updates, and retrieval agents.
Clarify how 'bounded latency' was measured and what the observed values were; this detail would strengthen the deployment-readiness claim.
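Both minor comments could be addressed with something as small as the following sketch: a fixed-order stage runner that records wall-clock time per stage. It deliberately uses plain Python rather than the AutoGen API, and the stage functions, state keys, and the 0.4 threshold are placeholders invented for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[dict], dict]]


def run_turn(submission: dict, stages: List[Stage]) -> Tuple[dict, Dict[str, float]]:
    """Run stages in a fixed order on shared state and time each one."""
    state = dict(submission)
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        state = fn(state)
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())
    return state, timings


# Placeholder stages; real agents would call models and retrieval services.
def evaluate_localization(state: dict) -> dict:
    state["localization_ok"] = bool(state.get("boxes"))
    return state


def update_mastery(state: dict) -> dict:
    state["mastery"] = 0.7 if state["localization_ok"] else 0.2
    return state


def decide_reasoning(state: dict) -> dict:
    state["run_vlm"] = state.get("asked", False) or state["mastery"] < 0.4
    return state


STAGES: List[Stage] = [
    ("evaluate_localization", evaluate_localization),
    ("update_mastery", update_mastery),
    ("decide_reasoning", decide_reasoning),
]
```

Calling `run_turn({"boxes": [(0, 0, 1, 1)], "asked": False}, STAGES)` returns the final state together with per-stage timings, which is the kind of data a latency table in the revised manuscript would need.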

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our manuscript. We agree that the preliminary evaluation section lacks sufficient quantitative detail to support the claims made in the abstract, and we will revise the paper to address this by qualifying or removing unsubstantiated statements while preserving the description of the system architecture.

Referee: The central claim that 'preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines' (abstract) is unsupported because the manuscript provides no participant count, exact metrics (e.g., IoU/Dice for localization or rubric scores for reasoning), baseline definitions, statistical tests, or ablation results. This absence makes it impossible to determine whether observed gains are attributable to the BKT + multi-agent design or to generic interaction effects.

Authors: We acknowledge that this comment is correct and that the abstract claim is not supported by the level of detail provided in the current manuscript. Our preliminary evaluation consisted of informal testing with a small number of trainees to verify system responsiveness and tutoring flow on REFLACX cases, without a formal protocol, controlled baselines, or collection of specific metrics such as IoU or statistical comparisons. In the revised version we will update the abstract and any related text to state that the system demonstrates functional tutoring interactions with bounded latency and leakage controls, while removing the unsupported claim of improved performance relative to baselines. (Revision: yes.)
Referee: § on evaluation (or equivalent): the integration of REFLACX cases, U-Net gaze feedback, and NV-Reason-CXR-3B triggering is described at the architectural level, but no quantitative results or experimental protocol are reported to substantiate the improvement claim, rendering the empirical contribution unverifiable.

Authors: We agree that the evaluation content is limited to architectural description and qualitative observations of system behavior. No formal experimental protocol or quantitative results (e.g., mastery tracking accuracy or reasoning rubric scores) were included because the work focused on presenting the multi-agent framework and its integration with existing components. For the revision we will either add any available internal testing protocol details or reframe the section to emphasize design choices and extensibility, deferring empirical validation to future studies. (Revision: yes.)

Standing simulated objections (not resolved)

We do not have recorded participant counts, exact quantitative metrics, baseline definitions, or statistical test results from the preliminary evaluation, as it was not structured as a controlled comparative study.

Circularity Check

0 steps flagged

No circularity: system composes external components without self-referential derivations

The paper describes an architecture that integrates established external libraries and datasets (AutoGen, BKT, U-Net, PubMed, REFLACX) into a multi-agent tutor. No equations, fitted parameters, uniqueness theorems, or predictions are presented that reduce by construction to the paper's own inputs or prior self-citations. The preliminary evaluation claim is an empirical statement separate from the system definition itself, leaving the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on integration of existing technologies without new free parameters or invented entities. Key assumptions include reliable performance of BKT for mastery estimation and U-Net segmentation for anatomically aware feedback.

axioms (2)

domain assumption Bayesian Knowledge Tracing accurately maintains skill-specific mastery estimates from learner interactions
Invoked to drive knowledge reinforcement and case retrieval in the tutoring workflow.
domain assumption Lung-lobe segmentation from TensorFlow U-Net provides useful anatomically aware gaze feedback
Used to enable precise gaze analysis relative to lung anatomy.
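The second assumption can be illustrated with a minimal sketch of how a segmentation label mask might turn raw fixations into region-level feedback. The integer label ids, region names, and the (x, y, duration) fixation layout are all hypothetical; the paper's U-Net output format is not described in the text above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Fixation = Tuple[float, float, float]  # assumed (x pixel, y pixel, duration in s)


def gaze_share_by_region(
    fixations: Iterable[Fixation],
    label_mask: List[List[int]],      # label_mask[row][col] -> integer region id
    region_names: Dict[int, str],     # hypothetical id -> lobe name mapping
) -> Dict[str, float]:
    """Fraction of in-image dwell time spent in each labelled region."""
    height, width = len(label_mask), len(label_mask[0])
    dwell: Counter = Counter()
    for x, y, duration in fixations:
        if not (0 <= x < width and 0 <= y < height):
            continue  # ignore fixations that fall outside the image
        label = label_mask[int(y)][int(x)]
        dwell[region_names.get(label, "outside lungs")] += duration
    total = sum(dwell.values())
    return {name: t / total for name, t in dwell.items()} if total else {}
```

A tutor could compare these shares with the region that contains the expert annotation and, if the learner spent little time there, prompt them to revisit that lobe without naming the finding.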

pith-pipeline@v0.9.0 · 5515 in / 1495 out tokens · 44373 ms · 2026-05-17T20:10:04.317739+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Preliminary evaluation shows improved localization and diagnostic reasoning compared to baselines.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Bayesian Knowledge Tracing (BKT) maintains skill-specific mastery estimates

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Traditional simulators rely on static quizzes and rarely explain why a diagnosis is correct, while human tutoring is limited by faculty availability

INTRODUCTION Chest X-rays remain the most common radiological examination, but novice readers require iterative feedback to build pattern recognition and diagnostic reasoning skills. Traditional simulators rely on static quizzes and rarely explain why a diagnosis is correct, while human tutoring is limited by faculty availability. Intelligent tutoring s...

[2] IMACT-CXR: An Interactive Multi-Agent Conversational Tutoring System for Chest X-Ray Interpretation

METHODS 2.1. System Architecture IMACT-CXR is implemented as an AutoGen conversational workflow [5] that invokes Python function agents in a fixed order each turn (Fig. 1). Each student submission contains the current bounding boxes, optional gaze fixations, and free-text interpretation. The orchestrator executes the following stages synchronously: fo...

[3] Dataset and Environment IMACT-CXR operates on the REFLACX dataset, which provides chest X-ray DICOMs, expert bounding boxes, and eye-tracking fixations [7]

IMPLEMENTATION AND SYSTEM VALIDATION 3.1. Dataset and Environment IMACT-CXR operates on the REFLACX dataset, which provides chest X-ray DICOMs, expert bounding boxes, and eye-tracking fixations [7]. Knowledge snippets are retrieved via the PubMed E-utilities API, while NV-Reason-CXR-3B is invoked through a locally hosted PyTorch stack [6]. The syste...

[4] consider the right upper lobe

DISCUSSION The IMACT-CXR architecture demonstrates several practical benefits. First, simultaneous ingestion of spatial, gaze, and textual inputs allows the tutor to emulate expert mentors who expect all evidence before offering hints. Second, mastery-driven triggering of knowledge and reasoning agents reduces redundant feedback, ensuring that PubMed ...

[5] The AutoGen workflow ensures that tutors receive all learner evidence before responding, while safety prompts prevent premature disclosure

CONCLUSION We present IMACT-CXR, a multi-agent conversational tutor for chest X-ray interpretation that unifies spatial validation, gaze analytics, PubMed retrieval, NV-Reason reasoning, and mastery-aware orchestration. The AutoGen workflow ensures that tutors receive all learner evidence before responding, while safety prompts prevent premature discl...

[6] ACKNOWLEDGMENTS We thank the REFLACX team for publicly releasing the dataset and NVIDIA for access to NV-Reason-CXR-3B

[7] E. A. Krupinski, “The role of perception in imaging: past and future,” Seminars in Nuclear Medicine, 2011

[8] D. A. Cook et al., “Internet-Based Learning in the Health Professions,” JAMA, 2008

[9] W. J. Tuddenham et al., “Visual search, image organization, and reader error in roentgen diagnosis. Studies of the psychophysiology of roentgen image perception,” Radiology, 1962

[10] Yanxin Zheng et al., “Large language models for medicine: a survey,” arXiv preprint 2405.13055, 2024

[11] Q. Wu et al., “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,” arXiv preprint 2308.08155, 2023

[12] NVIDIA, “NV-Reason-CXR-3B: Vision-language reasoning for chest radiography,” arXiv preprint 2510.23968, 2025

[13] R. B. Lanfredi et al., “REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays,” Scientific Data, vol. 9, 2022

[14] Z. Chen et al., “A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation,” arXiv preprint 2401.12208, 2024

[15] Z. Liu et al., “Radiology-GPT: A Large Language Model for Radiology,” arXiv preprint 2306.08666, 2023
