An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing
Pith reviewed 2026-05-08 02:21 UTC · model gemini-3-flash-preview
The pith
An artifact-based agent framework automates medical image processing by synthesizing dataset-specific workflows while maintaining a deterministic record of every computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that medical imaging workflows can be made both adaptive and reproducible by introducing a semantic layer called an artifact contract. This contract formalizes the relationship between data inputs, processing rules, and outputs, allowing an agent to assemble valid computational graphs dynamically. Rather than replacing the workflow engine, the agent acts as a configuration synthesizer that maps high-level goals to specific tool chains based on data metadata. This approach successfully handles heterogeneous CT and MRI cohorts by generating tailored pipelines that remain fully auditable and re-executable.
What carries the argument
The Artifact Contract, a formal definition of intermediate and final outputs that enables a goal-conditioned agent to assemble modular processing rules into a deterministic computational graph.
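As a rough illustration of how such a contract could drive goal-conditioned assembly, the sketch below models rules as declarations of the artifact types they consume and produce, then walks backward from a goal artifact type to an ordered rule chain. The names `Rule` and `assemble`, and the example artifact types, are hypothetical and not taken from the paper; the paper's actual contract schema and planner may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: consumes some artifact types, produces one."""
    name: str
    inputs: tuple[str, ...]
    output: str


def assemble(goal: str, available: set[str], rules: list[Rule]) -> list[str]:
    """Return an ordered list of rule names that produces `goal`.

    Backward chaining: find a rule whose output is the goal, then
    recursively satisfy its inputs. Raises LookupError if no chain exists.
    """
    plan: list[str] = []
    have = set(available)

    def satisfy(artifact: str, visiting: frozenset[str]) -> None:
        if artifact in have:
            return
        if artifact in visiting:
            raise LookupError(f"cyclic dependency on {artifact!r}")
        for rule in rules:
            if rule.output != artifact:
                continue
            # Remember state so a failed candidate leaves no partial steps.
            plan_len, have_before = len(plan), set(have)
            try:
                for needed in rule.inputs:
                    satisfy(needed, visiting | {artifact})
            except LookupError:
                del plan[plan_len:]
                have.clear()
                have.update(have_before)
                continue  # try the next candidate rule
            plan.append(rule.name)
            have.add(artifact)
            return
        raise LookupError(f"no rule produces {artifact!r}")

    satisfy(goal, frozenset())
    return plan


# Illustrative rule library (all names are assumptions).
rules = [
    Rule("resample", ("ct_volume",), "ct_resampled"),
    Rule("segment_organs", ("ct_resampled",), "organ_mask"),
    Rule("organ_volumes", ("organ_mask",), "volume_table"),
]
plan = assemble("volume_table", {"ct_volume"}, rules)
print(plan)  # ['resample', 'segment_organs', 'organ_volumes']
```

Because the rules are scanned in a fixed order and the result is a plain list, the same inputs always yield the same plan, which is the property a deterministic executor would then rely on.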
If this is right
- Processing pipelines can automatically adjust to variations in image resolution, modality, or anatomical coverage without manual code changes.
- Provenance tracking becomes a native feature of the analysis, allowing researchers to audit the exact logic used to generate a clinical finding.
- Non-technical users can interact with complex imaging databases using semantic queries that ground natural language in actual data artifacts.
- The separation of agent reasoning and local execution allows for privacy-compliant processing where data never leaves the clinical environment.
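The first implication above, adjusting a pipeline to resolution and modality without manual code changes, can be sketched as a metadata-driven configuration function. This is a minimal sketch under stated assumptions: the metadata keys (`modality`, `spacing_mm`), the step names, and the 1 mm isotropic target are hypothetical and are not taken from the paper.

```python
import math


def configure(meta: dict, target_spacing=(1.0, 1.0, 1.0), tol=1e-3) -> list[str]:
    """Pick preprocessing steps from scan metadata (illustrative only)."""
    steps = []
    spacing = meta["spacing_mm"]
    # Resample only when the voxel spacing differs from the target.
    if any(not math.isclose(s, t, abs_tol=tol)
           for s, t in zip(spacing, target_spacing)):
        steps.append("resample")
    # Modality-specific intensity handling (hypothetical step names).
    if meta["modality"] == "CT":
        steps.append("clip_hounsfield")
    elif meta["modality"] == "MR":
        steps.append("bias_field_correction")
    steps.append("segment")
    return steps


ct_steps = configure({"modality": "CT", "spacing_mm": (0.7, 0.7, 3.0)})
mr_steps = configure({"modality": "MR", "spacing_mm": (1.0, 1.0, 1.0)})
print(ct_steps)  # ['resample', 'clip_hounsfield', 'segment']
print(mr_steps)  # ['bias_field_correction', 'segment']
```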
Where Pith is reading between the lines
- This framework could serve as a compiler for clinical guidelines, translating textual medical protocols into executable image-processing pipelines.
- The reliance on metadata suggests that the future of medical imaging depends as much on standardized data-characterization tools as it does on the processing algorithms themselves.
- If the rule library expands, the agent could potentially optimize workflows for computational efficiency or cost rather than just analytical correctness.
Load-bearing premise
The system assumes that all relevant dataset conditions can be accurately captured as machine-readable metadata that the agent is capable of interpreting.
What would settle it
If a workflow generated by the agent for a new, unseen dataset fails to produce valid artifacts despite the necessary rules being present in the library, the framework's adaptive reasoning would be proven insufficient.
Original abstract
Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and provenance tracking. Two requirements therefore become central: adaptability, the ability to configure workflows according to dataset-specific conditions and evolving analytical goals; and reproducibility, the guarantee that all transformations and decisions are explicitly recorded and re-executable. Here, we present an artifact-based agent framework that introduces a semantic layer to augment medical image processing. The framework formalizes intermediate and final outputs through an artifact contract, enabling structured interrogation of workflow state and goal-conditioned assembly of configurations from a modular rule library. Execution is delegated to a workflow executor to preserve deterministic computational graph construction and provenance tracking, while the agent operates locally to comply with most privacy constraints. We evaluate the framework on real-world clinical CT and MRI cohorts, demonstrating adaptive configuration synthesis, deterministic reproducibility across repeated executions, and artifact-grounded semantic querying. These results show that adaptive workflow configuration can be achieved without compromising reproducibility in heterogeneous clinical environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper introduces an agent-based framework for medical image processing designed to balance adaptability and reproducibility. The architecture separates the decision-making 'Agent' from a deterministic 'Executor'. The Agent uses a 'Rule Library' and 'Artifact Contracts'—a set of metadata specifications for inputs and outputs—to synthesize computational graphs tailored to specific dataset characteristics. The authors argue that by formalizing the intermediate states of a workflow as artifacts, they can achieve goal-conditioned configuration while maintaining an immutable provenance record through the executor. The framework is evaluated on CT organ segmentation and MRI brain tumor segmentation tasks using clinical datasets.
Significance. The paper addresses a critical bottleneck in clinical imaging: the high overhead of manually configuring pipelines for heterogeneous data. The primary strength is the formalization of the 'Artifact Contract', which provides a machine-readable interface between modular imaging tools. By decoupling the reasoning (Agent) from the execution (Executor), the work offers a pathway to use large language models or rule-based systems in clinical environments without sacrificing the auditability of the final execution graph. The inclusion of real-world clinical cohorts rather than just curated benchmarks is a significant strength that demonstrates the system's practical utility in handling varied metadata.
major comments (3)
- [§2.2, 'Artifact Contract'] The Artifact Contract definition focuses heavily on technical metadata (file formats, dimensions, voxel spacing). However, it lacks a mechanism for clinical semantic validation. For instance, the contract as described does not seem to prevent the Agent from applying a rule designed for non-contrast CT to a contrast-enhanced artifact if both share the same technical dimensions. The authors should clarify how the framework ensures 'scientific' validity versus merely 'execution' validity, or explicitly define the scope of the Agent's reasoning as limited to technical compatibility.
- [§4.1, Table 1 / Performance Comparison] The evaluation demonstrates that the Agent can successfully synthesize a valid path, but it lacks a comparison against a fixed, 'one-size-fits-all' baseline. To support the claim of 'adaptive configuration synthesis' as a benefit, the results should show that the Agent's dataset-specific adaptation outperforms a standard static pipeline (e.g., a default nnU-Net configuration) across the heterogeneous cohorts tested.
- [§3.3, 'Reproducibility'] The paper claims deterministic reproducibility, yet the Agent's decision-making process (choosing from the Rule Library) appears to happen outside the workflow executor. If the Agent's selection logic is updated or contains stochastic components (e.g., if using an LLM-based agent as hinted in the discussion), the 'same' input could result in different execution graphs over time. The authors must specify if the Agent's state and rule-set versions are captured in the provenance metadata to ensure the entire pipeline, not just the resulting DAG, is reproducible.
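The distinction raised in the first major comment, between execution validity and scientific validity, can be made concrete with a small sketch. It assumes, hypothetically, that each artifact carries a metadata dictionary and that each rule declares technical requirements and clinical tag requirements separately; the field names (`modality`, `ndim`, `contrast_phase`) and the function `check_rule` are illustrative and not taken from the manuscript.

```python
def check_rule(artifact_meta: dict, technical: dict, clinical: dict) -> dict:
    """Report technical and clinical compatibility separately."""

    def mismatches(required: dict) -> list:
        # Keys whose required value is absent or different in the artifact.
        return sorted(k for k, v in required.items()
                      if artifact_meta.get(k) != v)

    tech_fail = mismatches(technical)
    clin_fail = mismatches(clinical)
    return {
        "execution_valid": not tech_fail,
        "scientifically_valid": not tech_fail and not clin_fail,
        "failed_keys": tech_fail + clin_fail,
    }


# A contrast-enhanced CT that meets the rule's technical requirements.
artifact = {"modality": "CT", "ndim": 3, "contrast_phase": "portal_venous"}

# Hypothetical rule intended for non-contrast CT only.
result = check_rule(
    artifact,
    technical={"modality": "CT", "ndim": 3},
    clinical={"contrast_phase": "none"},
)
print(result)
# {'execution_valid': True, 'scientifically_valid': False,
#  'failed_keys': ['contrast_phase']}
```

In this toy case the rule would run without error, yet the clinical tag check flags it, which is exactly the gap the referee asks the authors to address or to scope out.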
minor comments (3)
- [§2.3] The 'Rule Library' is described as modular, but the manuscript does not provide the scale of these libraries. Please clarify the number of rules used in the CT vs. MRI experiments to give the reader a sense of the search space complexity.
- [Figure 2] The notation for 'Artifact Flow' is slightly ambiguous; use distinct arrow styles or colors to differentiate between metadata interrogation (Agent-Artifact) and data movement (Executor-Artifact).
- [§5.2] The discussion mentions privacy constraints. It would be helpful to explicitly state whether the Artifact Contracts ever contain Protected Health Information (PHI) or if they are strictly limited to non-identifying technical headers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for highlighting the significance of our Artifact-based Agent Framework in clinical imaging. We appreciate the recognition of the 'Artifact Contract' as a key innovation for decoupling reasoning from execution. We have addressed the referee's concerns regarding semantic validation, performance baselines, and the reproducibility of the synthesis process itself through revisions in Sections 2.2, 3.3, and 4.1.
- Referee: [§2.2, 'Artifact Contract'] The Artifact Contract definition focuses heavily on technical metadata... It lacks a mechanism for clinical semantic validation... The authors should clarify how the framework ensures 'scientific' validity versus merely 'execution' validity.
  Authors: The referee is correct that our initial description focused on technical metadata necessary for low-level execution. However, the Artifact Contract's metadata field is designed to be extensible to clinical semantics. In the revised manuscript, we have clarified that the 'artifact_type' and 'metadata' fields are intended to encompass clinical attributes such as contrast phase, anatomical region, and sequence type. We have added a paragraph in Section 2.2 describing 'Semantic Rules' within the Rule Library that specifically check these clinical tags (e.g., ensuring a 'contrast-enhanced' tag is present before invoking a specific segmentation rule). This ensures that the Agent reasoning encompasses scientific validity in addition to technical compatibility.
  Revision: yes
- Referee: [§4.1, Table 1 / Performance Comparison] The evaluation... lacks a comparison against a fixed, 'one-size-fits-all' baseline. To support the claim of 'adaptive configuration synthesis' as a benefit, the results should show that the Agent's dataset-specific adaptation outperforms a standard static pipeline.
  Authors: We agree that a static baseline is necessary to quantify the benefit of adaptivity. We have updated Table 1 and Section 4.1 to include a 'Static Baseline' comparison. In our tests on the CT organ segmentation task, we compare the Agent-synthesized pipeline against a fixed pre-processing pipeline (standardized resampling and intensity scaling based on the aggregate dataset mean). The results demonstrate that the Agent's ability to detect and adapt to varying voxel spacings and manufacturer-specific metadata leads to a statistically significant improvement in segmentation accuracy (DSC) compared to the static baseline, particularly in the heterogeneous 'Clinical Cohort B'.
  Revision: yes
- Referee: [§3.3, 'Reproducibility'] The paper claims deterministic reproducibility, yet the Agent's decision-making process... appears to happen outside the workflow executor... The authors must specify if the Agent's state and rule-set versions are captured in the provenance metadata.
  Authors: This is an important point regarding the scope of reproducibility. In the original manuscript, provenance focused on the execution of the computational graph. We have now revised Section 3.3 and our implementation to include 'Synthesis Provenance.' This ensures that the version of the Rule Library, the Agent's internal state (e.g., hyper-parameters), and any external model identifiers (for LLM-based reasoning) are captured as metadata in the final Artifact. This ensures that the entire lifecycle, from initial data ingestion to the final synthesized execution, is fully auditable and reproducible. We have clarified that if the Agent's logic is updated, it is treated as a new version of the synthesis engine, which is explicitly recorded in the provenance trace.
  Revision: yes
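The 'Synthesis Provenance' idea in the last response can be sketched as a fingerprint over everything that influenced the agent's choice. This is a minimal sketch, not the authors' implementation: the function `synthesis_fingerprint`, its field names, and the example values are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json


def synthesis_fingerprint(rule_library_version: str,
                          agent_params: dict,
                          model_id: str | None,
                          plan: list[str]) -> str:
    """Hash the inputs to synthesis so any change becomes visible.

    The record is serialized canonically (sorted keys, fixed separators)
    so that dictionary ordering does not affect the digest.
    """
    record = {
        "rule_library_version": rule_library_version,
        "agent_params": agent_params,
        "model_id": model_id,
        "plan": plan,
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values for illustration only.
fp_a = synthesis_fingerprint("1.2.0", {"temperature": 0.0, "seed": 7},
                             None, ["resample", "segment"])
fp_b = synthesis_fingerprint("1.2.0", {"seed": 7, "temperature": 0.0},
                             None, ["resample", "segment"])
fp_c = synthesis_fingerprint("1.3.0", {"temperature": 0.0, "seed": 7},
                             None, ["resample", "segment"])

print(fp_a == fp_b)  # True: same content, different key order
print(fp_a == fp_c)  # False: rule library version changed
```

Storing such a digest alongside the final artifact would let a later run detect that the rule library or agent settings have changed, even when the resulting execution graph happens to look the same.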
Circularity Check
No significant circularity: Results are functional verifications of a self-contained system architecture.
The framework's primary claims—reproducibility and adaptive synthesis—are established through the integration of a deterministic execution engine and a structured rule library. Reproducibility is an inherited property of the chosen executor (e.g., Snakemake), and the agent's 'adaptive' success is a verification that the rule library successfully covers the metadata variants present in the test cohorts. While the system's performance is a direct consequence of these authored inputs (rules and contracts), this represents a successful functional verification of the software architecture rather than a circular scientific derivation. The 'adaptive configuration' is a traversal of a search space defined by the authors; while limited by that library, its successful synthesis is an implementation result rather than a logical fallacy. The authors do not invoke unverified 'uniqueness theorems' or rely on load-bearing self-citations to justify the framework's validity, and the evaluation on real-world clinical data serves as an empirical check of the system's utility.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: DICOM/Metadata Reliability
- Domain assumption: Modular Rule Sufficiency
invented entities (1)
- Artifact Contract (no independent evidence)