pith. machine review for the scientific record.

arxiv: 2604.22768 · v1 · submitted 2026-03-25 · 💻 cs.CY · cs.CL

Recognition: unknown

Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

Pith reviewed 2026-05-15 00:19 UTC · model grok-4.3

classification 💻 cs.CY cs.CL
keywords on-premise LLM deployment · radiology AI · network isolation · data privacy · clinical utility evaluation · open-weights models · hospital IT security · PHI processing

The pith

An isolation-first on-premise architecture enables secure deployment of open-weights LLMs in radiology while processing unanonymized patient data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a containerized LLM system built for hospital use that keeps all inference inside the institution's network through strict segmentation, egress filtering, and monitoring. This setup received formal approval from compliance, data protection, and security officers to handle real patient information. In a one-week pilot, 22 radiologists rated the system stable and found it most helpful for editing reports and retrieving guidelines, while open-ended summary tasks produced more hallucinations. The architecture now supports an official service for a large German university hospital. The authors released the full deployment package for others to adopt.

Core claim

The isolation-first containerized LLM inference stack, relying on strict network segmentation, host-enforced egress filtering, and active isolation monitoring, overcomes regulatory barriers for on-premise use of open-weights models such as DeepSeek-R1 in radiology. This enables processing of unanonymized PHI under institutional governance, with pilot results showing highest clinical utility for text-anchored tasks like report correction and guideline lookup, while open-ended generation tasks exhibit more critical errors such as hallucinations.

What carries the argument

The isolation-first containerized LLM inference stack with host-enforced egress filtering and active isolation monitoring.
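
One way to make the segmentation part of this claim checkable is to audit the container network definitions before start-up. The sketch below assumes a Docker Compose style layout in which backend networks are marked `internal: true` (a standard Compose network attribute that removes external routing). The service and network names (`ingress`, `frontend`, `inference`, `edge`, `backend`) are hypothetical and not taken from the released package. The function takes an already-parsed Compose mapping and reports every service attached to a network that is not internal.

```python
from typing import Any, Dict, FrozenSet, List, Mapping


def externally_routable_services(
    compose: Mapping[str, Any],
    allowed: FrozenSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    """Map each non-allowed service to the non-internal networks it joins."""
    networks = compose.get("networks") or {}
    exposed: Dict[str, List[str]] = {}
    for name, service in (compose.get("services") or {}).items():
        # Compose attaches a service to the implicit "default" network
        # when it lists none; "networks" may be a list or a mapping.
        attached = list((service or {}).get("networks") or ["default"])
        open_nets = [
            net for net in attached
            if not (networks.get(net) or {}).get("internal", False)
        ]
        if open_nets and name not in allowed:
            exposed[name] = open_nets
    return exposed


# Hypothetical layout: only the ingress proxy may touch a routable network.
EXAMPLE = {
    "networks": {"edge": {}, "backend": {"internal": True}},
    "services": {
        "ingress": {"networks": ["edge", "backend"]},
        "frontend": {"networks": ["backend"]},
        "inference": {"networks": ["backend"]},
    },
}

violations = externally_routable_services(EXAMPLE, allowed=frozenset({"ingress"}))
print(violations or "all non-ingress services are on internal networks")
```

A check like this covers only the declared topology; host-level egress rules and runtime monitoring would still need their own tests.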

If this is right

  • The architecture supports official hospital-wide deployment serving over 10,000 employees.
  • Text-anchored tasks such as report corrections and simplifications receive the highest utility ratings from users.
  • Open-ended conclusion generation from findings produces the highest frequency of critical errors including hallucinations and omissions.
  • Automated isolation and hardening tests in the accompanying package support repeatable secure deployments.
  • Public release of the deployment package allows other institutions to replicate the setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar isolation techniques could extend to other clinical domains that process sensitive data under strict regulations.
  • Larger-scale use beyond the 22-person pilot may surface additional operational or security edge cases.
  • Combining the system with existing hospital dictation tools could reduce the observed error rate in model outputs.
  • Widespread adoption of the public package might standardize on-premise LLM practices across European healthcare settings.

Load-bearing premise

The combination of containerization, host-enforced egress filtering, and active isolation monitoring is sufficient to prevent unauthorized external connectivity and data exposure in a live hospital environment with real users.

What would settle it

A documented successful external network connection or unauthorized data transmission from the running system during routine clinical use would falsify the security claims.
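
A falsification attempt of this kind can be scripted as an outbound probe run from inside the inference host or container. The following is a minimal sketch, not the test suite shipped with the paper's package: the function name, the target list, and the injectable `connect` parameter are illustrative assumptions. Any target that accepts a TCP connection counts as a leak.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Target = Tuple[str, int]


def find_egress_leaks(
    targets: Iterable[Target],
    timeout: float = 2.0,
    connect: Callable = socket.create_connection,
) -> List[Target]:
    """Return the targets that accepted an outbound TCP connection."""
    leaks: List[Target] = []
    for host, port in targets:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Refused, timed out, or DNS failure: the expected outcome
            # when egress is blocked (all are OSError subclasses).
            continue
        conn.close()
        leaks.append((host, port))
    return leaks


# Illustrative targets only; a real check would use destinations agreed
# with the institution's security team.
PROBE_TARGETS: List[Target] = [("example.org", 443), ("1.1.1.1", 53)]

# Usage (performs real connection attempts, so it is left commented out):
# leaks = find_egress_leaks(PROBE_TARGETS)
# if leaks:
#     raise SystemExit(f"isolation violated: {leaks}")
```

An empty result is evidence only for the probed destinations and for plain TCP; it says nothing about other channels such as DNS tunnelling.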

Figures

Figures reproduced from arXiv: 2604.22768 by Alexander Isaak, Alois Martin Sprinkart, Babak Salam, Benjamin Wulff, Jann-Frederick Laß, Julian Alexander Luetkens, Mohammed Bahaaeldin, Narine Mesropyan, Nico Piel, Sebastian Nowak, Wolfgang Block.

Figure 1. System architecture diagram: The core evaluated stack comprised browser access via the hospital intranet or VPN, an Nginx ingress proxy, an OpenWebUI frontend, and a vLLM inference backend. Security was enforced through layers of redundant isolation. Note: The custom clinical application block illustrates potential extensions of the same isolated backbone. These applications were not part of the present ra…
Figure 2. Perceived Clinical Utility by Task: Bar chart showing mean ratings for "Perceived Potential," "Output Quality," "Ease of Editing," and "Time Savings" across ten radiological tasks. Tasks are grouped by category: Report Transformation, Structuring, Coding, and General Utility.
Figure 3. Frequency of Critical Errors by Task: Stacked bar chart showing the system’s safety profile. We defined a "Critical Error" as an output containing hallucinations, omissions, or incorrect medical advice requiring intervention. "Report: Conclusion Generation" was the only task with multiple critical errors per response, which indicates that automated summarization carries higher risks than linguistic transf…
Figure 4. Radiologist Ratings of Technical Performance: Histograms show the distribution of user ratings (0-10 scale) for system stability, speed, and usability (N=22). Stability ratings were the most consistent; most responses clustered between 9 and 10. Speed ratings varied more (range 5-9), likely due to differences in prompt complexity and output length.
Original abstract

Purpose: To design, implement, evaluate, and report on the regulatory requirements of a self-hosted LLM infrastructure for radiology adhering to the principle of least privilege, emphasizing technical feasibility, network isolation, and clinical utility.

Materials and Methods: The isolation-first, containerized LLM inference stack relies on strict network segmentation, host-enforced egress filtering, and active isolation monitoring preventing unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open-weights DeepSeek-R1 model via vLLM. In a one-week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt-templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on an 0-10 Likert scale and reported observed critical errors in model output.

Results: The applied institutional governance pathway achieved approval from clinic management, compliance, data protection and information security officers for processing unanonymized PHI. The system was rated stable and user friendly during the pilot. Source text-anchored tasks, such as report corrections or simplifications, and radiology guideline recommendations received the highest utility ratings, whereas open-ended conclusion generation based on findings resulted in the highest frequency of critical errors, such as clinically relevant hallucinations or omissions.

Conclusion: The proposed isolation-first on-premise architecture enabled overcoming regulatory borders, showed promising clinical utility in text-anchored tasks and is the current base to serve open-weights LLMs as an official service of a German University Hospital with over 10,000 employees. The deployment package were made publicly available (https://github.com/ukbonn/ukb-gpt).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper describes the design and deployment of an isolation-first, containerized on-premise LLM inference system (using DeepSeek-R1 via vLLM) for radiology that enforces network segmentation, host-enforced egress filtering, and active monitoring to enable processing of unanonymized PHI. It reports successful institutional approvals from clinic management, compliance, data protection, and security officers, followed by a one-week pilot in which 22 residents and radiologists used 10 predefined prompt templates and provided 0-10 Likert ratings on clinical utility and stability plus qualitative error reports. Highest utility was reported for text-anchored tasks such as report correction; open-ended generation showed more hallucinations. The deployment package is released publicly on GitHub, and the system is positioned as the foundation for official hospital-wide service.

Significance. If the isolation controls prove robust under realistic threats, the work supplies a concrete, reproducible engineering template for compliant open-weights LLM deployment inside regulated medical environments, directly addressing data-protection barriers that currently limit clinical use. The public release of the hardening-test package and the demonstration of institutional approval pathways are concrete strengths that could accelerate adoption at other sites.

major comments (3)
  1. [Materials and Methods] Materials and Methods (isolation architecture description): the claim that containerization plus host-enforced egress filtering and active monitoring is sufficient to prevent unauthorized external connectivity rests entirely on design description; no penetration-test results, red-team outcomes, or quantitative monitoring-effectiveness metrics (e.g., detection rates for simulated exfiltration) are supplied, leaving the central security guarantee unverified against insider or advanced threats in a live PHI environment.
  2. [Results] Results (pilot evaluation): the reported clinical utility rests on subjective Likert scores and qualitative error counts from 22 voluntary users over one week; no quantitative model-performance metrics, baseline comparisons (e.g., against existing dictation tools or human-only workflows), or long-term usage statistics are provided, so the evidence does not yet support the stronger claim that the system is ready to serve as the official service for a >10,000-employee hospital.
  3. [Conclusion] Conclusion: the statement that the architecture 'enabled overcoming regulatory borders' and 'is the current base to serve open-weights LLMs as an official service' is grounded only in the one-week pilot approvals and ratings; scalability, sustained security monitoring, and error-rate data under routine high-volume use are not demonstrated within the manuscript's scope.
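
The detection-rate metric requested in major comment 1 would need an uncertainty estimate, since simulated-exfiltration trials are likely to be few. As a sketch under that assumption, the block below computes a Wilson score interval for a binomial proportion. The function name and the trial counts are hypothetical, and no such experiment is reported in the paper.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(
    detected: int, trials: int, confidence: float = 0.95
) -> Tuple[float, float]:
    """Wilson score interval for the proportion detected / trials."""
    if trials <= 0 or not 0 <= detected <= trials:
        raise ValueError("need trials > 0 and 0 <= detected <= trials")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = detected / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical outcome: every one of 20 simulated attempts was detected.
low, high = wilson_interval(detected=20, trials=20)
print(f"detection rate 95% CI: [{low:.3f}, {high:.3f}]")
```

Even a perfect 20-of-20 result leaves a lower bound of only about 0.84, which illustrates why a small number of clean test runs cannot by itself establish the security guarantee.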
minor comments (2)
  1. [Abstract] Abstract: 'The deployment package were made publicly available' contains a subject-verb agreement error ('were' should be 'was').
  2. [Results] The paper would benefit from a brief table summarizing the ten prompt templates and their observed error frequencies to make the qualitative findings more transparent.
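
The table suggested in minor comment 2 could be assembled from the survey's error reports along the following lines. This is a sketch: the record layout (one `(task, critical_errors)` pair per survey response) is an assumption, the task labels are illustrative, and the counts are placeholders rather than the pilot's data.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int, int]  # (task, responses, critical errors)


def tabulate_errors(reports: Iterable[Tuple[str, int]]) -> List[Row]:
    """Aggregate per-response error counts into one row per task."""
    responses: Counter = Counter()
    errors: Counter = Counter()
    for task, n_errors in reports:
        responses[task] += 1
        errors[task] += n_errors
    rows = [(task, responses[task], errors[task]) for task in responses]
    # Most error-prone task first; ties broken alphabetically.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


def render_table(rows: List[Row]) -> str:
    """Render the rows as a fixed-width plain-text table."""
    width = max([len("Task")] + [len(task) for task, _, _ in rows])
    lines = [f"{'Task':<{width}}  Responses  Critical errors", "-" * (width + 28)]
    for task, n_responses, n_errors in rows:
        lines.append(f"{task:<{width}}  {n_responses:>9}  {n_errors:>15}")
    return "\n".join(lines)


# Placeholder records, NOT data from the pilot.
PLACEHOLDER_REPORTS = [
    ("Report: Conclusion Generation", 2),
    ("Report: Correction & Improvement", 0),
    ("Report: Conclusion Generation", 1),
]

print(render_table(tabulate_errors(PLACEHOLDER_REPORTS)))
```

Reporting the number of responses next to the error count keeps rarely used templates from looking safer than they are.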

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications on the scope of our study as a design and short-term pilot evaluation. Where appropriate, we indicate revisions to the manuscript.

  1. Referee: [Materials and Methods] Materials and Methods (isolation architecture description): the claim that containerization plus host-enforced egress filtering and active monitoring is sufficient to prevent unauthorized external connectivity rests entirely on design description; no penetration-test results, red-team outcomes, or quantitative monitoring-effectiveness metrics (e.g., detection rates for simulated exfiltration) are supplied, leaving the central security guarantee unverified against insider or advanced threats in a live PHI environment.

    Authors: We agree that the security description is design-based. The manuscript details the isolation mechanisms, which were sufficient for institutional approval by compliance, data protection, and security officers to process unanonymized PHI. The accompanying GitHub deployment package includes automated tests for isolation and hardening. We did not perform penetration testing or provide quantitative metrics, as the work focuses on feasible deployment rather than adversarial security evaluation. We will revise the Materials and Methods to explicitly note that the controls are intended to meet regulatory requirements as verified by institutional review, without claiming robustness against all advanced threats. This is a clarification of scope.
    revision: partial

  2. Referee: [Results] Results (pilot evaluation): the reported clinical utility rests on subjective Likert scores and qualitative error counts from 22 voluntary users over one week; no quantitative model-performance metrics, baseline comparisons (e.g., against existing dictation tools or human-only workflows), or long-term usage statistics are provided, so the evidence does not yet support the stronger claim that the system is ready to serve as the official service for a >10,000-employee hospital.

    Authors: The pilot was designed to assess initial user acceptance and perceived clinical utility in a real radiology workflow, using voluntary participation over one week. We intentionally did not include quantitative LLM performance metrics, as these are available in the model's original publications, nor baselines against existing tools, since the system is an addition rather than a replacement at this stage. Long-term statistics are beyond the scope of this initial report. We will add text in the Results section to emphasize that these findings are preliminary and do not yet demonstrate readiness for full-scale hospital deployment, with ongoing monitoring planned.
    revision: partial

  3. Referee: [Conclusion] Conclusion: the statement that the architecture 'enabled overcoming regulatory borders' and 'is the current base to serve open-weights LLMs as an official service' is grounded only in the one-week pilot approvals and ratings; scalability, sustained security monitoring, and error-rate data under routine high-volume use are not demonstrated within the manuscript's scope.

    Authors: We will revise the Conclusion to better align with the presented evidence. Specifically, we will state that the architecture enabled regulatory approval for the pilot deployment and provides a foundation for serving open-weights LLMs, while acknowledging that scalability and long-term data under high-volume use remain to be demonstrated in future work. This tempers the language to reflect the pilot nature of the study.
    revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive engineering report and pilot without derivations or fitted predictions

The manuscript describes the design and implementation of an isolation-first on-premise LLM deployment for radiology, followed by a one-week pilot with 22 users. There are no mathematical equations, parameter fittings, or predictions that reduce to prior quantities by construction. Institutional approvals and user Likert-scale ratings provide independent support for the claims. No self-citation chains or ansatzes are invoked in a load-bearing manner. The derivation chain is self-contained as a technical report.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard IT security assumptions about container isolation and network controls rather than new mathematical constructs or fitted parameters.

axioms (2)
  • domain assumption: Strict network segmentation combined with host-enforced egress filtering and monitoring can reliably prevent unauthorized external connectivity for the containerized inference stack.
    Invoked in the description of the isolation-first architecture and the automated isolation tests.
  • domain assumption: The vLLM serving engine and DeepSeek-R1 model can be deployed in containers without introducing exploitable vulnerabilities that bypass the isolation controls.
    Assumed in the choice of the inference stack and the claim of regulatory approval for PHI processing.

pith-pipeline@v0.9.0 · 5655 in / 1440 out tokens · 68235 ms · 2026-05-15T00:19:01.051379+00:00 · methodology

