arxiv: 2604.22768 · v1 · submitted 2026-03-25 · 💻 cs.CY · cs.CL

Recognition: unknown

Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

Sebastian Nowak , Jann-Frederick La{\ss} , Narine Mesropyan , Babak Salam , Nico Piel , Mohammed Bahaaeldin , Wolfgang Block , Alois Martin Sprinkart

show 3 more authors

Julian Alexander Luetkens Benjamin Wulff Alexander Isaak

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:19 UTC · model grok-4.3

classification 💻 cs.CY cs.CL

keywords on-premise LLM deploymentradiology AInetwork isolationdata privacyclinical utility evaluationopen-weights modelshospital IT securityPHI processing

0 comments

The pith

An isolation-first on-premise architecture enables secure deployment of open-weights LLMs in radiology while processing unanonymized patient data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a containerized LLM system built for hospital use that keeps all inference inside the institution's network through strict segmentation, egress filtering, and monitoring. This setup received formal approval from compliance, data protection, and security officers to handle real patient information. In a one-week pilot, 22 radiologists rated the system stable and found it most helpful for editing reports and retrieving guidelines, while open-ended summary tasks produced more hallucinations. The architecture now supports an official service for a large German university hospital. The authors released the full deployment package for others to adopt.

Core claim

The isolation-first containerized LLM inference stack, relying on strict network segmentation, host-enforced egress filtering, and active isolation monitoring, overcomes regulatory barriers for on-premise use of open-weights models such as DeepSeek-R1 in radiology. This enables processing of unanonymized PHI under institutional governance, with pilot results showing highest clinical utility for text-anchored tasks like report correction and guideline lookup, while open-ended generation tasks exhibit more critical errors such as hallucinations.

What carries the argument

The isolation-first containerized LLM inference stack with host-enforced egress filtering and active isolation monitoring.

If this is right

The architecture supports official hospital-wide deployment serving over 10,000 employees.
Text-anchored tasks such as report corrections and simplifications receive the highest utility ratings from users.
Open-ended conclusion generation from findings produces the highest frequency of critical errors including hallucinations and omissions.
Automated isolation and hardening tests in the accompanying package support repeatable secure deployments.
Public release of the deployment package allows other institutions to replicate the setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar isolation techniques could extend to other clinical domains that process sensitive data under strict regulations.
Larger-scale use beyond the 22-person pilot may surface additional operational or security edge cases.
Combining the system with existing hospital dictation tools could reduce the observed error rate in model outputs.
Widespread adoption of the public package might standardize on-premise LLM practices across European healthcare settings.

Load-bearing premise

The combination of containerization, host-enforced egress filtering, and active isolation monitoring is sufficient to prevent unauthorized external connectivity and data exposure in a live hospital environment with real users.

What would settle it

A documented successful external network connection or unauthorized data transmission from the running system during routine clinical use would falsify the security claims.

Figures

Figures reproduced from arXiv: 2604.22768 by Alexander Isaak, Alois Martin Sprinkart, Babak Salam, Benjamin Wulff, Jann-Frederick La{\ss}, Julian Alexander Luetkens, Mohammed Bahaaeldin, Narine Mesropyan, Nico Piel, Sebastian Nowak, Wolfgang Block.

**Figure 1.** Figure 1: System architecture diagram: The core evaluated stack comprised browser access via the hospital intranet or VPN, an Nginx ingress proxy, an OpenWebUI frontend, and a vLLM inference backend. Security was enforced through layers of redundant isolation. Note: The custom clinical application block illustrates potential extensions of the same isolated backbone. These applications were not part of the present ra… view at source ↗

**Figure 2.** Figure 2: Perceived Clinical Utility by Task: Bar chart showing mean ratings for "Perceived Potential," "Output Quality," "Ease of Editing," and "Time Savings" across ten radiological tasks. Tasks are grouped by category: Report Transformation, Structuring, Coding, and General Utility. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

**Figure 3.** Figure 3: Frequency of Critical Errors by Task: Stacked bar chart showing the system’s safety profile. We defined a "Critical Error" as an output containing hallucinations, omissions, or incorrect medical advice requiring intervention. "Report: Conclusion Generation" was the only task with multiple critical errors per response, which indicates that automated summarization carries higher risks than linguistic transf… view at source ↗

**Figure 4.** Figure 4: Radiologist Ratings of Technical Performance: Histograms show the distribution of user ratings (0-10 scale) for system stability, speed, and usability (N=22). Stability ratings were the most consistent; most responses clustered between 9 and 10. Speed ratings varied more (range 5-9), likely due to differences in prompt complexity and output length. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

read the original abstract

Purpose: To design, implement, evaluate, and report on the regulatory requirements of a self-hosted LLM infrastructure for radiology adhering to the principle of least privilege, emphasizing technical feasibility, network isolation, and clinical utility. Materials and Methods: The isolation-first, containerized LLM inference stack relies on strict network segmentation, host-enforced egress filtering, and active isolation monitoring preventing unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open-weights DeepSeek-R1 model via vLLM. In a one-week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt-templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on an 0-10 Likert scale and reported observed critical errors in model output. Results: The applied institutional governance pathway achieved approval from clinic management, compliance, data protection and information security officers for processing unanonymized PHI. The system was rated stable and user friendly during the pilot. Source text-anchored tasks, such as report corrections or simplifications, and radiology guideline recommendations received the highest utility ratings, whereas open-ended conclusion generation based on findings resulted in the highest frequency of critical errors, such as clinically relevant hallucinations or omissions. Conclusion: The proposed isolation-first on-premise architecture enabled overcoming regulatory borders, showed promising clinical utility in text-anchored tasks and is the current base to serve open-weights LLMs as an official service of a German University Hospital with over 10,000 employees. The deployment package were made publicly available (https://github.com/ukbonn/ukb-gpt).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They got a containerized on-prem LLM approved for real PHI in a German radiology department and ran a one-week pilot with 22 users showing it helps with report editing but not open-ended generation.

read the letter

The core of this paper is a working example of an isolation-first setup for running open-weights models like DeepSeek-R1 on hospital hardware. They used containerization, strict egress filtering, and monitoring to clear approvals from clinic management, compliance, data protection, and security teams for unanonymized patient data. The public deployment package on GitHub is a practical addition that others can try directly. The one-week pilot with 22 radiologists and residents gave the system decent marks for stability and usefulness on text-anchored jobs such as report corrections and guideline lookups, while open-ended conclusion writing produced more hallucinations and omissions. That distinction is useful to see in a real setting. The approvals and the fact that the system is now an official hospital service are the strongest parts. The pilot data is thin, though. It rests on voluntary users, short duration, and subjective 0-10 ratings plus error notes, with no baseline comparisons, quantitative accuracy measures, or longer-term logs. Security claims stay at the design level; there are no penetration tests, red-team results, or metrics on how well the monitoring catches exfiltration attempts. This is an engineering report rather than a methods advance, so the evidence supports feasibility in one European hospital but does not prove the controls would hold against determined insiders or advanced threats. Readers in hospital IT or medical informatics groups will find the concrete stack and regulatory path worth looking at. It deserves peer review because the approvals and pilot give usable real-world detail that similar efforts can build on, even if the evaluation needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper describes the design and deployment of an isolation-first, containerized on-premise LLM inference system (using DeepSeek-R1 via vLLM) for radiology that enforces network segmentation, host-enforced egress filtering, and active monitoring to enable processing of unanonymized PHI. It reports successful institutional approvals from clinic management, compliance, data protection, and security officers, followed by a one-week pilot in which 22 residents and radiologists used 10 predefined prompt templates and provided 0-10 Likert ratings on clinical utility and stability plus qualitative error reports. Highest utility was reported for text-anchored tasks such as report correction; open-ended generation showed more hallucinations. The deployment package is released publicly on GitHub, and the system is positioned as the foundation for official hospital-wide service.

Significance. If the isolation controls prove robust under realistic threats, the work supplies a concrete, reproducible engineering template for compliant open-weights LLM deployment inside regulated medical environments, directly addressing data-protection barriers that currently limit clinical use. The public release of the hardening-test package and the demonstration of institutional approval pathways are concrete strengths that could accelerate adoption at other sites.

major comments (3)

[Materials and Methods] Materials and Methods (isolation architecture description): the claim that containerization plus host-enforced egress filtering and active monitoring is sufficient to prevent unauthorized external connectivity rests entirely on design description; no penetration-test results, red-team outcomes, or quantitative monitoring-effectiveness metrics (e.g., detection rates for simulated exfiltration) are supplied, leaving the central security guarantee unverified against insider or advanced threats in a live PHI environment.
[Results] Results (pilot evaluation): the reported clinical utility rests on subjective Likert scores and qualitative error counts from 22 voluntary users over one week; no quantitative model-performance metrics, baseline comparisons (e.g., against existing dictation tools or human-only workflows), or long-term usage statistics are provided, so the evidence does not yet support the stronger claim that the system is ready to serve as the official service for a >10,000-employee hospital.
[Conclusion] Conclusion: the statement that the architecture 'enabled overcoming regulatory borders' and 'is the current base to serve open-weights LLMs as an official service' is grounded only in the one-week pilot approvals and ratings; scalability, sustained security monitoring, and error-rate data under routine high-volume use are not demonstrated within the manuscript's scope.

minor comments (2)

[Abstract] Abstract: 'The deployment package were made publicly available' contains a subject-verb agreement error ('were' should be 'was').
[Results] The paper would benefit from a brief table summarizing the ten prompt templates and their observed error frequencies to make the qualitative findings more transparent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications on the scope of our study as a design and short-term pilot evaluation. Where appropriate, we indicate revisions to the manuscript.

read point-by-point responses

Referee: [Materials and Methods] Materials and Methods (isolation architecture description): the claim that containerization plus host-enforced egress filtering and active monitoring is sufficient to prevent unauthorized external connectivity rests entirely on design description; no penetration-test results, red-team outcomes, or quantitative monitoring-effectiveness metrics (e.g., detection rates for simulated exfiltration) are supplied, leaving the central security guarantee unverified against insider or advanced threats in a live PHI environment.

Authors: We agree that the security description is design-based. The manuscript details the isolation mechanisms, which were sufficient for institutional approval by compliance, data protection, and security officers to process unanonymized PHI. The accompanying GitHub deployment package includes automated tests for isolation and hardening. We did not perform penetration testing or provide quantitative metrics, as the work focuses on feasible deployment rather than adversarial security evaluation. We will revise the Materials and Methods to explicitly note that the controls are intended to meet regulatory requirements as verified by institutional review, without claiming robustness against all advanced threats. This is a clarification of scope. revision: partial
Referee: [Results] Results (pilot evaluation): the reported clinical utility rests on subjective Likert scores and qualitative error counts from 22 voluntary users over one week; no quantitative model-performance metrics, baseline comparisons (e.g., against existing dictation tools or human-only workflows), or long-term usage statistics are provided, so the evidence does not yet support the stronger claim that the system is ready to serve as the official service for a >10,000-employee hospital.

Authors: The pilot was designed to assess initial user acceptance and perceived clinical utility in a real radiology workflow, using voluntary participation over one week. We intentionally did not include quantitative LLM performance metrics, as these are available in the model's original publications, nor baselines against existing tools, since the system is an addition rather than a replacement at this stage. Long-term statistics are beyond the scope of this initial report. We will add text in the Results section to emphasize that these findings are preliminary and do not yet demonstrate readiness for full-scale hospital deployment, with ongoing monitoring planned. revision: partial
Referee: [Conclusion] Conclusion: the statement that the architecture 'enabled overcoming regulatory borders' and 'is the current base to serve open-weights LLMs as an official service' is grounded only in the one-week pilot approvals and ratings; scalability, sustained security monitoring, and error-rate data under routine high-volume use are not demonstrated within the manuscript's scope.

Authors: We will revise the Conclusion to better align with the presented evidence. Specifically, we will state that the architecture enabled regulatory approval for the pilot deployment and provides a foundation for serving open-weights LLMs, while acknowledging that scalability and long-term data under high-volume use remain to be demonstrated in future work. This tempers the language to reflect the pilot nature of the study. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive engineering report and pilot without derivations or fitted predictions

full rationale

The manuscript describes the design and implementation of an isolation-first on-premise LLM deployment for radiology, followed by a one-week pilot with 22 users. There are no mathematical equations, parameter fittings, or predictions that reduce to prior quantities by construction. Institutional approvals and user Likert-scale ratings provide independent support for the claims. No self-citation chains or ansatzes are invoked in a load-bearing manner. The derivation chain is self-contained as a technical report.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard IT security assumptions about container isolation and network controls rather than new mathematical constructs or fitted parameters.

axioms (2)

domain assumption Strict network segmentation combined with host-enforced egress filtering and monitoring can reliably prevent unauthorized external connectivity for the containerized inference stack.
Invoked in the description of the isolation-first architecture and the automated isolation tests.
domain assumption The vLLM serving engine and DeepSeek-R1 model can be deployed in containers without introducing exploitable vulnerabilities that bypass the isolation controls.
Assumed in the choice of the inference stack and the claim of regulatory approval for PHI processing.

pith-pipeline@v0.9.0 · 5655 in / 1440 out tokens · 68235 ms · 2026-05-15T00:19:01.051379+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

87 extracted references · 68 canonical work pages · 4 internal anchors

[1]

Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment

Savage CH, Kanhere A, Parekh V, et al. Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment. Radiology. 2025;314(1):e241073

2025
[2]

Generative Pre-trained Transformer 4 makes cardiovascular magnetic resonance reports easy to understand

Salam B, Kravchenko D, Nowak S, et al. Generative Pre-trained Transformer 4 makes cardiovascular magnetic resonance reports easy to understand. Journal of Cardiovascular Magnetic Resonance. 2024;26(1):101035

2024
[3]

Evaluation of GPT-4o for multilingual translation of radiology reports across imaging modalities

Terzis R, Salam B, Nowak S, et al. Evaluation of GPT-4o for multilingual translation of radiology reports across imaging modalities. European Journal of Radiology. 2025;191:112341

2025
[4]

Privacy-ensuring Open-weights Large Lan- guage Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports

Nowak S, Wulff B, Layer YC, et al. Privacy-ensuring Open-weights Large Lan- guage Models Are Competitive with Closed-weights GPT-4o in Extracting Chest Radiography Findings from Free-Text Reports. Radiology. 2025;314(1):e240895

2025
[5]

Comparison of the Quality of Discharge Letters Written by Large Language Models and Junior Clinicians: Single-Blinded Study

Tung JYM, Gill SR, Sng GGR, et al. Comparison of the Quality of Discharge Letters Written by Large Language Models and Junior Clinicians: Single-Blinded Study. J Med Internet Res. 2024;26:e57721

2024
[6]

The growing issue of burnout in radiology — a survey-based evaluation of driving factors and potential impacts in pediatric radiologists

Ayyala RS, Baird GL, Sze RW, Brown BP, Taylor GA. The growing issue of burnout in radiology — a survey-based evaluation of driving factors and potential impacts in pediatric radiologists. Pediatr Radiol. 2020;50(8):1071-1077

2020
[7]

Google-Health/medasr

Google Research. Google-Health/medasr. https://github.com/Google-Health/medasr. Published December 19, 2025. Accessed March 23, 2026

2025
[8]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978. 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon W, Li Z, Zhuang S, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180. 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks

Ohm M, Plate H, Sykosch A, Meier M. Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks. In: Maurice C, Bilge L, Stringhini G, Neves N, eds. Detection of Intrusions and Malware, and Vulnerability Assessment. Springer International Publishing; 2020:23-43

2020
[11]

ukbonn/ukb-gpt

University Hospital Bonn. ukbonn/ukb-gpt. GitHub. https://github.com/ukbonn/ukb-gpt. Published March 17, 2026. Accessed March 23, 2026

2026
[12]

docker/compose

Docker. docker/compose. GitHub. https://github.com/docker/compose. Published December 9, 2013. Accessed March 23, 2026

2013
[13]

nginx. nginx. GitHub. https://github.com/nginx/nginx. Published June 23, 2015. Accessed March 23, 2026

2015
[14]

Open WebUI: An Open, Extensible, and Usable Interface for AI Interaction

Baek J, Hussain A, Liu D, Vincent N, Kim LH. Open WebUI: An Open, Extensible, and Usable Interface for AI Interaction. arXiv:2510.02546. 2025. 16

work page arXiv 2025
[15]

vllm-project/vllm

vLLM. vllm-project/vllm. GitHub. https://github.com/vllm-project/vllm. Published February 9, 2023. Accessed March 23, 2026

2023
[16]

prometheus

Prometheus. prometheus. GitHub. https://github.com/prometheus/prometheus. Pub- lished November 24, 2012. Accessed March 23, 2026

work page 2012
[17]

Grafana Labs. grafana. GitHub. https://github.com/grafana/grafana. Published December 11, 2013. Accessed March 13, 2026

work page 2013
[18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Guo D, Yang D, Zhang H, Song J, Wang P, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature; 2025;645(8081):633–8

work page 2025
[19]

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down har- monised rules on artificial intelligence (Artificial Intelligence Act)

European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down har- monised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. 2024

work page 2024
[20]

Pickle Scanning

Hugging Face. Pickle Scanning. https://huggingface.co/docs/hub/security-pickle. Accessed March 23, 2026

work page 2026
[21]

In: 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021)

CheahCS,SelvarajahV.AReviewofCommonWebApplicationBreachingTechniques (SQLi, XSS, CSRF). In: 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021). Atlantis Press; 2021:540-547

work page 2021
[22]

A Container Security Survey: Exploits, Attacks, and Defenses

Jarkas O, Ko R, Dong N, Mahmud R. A Container Security Survey: Exploits, Attacks, and Defenses. ACM Comput Surv. 2025;57(7):1-36

work page 2025
[23]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, Agarwal S, Ahmad L, Ai J, Altman S, Applebaum A, et al. gpt-oss-120b & gpt-oss-20b Model Card. arXiv; 2025; arXiv.2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Qwen3 Technical Report

Yang A, Li A, Yang B, Zhang B, Hui B, Zheng B et al. Qwen3 Technical Report. arXiv; 2025; arXiv.2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

ICD-10: international statistical classification of diseases and related health problems: tenth revision

World Health Organization. ICD-10: international statistical classification of diseases and related health problems: tenth revision. https://iris.who.int/items/ab0c8de2- 762b-463e-b6ca-51af1753dbf3. Accessed March 23, 2026

work page 2026
[26]

Improving Rare and Common ICD Coding via a Multi-Agent LLM-Based Approach

Li R, Wang X, Yu H. Improving Rare and Common ICD Coding via a Multi-Agent LLM-Based Approach. In: Proceedings of the 34th ACM International Conference on Information and Knowledge; 2025:4945–9

work page 2025
[27]

Assessing GPT-3.5 and GPT-4 in Generating International Classification of Diseases Billing Codes

Soroush A, Glicksberg BS, Zimlichman E, Barash Y, Freeman R, Charney AW, et al. Assessing GPT-3.5 and GPT-4 in Generating International Classification of Diseases Billing Codes. medRxiv; 2023. https://doi.org/10.1101/2023.07.07.23292391

work page doi:10.1101/2023.07.07.23292391 2023
[28]

Automated clinical coding using off-the-shelf large language models

Boyle JS, Kascenas A, Lok P, Liakata M, O’Neil AQ. Automated clinical coding using off-the-shelf large language models. arXiv; 2023; arXiv.2310.06552

work page arXiv 2023
[29]

OpenClaw

Peter Steinberger. OpenClaw. https://github.com/openclaw/openclaw. Published November 24, 2025. Accessed March 23, 2026

work page 2025
[30]

Liver unremarkable, size approx. 15 cm in the right hepatic lobe

NVIDIA Corporation. NemoClaw. https://github.com/NVIDIA/NemoClaw. Pub- lished March 15, 2026. Accessed March 23, 2026. 17 A Full Prompt Texts The following sections contain the exact prompt texts used in the study. A.1 Prompt 1: Report: Correction & Improvement ## Radiological Report - Review & Commenting Act as an experienced radiologist. Your task is to...

work page 2026
[31]

Synthetically summarize the **most important pathological and clinically relevant findings** of the report

work page
[32]

Primarily answer the questions implied by the **indication**

work page
[33]

Provide the referring medical colleague with a **clear basis for decision-making** regarding further management
[34]

Be based **exclusively** on the information provided in the report text
[35]

The assessment is:

Size measurements that are already mentioned in the report should not be repeated. **Instructions for creating the assessment**: * **Focus & Relevance:** * Focus exclusively on the **essential and clinically relevant** results. * Establish a clear connection to the **indication**. Answer the clinical question. * Only mention normal findings if they are ex...

work page
[36]

The complete report of a previous examination (reference for stable findings and baseline values). 19

work page
[37]

Status idem regarding

A memory protocol/notes on the changes, relevant observations, or new aspects in the current examination compared to the previous one. **Processing steps:** **0. Internal Analysis and Planning Strategy:** * **Before you create the actual report**, outline your thoughts and strategy here. * Analyze the two inputs (previous report, notes) carefully. * Expli...

work page
[38]

**Medical Correctness:** The explanation must remain absolutely medically correct and must not distort the original meaning of the report or omit important information
[39]

Avoid complex sentence structures

**Simplicity and Clarity:** Use short sentences and everyday language. Avoid complex sentence structures

work page
[40]

liver" instead of

**No Jargon:** * Consistently replace medical terminology with understandable descriptions or everyday explanations. * If a Latin/medical name for an organ or structure is given, use the common English name instead (e.g., "liver" instead of "hepar"). * Explain abbreviations if they are not absolutely commonplace

work page
[41]

You had an X-ray of your lungs to see if everything is okay there

**Structure and Logic:** Organize the explanation logically and clearly. A good structure could be: * **Introduction:** A brief, simple mention of what kind of examination was done and what was generally examined (e.g., "You had an X-ray of your lungs to see if everything is okay there."). * **Main Results:** What are the central findings? Describe these ...

work page
[42]

**Enhancing Understanding:** Where appropriate, use simple analogies or everyday comparisons to make complex medical contexts clearer, but only if they are truly fitting and neither trivializing nor misleading

work page
[43]

Avoid overly trivializing or unnecessarily alarming language

**Tone:** Choose an empathetic, calm, and objectively informative tone. Avoid overly trivializing or unnecessarily alarming language. The explanation should be informative and supportive

work page
[44]

e.g," to

**Important Note (Disclaimer):** Always add a short, standardized note at the end of the explanation, roughly in this form: "Please note: This explanation is intended to help you better understand your report. However, it does not replace a personal conversation with your attending physician, who knows your entire medical situation and can answer all your...

work page 2019
[45]

* High risk: Optional FU with CT at 12 months (esp

**Single solid SN:** * **<6 mm (<100 mm 3):** * Low risk: No routine FU. * High risk: Optional FU with CT at 12 months (esp. for suspicious morphology, upper lobe location). * **6-8 mm (100-250 mm 3):** * Low risk: CT at 6-12 months, then consider CT at 18-24 months. * High risk: CT at 6-12 months, then CT at 18-24 months. * **>8 mm (>250 mm 3):** * CT at...
[46]

Single solid SN

**Multiple solid SN:** * **All SN <6 mm:** * Low risk: No routine FU. * High risk: Optional FU with CT at 12 months. * **At least 1 SN >=6 mm:** * Management is based on the most suspicious SN (analogous to "Single solid SN"). * Initial CT at 3-6 months, then optional CT at 18-24 months (risk-adapted). **B. SUBSOLID NODULES (SSN)** 26

work page
[47]

(Exception: In high-risk patients or suspicious morphology, an FU at 2 and 4 years can be considered)

**Single pure Ground Glass Nodule (GGN):** * **<6 mm (<100 mm 3):** No routine FU. (Exception: In high-risk patients or suspicious morphology, an FU at 2 and 4 years can be considered). * **>=6 mm (>100 mm 3):** CT at 6-12 months to confirm persistence, then every 2 years for a total of 5 years

work page
[48]

* **>=6 mm (>100 mm 3):** * **Solid component remains <6 mm:** CT at 3-6 months to confirm persistence

**Single Part-Solid Nodule (PSN):** * **<6 mm (<100 mm 3):** No routine FU. * **>=6 mm (>100 mm 3):** * **Solid component remains <6 mm:** CT at 3-6 months to confirm persistence. If persistent, annual CT for 5 years. * **Solid component >=6 mm (initially or during follow-up):** CT at 3-6 months. If persistent and solid component is suspicious (e.g., grow...

work page
[49]

If persistent, consider FU at 2 and 4 years (risk-adapted)

**Multiple SSN (<6 mm pure GGNs):** * CT at 3-6 months. If persistent, consider FU at 2 and 4 years (risk-adapted)
[50]

absent" or

**Multiple SSN (with at least 1 nodule >=6 mm or PSN):** * Management is based on the most suspicious nodule. Initial CT at 3-6 months. **Important Notes:** * Follow-up intervals are often given as ranges to account for individual factors and patient preferences. * Perifissural nodules (typical intrapulmonary lymph nodes) <10mm: Usually no FU needed if ty...

work page 2017
[51]

No specific risk factors mentioned in the report, therefore low risk assumed unless nodule morphology/location is suspicious

**Relevant Report Details and Risk Assessment:** * Nodule(s): (Type, size, number, location of the relevant nodule(s)) * Patient Risk Factors (per report): (List of risk factors mentioned in the report or statement "No specific risk factors mentioned in the report, therefore low risk assumed unless nodule morphology/location is suspicious.") * Resulting R...

work page
[52]

**Justification for Management Recommendation:** (Brief explanation of which findings and risk assessment led to the Fleischner classification and which specific path of the guidelines is being applied.)

work page
[53]

**Management Recommendations according to FLEISCHNER SOCIETY GUIDELINES:** (Specific recommendation based on the classification and risk category.)
[54]

**Additional Relevant Findings (if any):** (e.g., non-pulmonary incidental findings)
[55]

atypical nodule

**Brief Summary for the Radiologist:** (A concise, **bolded** summary of the core message and the main recommendation(s) for a quick overview. Maximum 2-3 sentences.) ## Report for Analysis: ###### Guideline Recommendation (PI-RADS v2.1 2019): for Prostate MRI reports ###### ## Recommendation according to PI-RADS v2.1 **Role**: You are an AI assistant for...

work page 2019
[57]

**PI-RADS Category per Lesion with Justification (based on T2W, DWI, DCE scores and the matrices above):**

work page
[58]

**Identification of the Index Lesion:**

work page
[60]

**Other Relevant Observations in the Context of the PI-RADS Guideline:** **INSTRUCTIONS FOR THE RESPONSE:** * If you obviously lack other information in the report below to perform an adequate classification, clearly state this missing information (**in bold**) and request it before creating your classification! If the context is clear and no information ...

work page
[61]

**Identified Lesions (Localization by Sector Map, Size):**

work page
[62]

**Justification for the Classification:** (Brief explanation of which findings led to the classification.) 31

work page
[63]

**PI-RADS Category per Lesion with Justification:** (based on T2W, DWI, DCE scores and the matrices above)

work page
[64]

**Management Recommendations according to PI-RADS v2.1:** (Based on the classification AND the clinical context)

work page
[65]

**Staging Notes (EPE, SVI, Lymph Nodes, Bones):**

work page
[66]

Napkin-Ring Sign

**Additional Relevant Findings (if any):** (e.g., non-coronary cardiac or extra-cardiac findings) ## Report for Analysis: ###### Guideline Recommendation (CAD-RADS v2.0 2019): for Coronary CT Angiography reports ###### ## Recommendation according to CAD-RADS v2.0 **Role**: You are an AI assistant for radiologists, specialized in the reporting of Coronary ...

work page 2019
[68]

**CAD-RADS Classification:** [e.g., CAD-RADS 3/P2/HRP/I+]

work page
[69]

**Management Recommendations according to CAD-RADS 2.0:** (Based on the classification AND the clinical context)

work page
[70]

cystic mass

**Additional Relevant Findings (if any):** (e.g., non-coronary cardiac or extra-cardiac findings) ## Report for Analysis: ###### Guideline Recommendation (Bosniak v2019): Classification of Cystic Renal Masses ###### ## Recommendation according to Bosniak v2019 **Role**: You are an AI assistant for radiologists, specialized in the reporting and classificat...

work page 2019
[71]

Cystic masses with thin (<=2mm) and few (1-3) septa; septa/wall *may* enhance; calcifications *of any type* permitted

work page
[72]

Homogeneous hyperattenuating masses (>=70 HU) on unenhanced CT

work page
[73]

Homogeneous, non-enhancing masses >20 HU on renal protocol CT; calcifications *of any type* permitted

work page
[74]

Homogeneous masses -9 to 20 HU on unenhanced CT

work page
[75]

Homogeneous masses 21 to 30 HU on portal venous phase CT

work page
[76]

* **MRI Types:**

Homogeneous, low-attenuating masses too small to characterize. * **MRI Types:**

work page
[77]

Cystic masses with thin (<=2mm) and few (1-3) *enhancing* septa; any non-enhancing septa; calcifications *of any type* permitted (if visible)

work page
[78]

Homogeneous masses, markedly hyperintense on T2w (CSF-like) on unenhanced MRI

work page
[79]

Too small to characterize

Homogeneous masses, markedly hyperintense on T1w (approx. 2.5x parenchymal signal) on unenhanced MRI. * **Implication:** Benign or highly likely benign, usually no follow-up. **Bosniak IIF (F for Follow-up):** * **CT & MRI (Type 1):** * Smooth, minimally thickened (3mm) *enhancing* wall OR * Smooth, minimally thickened (3mm) one or more *enhancing* septa ...

work page
[80]

**Justification for the Classification:** (Brief explanation of which findings led to the classification.)

work page
[81]

**Bosniak Classification:**

work page
[82]

**Management Recommendations Bosniak:** (Based on the classification AND the clinical context)

work page
[83]

Custom Use Case

**Additional Relevant Findings (if any):** ## Report for Analysis: 37 B Online Survey Questions The following questionnaire was translated into an online survey (Google Forms) to evaluate the LLM use cases in daily radiological practice.(Note: The questions below have been translated from German to English for the purpose of this publication.) Part 1: Gen...

work page

Showing first 80 references.