
arxiv: 2605.08439 · v2 · pith:7MYBZAZXnew · submitted 2026-05-08 · 💻 cs.CL

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

Pith reviewed 2026-05-20 22:27 UTC · model grok-4.3

keywords: large language models · side effects · breast cancer · radiation therapy · oncology · survivorship care · clinician-curated lists · precision and recall

The pith

Large language models under-recall rare and long-term side effects when listing breast cancer radiation toxicities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests seven instruction-tuned language models on 21 breast cancer patient profiles to see how well they generate lists of side effects for different radiotherapy regimens. Outputs are compared against a reference that maps dose-fractionation, fields, and locations to toxicities by frequency and temporal onset, built from informed consent documents at two academic centers by a team including more than seven breast radiation oncologists. The models prove sensitive to small changes in patient documentation, under-recall infrequent and delayed effects, and lose precision when the number of side effects they may name is capped. Grounding the models in the clinician-curated list, by contrast, raises reliability and consistency across the tested scenarios.

Core claim

When prompted to list radiation side effects for breast cancer, large language models systematically under-recall rare and long-term toxicities relative to a clinician-curated reference derived from informed consent documents. They are also sensitive to minor input variations, and capping the number of side effects they may list reduces precision. Directly grounding outputs in the clinician-curated side effect lists improves reliability and robustness.

What carries the argument

The argument rests on a deployment-oriented stress-testing framework: it constructs paired patient scenarios that differ only in radiotherapy regimen and scores LLM outputs against the clinician-curated reference, broken down by frequency and temporal onset.
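
To make the scoring step concrete, here is a minimal sketch of precision and recall against a reference list, with recall broken down by frequency and onset. It is not the paper's implementation: the exact-match rule after lowercasing, the function names, the stratum labels, and the four-row toy reference are all assumptions made for illustration, and the paper may match terms quite differently.

```python
from collections import defaultdict


def normalize(term):
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(term.lower().split())


def precision_recall(predicted, reference):
    """Set-based precision and recall of predicted terms against reference terms."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in reference}
    hits = pred & ref
    precision = len(hits) / len(pred) if pred else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


def recall_by_stratum(predicted, reference_rows, key):
    """Recall computed separately for each value of reference_rows[i][key]."""
    pred = {normalize(t) for t in predicted}
    totals = defaultdict(int)
    found = defaultdict(int)
    for row in reference_rows:
        totals[row[key]] += 1
        if normalize(row["term"]) in pred:
            found[row[key]] += 1
    return {stratum: found[stratum] / totals[stratum] for stratum in totals}


# Hypothetical toy reference; labels are illustrative, not taken from the paper.
REFERENCE = [
    {"term": "fatigue", "frequency": "common", "onset": "acute"},
    {"term": "skin redness", "frequency": "common", "onset": "acute"},
    {"term": "lymphedema", "frequency": "uncommon", "onset": "late"},
    {"term": "rib fracture", "frequency": "rare", "onset": "late"},
]
# Hypothetical model output for one patient profile.
MODEL_OUTPUT = ["Fatigue", "skin redness", "nausea"]

p, r = precision_recall(MODEL_OUTPUT, [row["term"] for row in REFERENCE])
by_freq = recall_by_stratum(MODEL_OUTPUT, REFERENCE, "frequency")
by_onset = recall_by_stratum(MODEL_OUTPUT, REFERENCE, "onset")
print(p, r, by_freq, by_onset)
```

In this toy case overall recall is 0.5, yet recall is 1.0 on the common, acute rows and 0.0 on the rare and late rows. That is the kind of gap a single aggregate score would hide, which is why the stratified breakdown matters to the under-recall claim.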

If this is right

  • Grounding model outputs in clinician-curated lists offers a concrete way to raise reliability for survivorship tasks.
  • Models should not be deployed alone for comprehensive side-effect communication because of consistent under-recall of rare events.
  • Prompt designs must avoid hard limits on output size to preserve precision.
  • Small changes in documentation can alter model behavior, so input standardization matters for consistent use.
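
The input-sensitivity point can be illustrated with a simple overlap score between the lists a model produces for the two versions of a paired profile. The sketch below uses Jaccard similarity, which is an assumption chosen for simplicity; this summary does not state which robustness metric the paper uses, and the two output lists are invented for the example.

```python
def jaccard(list_a, list_b):
    """Jaccard similarity between two term lists after simple normalization."""
    set_a = {" ".join(t.lower().split()) for t in list_a}
    set_b = {" ".join(t.lower().split()) for t in list_b}
    union = set_a | set_b
    if not union:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(union)


# Hypothetical outputs for the same patient with less and more detailed
# radiation documentation.
base_output = ["fatigue", "skin redness", "breast swelling"]
specified_output = ["Fatigue", "skin redness", "lymphedema", "cough"]

overlap = jaccard(base_output, specified_output)
print(round(overlap, 2))
```

A score near 1 would mean the documentation change barely moved the output; lower scores flag pairs where a small edit to the input changed what the model told the patient.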

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evaluation approach could be applied to other cancer types or treatment modalities to check whether the under-recall pattern holds.
  • Hybrid workflows that combine model generation with clinician review might address the gaps without requiring perfect standalone performance.
  • Testing the framework on real electronic health record excerpts rather than constructed profiles would reveal how well it transfers outside controlled settings.

Load-bearing premise

The clinician-curated reference list derived from informed consent documents is treated as a complete and accurate gold standard for all toxicities linked to the tested regimens.

What would settle it

A controlled review of actual patient records or expert consensus that finds a documented toxicity absent from the reference list, or shows models correctly surfacing a side effect the reference omitted, would undermine the evaluation results.

Figures

Figures reproduced from arXiv: 2605.08439 by Danielle S. Bitterman, Daphna Spiegel, Natalie Seah, Thomas Hartvigsen.

Figure 1: Deployment-oriented evaluation framework. Breast cancer patient profiles are constructed in paired base and specified forms that differ only in radiation documentation specificity. Profiles are converted into prompts and passed to large language models, which generate side-effect lists. Outputs are evaluated along two axes: robustness to documentation perturbations and accuracy relative to a clinician-cura…

Original abstract

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates seven instruction-tuned LLMs on generating side-effect lists for breast cancer radiotherapy using 21 patient profiles that form paired scenarios differing only in dose-fractionation, fields, and locations. Outputs are compared against a clinician-curated reference list derived from informed consent documents at two academic medical centers and developed with input from more than seven breast radiation oncologists; the reference maps regimens to toxicities broken down by frequency and temporal onset. The study reports model sensitivity to minor documentation changes, precision-recall trade-offs, systematic under-recall of rare and long-term effects, and substantial reliability gains when outputs are grounded in the clinician-curated lists.

Significance. If the central findings hold, the work supplies a practical, deployment-oriented stress-testing framework for LLM use in oncology survivorship and informed-consent settings. The explicit demonstration that grounding in an independently constructed clinician reference improves robustness, together with the identification of under-recall patterns for rare/long-term toxicities, offers actionable design guidance for safer clinical applications. The use of paired scenarios that isolate regimen differences and the multi-center clinician reference are notable strengths that enhance the evaluation's relevance.

major comments (2)
  1. [Methods (reference construction) and Results (under-recall analysis)] The claim of systematic under-recall of rare and long-term side effects (abstract and results) rests on treating the clinician-curated reference as a complete gold standard. Because the reference is constructed from informed consent documents, which are required to emphasize common, actionable risks rather than exhaustively enumerate every literature-reported toxicity (e.g., very rare cardiac, pulmonary, or secondary-malignancy effects with long latency), any observed under-recall may be partly an artifact of reference incompleteness rather than a pure model limitation. A concrete validation step against broader oncology literature or additional expert review is needed to separate these effects.
  2. [Evaluation setup and Results sections] The reported improvements from grounding and the precision-recall trade-offs under different prompting regimes lack accompanying details on inter-rater reliability for the reference list, the exact prompting templates, and statistical tests for the observed differences. Without these, the strength of evidence for the central claims about model behavior and the benefits of grounding cannot be fully assessed.
minor comments (2)
  1. [Abstract] The abstract would benefit from explicitly stating the number of models and the precise prompting regimes tested.
  2. [Methods] Notation for frequency and temporal-onset categories in the reference mapping could be clarified with a small example table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating where we agree and the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods (reference construction) and Results (under-recall analysis)] The claim of systematic under-recall of rare and long-term side effects (abstract and results) rests on treating the clinician-curated reference as a complete gold standard. Because the reference is constructed from informed consent documents, which are required to emphasize common, actionable risks rather than exhaustively enumerate every literature-reported toxicity (e.g., very rare cardiac, pulmonary, or secondary-malignancy effects with long latency), any observed under-recall may be partly an artifact of reference incompleteness rather than a pure model limitation. A concrete validation step against broader oncology literature or additional expert review is needed to separate these effects.

    Authors: We agree that informed consent documents prioritize common, actionable risks and may not exhaustively list every rare or long-latency toxicity reported in the broader literature. Our reference was deliberately constructed from these documents to mirror the information actually conveyed in clinical informed-consent settings, which is the deployment context we target. To address the concern, we will revise the manuscript to explicitly acknowledge this scope limitation and add a supplementary analysis that cross-references the clinician-curated list against toxicities extracted from recent comprehensive oncology reviews. This will help readers distinguish reference scope from model behavior. revision: partial

  2. Referee: [Evaluation setup and Results sections] The reported improvements from grounding and the precision-recall trade-offs under different prompting regimes lack accompanying details on inter-rater reliability for the reference list, the exact prompting templates, and statistical tests for the observed differences. Without these, the strength of evidence for the central claims about model behavior and the benefits of grounding cannot be fully assessed.

    Authors: We concur that these details are necessary to fully evaluate the evidence. In the revised version we will report inter-rater reliability metrics (e.g., percentage agreement and Cohen’s kappa) from the multi-oncologist reference construction process, include the complete prompting templates as an appendix, and add statistical tests (paired t-tests or bootstrap confidence intervals) for the reported differences in precision, recall, and grounding improvements. revision: yes
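
For readers unfamiliar with the statistics named in this response, the following is a minimal standard-library sketch of Cohen's kappa for two raters and a paired bootstrap confidence interval for a mean difference. It is not the authors' analysis: the rater labels and per-profile recall values are invented, and the percentile bootstrap with 2000 resamples is an assumed configuration.

```python
import random


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def paired_bootstrap_ci(baseline, grounded, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    diffs = [g - b for b, g in zip(baseline, grounded)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical include/exclude votes (1/0) from two oncologists on eight items.
votes_a = [1, 1, 0, 1, 0, 0, 1, 1]
votes_b = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(votes_a, votes_b)

# Hypothetical per-profile recall without and with grounding.
recall_plain = [0.40, 0.50, 0.45, 0.60, 0.55]
recall_grounded = [0.70, 0.80, 0.75, 0.85, 0.90]
ci_low, ci_high = paired_bootstrap_ci(recall_plain, recall_grounded)
print(round(kappa, 3), ci_low, ci_high)
```

Resampling differences per profile, rather than the two conditions independently, respects the paired design: each profile serves as its own control. An interval that excludes zero would support a real grounding effect under these assumptions.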

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation against independent external reference

The paper conducts an empirical stress-test of LLMs on side-effect list generation for breast cancer radiotherapy, comparing outputs to a clinician-curated reference list built from informed consent documents at two academic centers by a team of more than seven breast radiation oncologists. This reference is presented as an external gold standard mapping dose-fractionation, fields, and locations to toxicities by frequency and onset. No equations, derivations, fitted parameters, or self-citations are invoked to define the reference, the metrics (precision/recall), or the reported findings in a way that reduces them to the authors' own inputs by construction. The evaluation setup is self-contained against this independently created benchmark, with no self-definitional loops, renamed known results, or load-bearing self-citations. Claims of under-recall and grounding benefits rest on direct comparison to the external list rather than any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the assumption that the clinician reference is exhaustive and that the 21 constructed profiles adequately sample relevant clinical variation; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The clinician-curated reference accurately and comprehensively maps radiation dose-fractionation, fields, and locations to associated toxicities broken down by frequency and temporal onset.
    This reference is used as the sole benchmark for precision, recall, and under-recall claims.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
    Relation between the paper passage and the cited Recognition theorem: unclear

    Paper passage: "We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care... compare LLM outputs to a clinician-curated reference derived from informed consent documents... maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset."

  • IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    Paper passage: "grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


    Acute and short-term toxic effects of conventionally fractionated vs hypofractionated whole-breast irradiation: a randomized clinical trial , author=. JAMA oncology , volume=