Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

Aman Pathak; Aokun Chen; Cheng Peng; Hiren Mehta; Mengxian Lyu; Reema Solan; Sankalp Talankar; Yasir Khan; Yi Guo; Yonghui Wu; Ziyi Chen

arxiv: 2606.19852 · v2 · pith:CVW24JX5new · submitted 2026-06-18 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-26 17:47 UTC · model grok-4.3

keywords zero-shot learning · large language models · information extraction · pathology reports · lung cancer · agentic workflows · clinical narratives · cancer registry

The pith

Zero-shot agentic LLMs populate 13 CAP fields from lung reports at 0.893 Micro-F1 without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a zero-shot agentic workflow that applies open-source generative LLMs to extract structured data for 13 College of American Pathologists synoptic fields from lung resection pathology reports. It evaluates five such models against a supervised GatorTron NER-RE baseline using a registry-aligned framework, with the top model reaching Micro-F1 of 0.893 and recall of 0.949 while handling relations like pathologic stage. A sympathetic reader would care because manual extraction for cancer registries is labor-intensive and error-prone, and supervised pipelines demand expensive annotations plus risk cascading failures. The results position these LLMs as a potential low-cost alternative for information extraction from clinical narratives.

Core claim

By applying a zero-shot agentic workflow to five open-source LLMs, the study populates the 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieves a Micro-F1 of 0.893 and recall of 0.949, which is 0.067 below the supervised GatorTron baseline's 0.960, while accurately handling complex relations such as Pathologic Stage.
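The summary reports F1 and recall for the best model but not precision. Assuming the recall of 0.949 is micro-averaged over the same counts as the Micro-F1 of 0.893 (an assumption; the text here does not say so), precision can be backed out from the harmonic-mean definition of F1. A minimal sketch, with hypothetical function names:

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("inconsistent inputs: need F1 < 2 * recall")
    return f1 * recall / denominator


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Reported values for the best zero-shot model (rounded to 3 decimals).
    p = implied_precision(0.893, 0.949)
    print(f"implied precision: {p:.3f}")  # prints "implied precision: 0.843"
```

Under that assumption the implied precision is roughly 0.84, which would mean the shortfall against the baseline comes more from extra or incorrect values than from missed ones. Because the inputs are rounded, the third decimal should not be read too closely.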

What carries the argument

The zero-shot agentic workflow that decomposes extraction into prompt, plan, and extract steps using generative LLMs to fill the 13 CAP fields directly from narrative text.
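The decomposition can be sketched as three chained steps in plain Python. This is an illustration under stated assumptions, not the authors' LangGraph implementation: the `LLM` callable, the JSON exchange format, the prompts, the Planner's exact role, and the field names in the stub are all hypothetical. Only the Mapper's persona and section names come from the paper excerpts on this page.

```python
import json
from typing import Callable, Dict, List, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, completion out

SECTIONS = ["SPECIMEN", "TUMOR", "MARGINS", "LYMPH NODES", "PATHOLOGIC STAGING"]


def mapper(llm: LLM, report: str) -> Dict[str, str]:
    """Ask the model to segment the report into canonical sections (as JSON)."""
    prompt = (
        "You are a specialized medical report expert. Segment the report into the "
        f"sections {SECTIONS} and return a JSON object mapping each section name "
        "to its full, unaltered text.\n\n" + report
    )
    return json.loads(llm(prompt))  # a real model may need more defensive parsing


def planner(llm: LLM, sections: Dict[str, str], fields: List[str]) -> Dict[str, str]:
    """Assumed role: decide which section to consult for each target field."""
    prompt = (
        f"Plan the extraction. Available sections: {sorted(sections)}. "
        f"Target fields: {fields}. Return a JSON object mapping field to section."
    )
    return json.loads(llm(prompt))


def extractor(llm: LLM, sections: Dict[str, str],
              plan: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Extract one value per field from its planned section; None if unavailable."""
    record: Dict[str, Optional[str]] = {}
    for field, section in plan.items():
        text = sections.get(section, "")
        if not text:
            record[field] = None  # section absent, so skip the model call
            continue
        answer = llm(
            f"Extract '{field}' from the text below. Reply with the value only, "
            f"or NOT REPORTED.\n\n{text}"
        ).strip()
        record[field] = None if answer.upper() == "NOT REPORTED" else answer
    return record


def run_workflow(llm: LLM, report: str, fields: List[str]) -> Dict[str, Optional[str]]:
    sections = mapper(llm, report)
    plan = planner(llm, sections, fields)
    return extractor(llm, sections, plan)


def stub_llm(prompt: str) -> str:
    """Canned responses standing in for a real model, for illustration only."""
    if prompt.startswith("You are a specialized"):
        return json.dumps({"TUMOR": "Tumor size: 2.5 cm",
                           "PATHOLOGIC STAGING": "pT1c pN0"})
    if prompt.startswith("Plan the extraction"):
        return json.dumps({"Tumor Size": "TUMOR",
                           "Pathologic Stage": "PATHOLOGIC STAGING",
                           "Margin Status": "MARGINS"})
    if prompt.startswith("Extract 'Tumor Size'"):
        return "2.5 cm"
    if prompt.startswith("Extract 'Pathologic Stage'"):
        return "pT1c pN0"
    return "NOT REPORTED"
```

Calling `run_workflow(stub_llm, report_text, fields)` with the three illustrative fields returns a value for the two fields whose sections exist and `None` for the field whose section the Mapper did not return. Passing the model in as a callable keeps the control flow testable without any model server.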

If this is right

The workflow avoids the need for expensive manual annotation required by supervised NER-RE pipelines.
Cascading failures from missed upstream entities in traditional methods are reduced.
Open-source models can serve as a low-cost solution for extracting lung pathology information.
Complex relations such as pathologic stage can be extracted accurately without task-specific training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same workflow structure could be tested on pathology reports from other cancer sites to check transferability.
Integration with existing registry systems might reduce the time and cost of manual abstraction.
Performance on rarer or more complex fields could be measured separately to identify remaining gaps.

Load-bearing premise

The novel registry-aligned evaluation framework produces an unbiased, apples-to-apples comparison between the zero-shot LLM outputs and the supervised GatorTron baseline on the 13 CAP fields.

What would settle it

A manual gold-standard review on a new collection of lung resection reports where the best zero-shot model's Micro-F1 on the 13 fields falls substantially below 0.893.

Figures

Figures reproduced from arXiv: 2606.19852 by Aman Pathak, Aokun Chen, Cheng Peng, Hiren Mehta, Mengxian Lyu, Reema Solan, Sankalp Talankar, Yasir Khan, Yi Guo, Yonghui Wu, Ziyi Chen.

**Figure 1.** Mind-map of CAP-aligned clinical information extraction schema for lung resection pathology reports. The extraction schema was hierarchically modeled after the College of American Pathologists (CAP) synoptic reporting guidelines for lung carcinoma, encompassing core domains such as Specimen Details, Tumor Characteristics, Tumor Extent and Invasion, Margin Status, Lymph Nodes, and Pathologic Staging. Initia…

**Figure 2.** Architecture of the zero-shot LangGraph agentic workflow for CAP-aligned lung pathology extraction. 1. Mapper: The Mapper node receives the raw pathology report as input and segments it into semantically coherent sections such as SPECIMEN, TUMOR, MARGINS, LYMPH NODES, and PATHOLOGIC STAGING. This step anchors the extraction to the natural rhetorical structure of pathology reports and reduces the burden on …

Original abstract

Information extraction from pathology reports is essential for cancer staging, tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved Micro-F1of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs show great potential as a low-cost solution for extracting lung pathology information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Zero-shot agentic LLMs reach 0.893 Micro-F1 on 13 CAP lung fields versus 0.960 for the GatorTron baseline, but the comparison hinges on whether the registry-aligned eval treats both systems the same way.

The paper shows a concrete zero-shot workflow using prompt-plan-extract steps on five open-source LLMs to fill 13 College of American Pathologists synoptic fields from lung resection reports. The best model lands at 0.893 Micro-F1 with 0.949 recall, which is close enough to the supervised baseline to matter for registry work where annotation is expensive.

The new piece is the direct head-to-head on these exact fields with open-source models and the registry-aligned scoring. It does a clean job of showing that agentic prompting can handle relations like pathologic stage without any task-specific training data.

The main soft spot is the evaluation setup. The abstract and stress-test note both flag that we need to see exactly how the 13 fields are scored for the generative outputs versus the NER-RE pipeline. If string matching or normalization rules differ, the 0.067 gap cannot be read as a pure zero-shot versus supervised result. Dataset size, report count, and error breakdown are also missing from the summary, which makes it hard to judge how stable the numbers are.
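To make the comparability concern concrete, here is a minimal sketch of field-level micro-averaged scoring in which one normalization function is applied to gold values and to every system's predictions alike. The normalization rule (lowercasing and whitespace collapsing) and the convention that a wrong value counts as both a false positive and a false negative are assumptions for illustration; the paper's actual registry-aligned rules are not described on this page.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]  # field name -> value (None = not reported)


def normalize(value: Optional[str]) -> Optional[str]:
    """Single normalization rule shared by gold, LLM, and baseline outputs."""
    if value is None:
        return None
    cleaned = " ".join(value.lower().split())
    return cleaned or None


def count_outcomes(gold: Record, pred: Record) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for one report, comparing normalized field values."""
    tp = fp = fn = 0
    for field in set(gold) | set(pred):
        g = normalize(gold.get(field))
        p = normalize(pred.get(field))
        if p is not None and p == g:
            tp += 1
        else:
            if p is not None:
                fp += 1  # value produced but wrong or unsupported
            if g is not None:
                fn += 1  # gold value not recovered
    return tp, fp, fn


def micro_prf(golds: List[Record], preds: List[Record]) -> Tuple[float, float, float]:
    """Pool counts over all reports and fields, then compute P, R, F1."""
    tp = fp = fn = 0
    for gold, pred in zip(golds, preds):
        a, b, c = count_outcomes(gold, pred)
        tp, fp, fn = tp + a, fp + b, fn + c
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

If the generative outputs were scored with a looser `normalize` than the NER-RE outputs, the two Micro-F1 values would no longer measure the same thing, which is why running both systems through the same function matters.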

This is for clinical NLP groups and cancer registry teams who want low-annotation options. A reader already working on pathology extraction will get usable workflow ideas and a sense of current open-source performance.

It deserves peer review. The application is timely and the numbers are worth checking in detail, even if the methods section will need expansion on the evaluation rules.

Referee Report

3 major / 1 minor

Summary. The paper claims to develop a zero-shot agentic LLM workflow that populates 13 College of American Pathologists (CAP) synoptic fields from lung resection pathology reports. It evaluates five open-source generative LLMs against a supervised GatorTron NER-RE baseline using a novel registry-aligned evaluation framework and reports Micro-F1 scores of 0.960 (baseline) versus 0.893 (best LLM, GPT-OSS-20B, with recall 0.949). It concludes that such models show potential as a low-cost alternative for lung pathology information extraction.

Significance. If the evaluation framework produces a fair, apples-to-apples comparison, the work would demonstrate that zero-shot agentic LLMs can achieve competitive performance on structured clinical extraction tasks without task-specific training or annotation, offering a practical low-cost alternative to supervised pipelines for cancer registry population and staging.

major comments (3)

[Abstract] The central performance comparison (Micro-F1 0.893 vs 0.960) is presented without any dataset size, number of reports, definitions of the 13 CAP fields, or description of the agentic workflow steps (planning/extraction), rendering the claim unverifiable from the provided text and blocking assessment of reproducibility.
[Evaluation Framework] (implied in methods/results) The registry-aligned framework is load-bearing for attributing the performance gap to the zero-shot approach, yet no explicit description is given of how LLM-generated outputs are parsed, string-matched, or semantically aligned versus the entity/relation-level errors of the NER-RE baseline; if matching rules differ (e.g., allowing multi-step correction or looser normalization for generative outputs), the scores are not directly comparable.
[Results] No error analysis or breakdown by field (especially for complex relations like Pathologic Stage) is supplied, which is required to substantiate the claim that the LLM is "accurately extracting complex relations" without task-specific training.

minor comments (1)

[Abstract] 'Micro-F1of' is missing a space between 'Micro-F1' and 'of'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions we will make to improve the manuscript.

Referee: [Abstract] The central performance comparison (Micro-F1 0.893 vs 0.960) is presented without any dataset size, number of reports, definitions of the 13 CAP fields, or description of the agentic workflow steps (planning/extraction), rendering the claim unverifiable from the provided text and blocking assessment of reproducibility.

Authors: We agree that the abstract should be more self-contained to support verifiability. In the revised manuscript we will expand the abstract to report the number of lung resection pathology reports in the dataset, provide a brief enumeration or definition of the 13 CAP synoptic fields, and include a concise description of the agentic workflow steps (prompt, plan, extract).
revision: yes

Referee: [Evaluation Framework] (implied in methods/results) The registry-aligned framework is load-bearing for attributing the performance gap to the zero-shot approach, yet no explicit description is given of how LLM-generated outputs are parsed, string-matched, or semantically aligned versus the entity/relation-level errors of the NER-RE baseline; if matching rules differ (e.g., allowing multi-step correction or looser normalization for generative outputs), the scores are not directly comparable.

Authors: The Methods section introduces the registry-aligned evaluation framework, but we acknowledge that the parsing, string-matching, and semantic alignment steps for LLM outputs require more explicit detail to demonstrate comparability with the NER-RE baseline. We will revise the Methods to add a dedicated subsection describing the exact procedures for parsing generative outputs, normalization rules, and matching criteria, and we will explicitly state that identical alignment rules are applied to both the LLM and baseline outputs.
revision: yes

Referee: [Results] No error analysis or breakdown by field (especially for complex relations like Pathologic Stage) is supplied, which is required to substantiate the claim that the LLM is "accurately extracting complex relations" without task-specific training.

Authors: We agree that an error analysis and field-level breakdown would strengthen the results section and better support the claim regarding complex relations. In the revised manuscript we will add an error analysis subsection that includes per-field Micro-F1 scores and qualitative discussion of errors, with focused attention on Pathologic Stage and other complex relations.
revision: yes
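A per-field breakdown of the kind requested could be tallied along these lines. This is a sketch under assumptions: the four error categories, the case-insensitive comparison, and the example field names are illustrative choices, not the taxonomy the authors say they will use.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]  # field name -> value (None = not reported)


def classify(gold: Optional[str], pred: Optional[str]) -> str:
    """Label one gold/prediction pair with an outcome category."""
    if gold is None and pred is None:
        return "true_absent"
    if gold is None:
        return "spurious"      # system produced a value the gold lacks
    if pred is None:
        return "missed"        # gold value not extracted
    same = gold.strip().lower() == pred.strip().lower()
    return "correct" if same else "wrong_value"


def error_breakdown(golds: List[Record], preds: List[Record]) -> Dict[str, Dict[str, int]]:
    """Count outcome categories separately for every field."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for gold, pred in zip(golds, preds):
        for field in set(gold) | set(pred):
            table[field][classify(gold.get(field), pred.get(field))] += 1
    return {field: dict(counts) for field, counts in table.items()}
```

Separating `wrong_value` from `missed` for a field such as Pathologic Stage would show whether errors come from mislinking stage components or from failing to find them at all, which a single aggregate score cannot distinguish.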

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical study reporting Micro-F1 scores from zero-shot LLM agentic workflows on 13 CAP fields, compared directly to an external supervised GatorTron NER-RE baseline via a described registry-aligned evaluation. No equations, derivations, fitted parameters, or self-referential definitions appear in the chain; the performance delta is measured against an independent model rather than constructed from the LLM outputs themselves. The novel evaluation framework is presented as a methodological choice without reducing to prior self-citations or ansatzes that would force the result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pathology reports contain the 13 CAP fields in extractable narrative form and that the registry-aligned evaluation is unbiased. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Lung resection pathology reports contain the 13 College of American Pathologists synoptic fields in narrative text that can be extracted by LLMs.
Required for the extraction task to be well-defined.

Reference graph

Works this paper leans on

2 extracted references

[1]

specialized medical report expert

In the Mapper node, the model is instructed to behave as a “specialized medical report expert” and to segment each lung tumor pathology report into canonical clinical sections (SPECIMEN, TUMOR, MARGINS, LYMPH NODES, PATHOLOGIC STAGE), returning the full, unaltered section text without summarization or omission. 2. The Planner node operated as an “oncology...

2009
[2]

Clinical Relation Extraction Using Transformer-based Models

Yang X, Yu Z, Guo Y, Bian J, Wu Y. Clinical Relation Extraction Using Transformer-based Models. arXiv [csCL]. Published online July 19, 2021. http://arxiv.org/abs/2107.08957 6. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv [csCL]. Published online October 10, 2018. http://ar...

arXiv 2021
