Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Jung-Oh Lee, Hyuk Gi Hong, Eun Woo Doe, Hangyul Yoon, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi

arxiv: 2505.21190 · v2 · submitted 2025-05-27 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-19 13:07 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords radiology reports · chest X-ray · benchmark dataset · structured reporting · sequential interpretation · evaluation metric · temporal consistency · disease progression

The pith

LUNGUAGE supplies the first dataset and metric to evaluate AI-generated chest X-ray reports for both single-study detail and changes across a patient's timeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark of 1,473 expert-reviewed chest X-ray reports, 186 of which include annotations across multiple studies to track disease progression and time intervals. It builds a two-stage process that converts free-text reports into structured records matching a fixed schema. It also defines LUNGUAGESCORE, which scores outputs by comparing entities, relations, and attributes while checking consistency over time. Existing tools only handle one report at a time with broad measures that overlook precise clinical meaning and sequence. If the benchmark works, developers can test and improve models that produce reports doctors can trust for ongoing patient care.

Core claim

The authors present LUNGUAGE as the first benchmark dataset for structured radiology report generation that handles both single-report assessment and longitudinal patient-level evaluation across multiple studies. It contains 1,473 expert-annotated chest X-ray reports plus 186 longitudinal cases that capture disease progression and inter-study intervals. A two-stage structuring framework converts generated text into fine-grained schema-aligned outputs, and LUNGUAGESCORE provides an interpretable measure that compares structured reports at the entity, relation, and attribute levels while modeling temporal consistency. Empirical results show the metric supports structured report evaluation.

What carries the argument

The two-stage structuring framework that turns generated reports into fine-grained, schema-aligned structured reports, paired with the LUNGUAGESCORE metric that compares outputs at entity, relation, and attribute levels while modeling temporal consistency across patient timelines.
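
To make the entity-, relation-, and attribute-level comparison concrete, here is a minimal Python sketch. It is not the paper's LUNGUAGESCORE: the `Finding` record, its three fields, and the exact-match F1 scoring are illustrative assumptions, and the temporal-consistency component is left out entirely.

```python
from dataclasses import dataclass


# Hypothetical structured record; the paper's actual schema is not reproduced here.
@dataclass(frozen=True)
class Finding:
    entity: str      # e.g. "pleural effusion"
    relation: str    # e.g. "location"
    attribute: str   # e.g. "left"


def f1(predicted: set, reference: set) -> float:
    """Exact-match F1 between two sets; 1.0 when both are empty."""
    if not predicted and not reference:
        return 1.0
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def level_scores(predicted: list, reference: list) -> dict:
    """Score a report at three granularities so errors can be localized."""
    pred, ref = set(predicted), set(reference)
    return {
        "entity": f1({p.entity for p in pred}, {r.entity for r in ref}),
        "relation": f1({(p.entity, p.relation) for p in pred},
                       {(r.entity, r.relation) for r in ref}),
        "attribute": f1(pred, ref),
    }


# Toy example: the generated report gets the laterality wrong.
reference = [Finding("pleural effusion", "location", "left"),
             Finding("pleural effusion", "severity", "small")]
predicted = [Finding("pleural effusion", "location", "right"),
             Finding("pleural effusion", "severity", "small")]
scores = level_scores(predicted, reference)
```

In this toy case the entity and relation scores are 1.0 while the attribute score is 0.5, which localizes the error to the laterality value rather than penalizing the whole report.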

If this is right

AI report generators can be tested for their ability to maintain consistency across a patient's sequence of studies.
Errors can be isolated to specific entities, relations, or attributes rather than judged only at the whole-report level.
Development of models for sequential radiology interpretation gains a shared reference point with explicit temporal checks.
Longitudinal assessment becomes feasible, allowing evaluation of how well reports reflect disease changes between visits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinics could adopt the structuring framework to standardize how AI drafts are reviewed before they reach physicians.
The benchmark might extend naturally to other imaging types once similar expert-annotated longitudinal sets exist.
Training data for medical language models could incorporate LUNGUAGESCORE signals to reward temporal accuracy during fine-tuning.
Hospitals might track quality metrics over time by running the same structured comparison on both human and machine reports.

Load-bearing premise

Expert annotations in the dataset accurately and consistently capture fine-grained clinical semantics and temporal dependencies without substantial inter-annotator disagreement or selection bias in the 1,473 reports.

What would settle it

A measurement showing low inter-expert agreement on the annotations or weak correlation between LUNGUAGESCORE values and independent radiologist ratings of report clinical usefulness.
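
One way to run the correlation half of that check is a rank correlation between metric values and radiologist ratings. The sketch below implements Spearman's rho in plain Python; the choice of Spearman, the function names, and the sample numbers are assumptions made for illustration, not the paper's evaluation protocol or results.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up numbers for illustration only; not results from the paper.
metric_scores = [0.91, 0.40, 0.75, 0.62]
radiologist_ratings = [5, 1, 4, 3]
rho = spearman(metric_scores, radiologist_ratings)
```

A rho near zero on real ratings would be the "weak correlation" outcome described above; the toy inputs here are perfectly concordant by construction.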

Original abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LUNGUAGE gives a concrete new dataset and metric for longitudinal structured chest X-ray reports, but the missing inter-annotator agreement numbers leave the ground truth shaky.

The main takeaway is that this paper ships a benchmark of 1,473 expert-reviewed chest X-ray reports, with 186 longitudinal cases, a two-stage structuring pipeline, and LUNGUAGESCORE for entity-relation-attribute matching across time. That moves the field past single-report coarse metrics toward patient-level sequential evaluation, which is a real gap in medical report generation work. The code release is also straightforward and helpful for anyone who wants to test it directly. The structuring framework looks practical on paper for turning free text into schema-aligned output, and the metric is defined independently rather than fitted to existing models.

On the soft side, the stress-test point holds: there are no reported inter-annotator agreement stats, no kappa numbers on entities or temporal intervals, and no details on annotation guidelines or disagreement handling. For a benchmark that rests on those expert annotations capturing fine-grained semantics and progression, that absence makes it harder to judge how stable the ground truth actually is. The abstract claims empirical results that the metric supports structured evaluation, but without seeing the numbers or error analysis in the full text, the strength of that demonstration stays unclear.

This is for researchers working on clinical NLP and radiology report generation who need better evaluation tools beyond BLEU or single-study checks. A reader building or comparing models in that area would get direct value from trying the dataset and metric. It deserves a serious referee because the core contribution targets a documented limitation with a usable artifact, even though the annotation validation section needs tightening. I would send it out for review and ask the referees to focus on the reliability of the expert labels.

Referee Report

1 major / 2 minor

Summary. The paper introduces LUNGUAGE, a benchmark of 1,473 expert-reviewed chest X-ray reports (including 186 longitudinal cases with temporal annotations), a two-stage structuring framework that converts free-text reports into schema-aligned structured outputs, and LUNGUAGESCORE, an interpretable metric that compares structured reports at the entity, relation, and attribute levels while modeling temporal consistency across patient timelines. It positions these as the first resources for structured and sequential radiology report evaluation and claims empirical results demonstrate the metric's effectiveness.

Significance. If the annotation quality and empirical results hold, the work would be significant for the field by providing the first dedicated benchmark and metric for fine-grained, longitudinal evaluation of radiology reports, moving beyond coarse single-report metrics to better capture clinical semantics and disease progression.

major comments (1)

[Dataset construction] Dataset construction (or equivalent section describing the 1,473 reports and 186 longitudinal cases): The manuscript states that all reports were reviewed by experts but reports no inter-annotator agreement statistics (e.g., Cohen’s kappa on entities, relations, attributes, or temporal intervals), no annotation guidelines, and no analysis of disagreement resolution or selection bias. This directly undermines the central claim that the benchmark accurately captures fine-grained clinical semantics and temporal dependencies, as LUNGUAGESCORE’s comparisons rest on the stability of this ground truth.
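
For reference, the agreement statistic named in this comment can be computed as follows. This is a minimal sketch of the standard two-annotator Cohen's kappa; the label set ("improved", "stable", "worsened") and the sample annotations are hypothetical and are not drawn from LUNGUAGE.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only; not data from LUNGUAGE.
annotator_1 = ["improved", "worsened", "stable", "stable", "improved", "stable"]
annotator_2 = ["improved", "stable", "stable", "stable", "improved", "worsened"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 4 of 6 items and chance agreement is 14/36, giving kappa = 5/11, roughly 0.45. Reporting such a value separately for entities, relations, attributes, and temporal intervals would show which layer of the annotation is least stable.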

minor comments (2)

[Abstract] The abstract references empirical results demonstrating LUNGUAGESCORE’s effectiveness but supplies no quantitative numbers, baseline comparisons, or error analysis; these should be summarized even at a high level.
[Structuring framework] Clarify the precise schema (entities, relations, attributes, and temporal interval representation) used in the two-stage structuring framework and whether it was derived from existing radiology ontologies.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the transparency of our benchmark. We address the major comment on dataset construction point by point below.

Referee: [Dataset construction] Dataset construction (or equivalent section describing the 1,473 reports and 186 longitudinal cases): The manuscript states that all reports were reviewed by experts but reports no inter-annotator agreement statistics (e.g., Cohen’s kappa on entities, relations, attributes, or temporal intervals), no annotation guidelines, and no analysis of disagreement resolution or selection bias. This directly undermines the central claim that the benchmark accurately captures fine-grained clinical semantics and temporal dependencies, as LUNGUAGESCORE’s comparisons rest on the stability of this ground truth.

Authors: We agree that greater detail on the annotation process is necessary to support claims about ground-truth stability. In the revised manuscript, we will expand the Dataset Construction section to include the full annotation guidelines provided to experts, a description of the multi-expert review workflow (including how disagreements on entities, relations, attributes, and temporal intervals were discussed and resolved by consensus), and an analysis of potential selection biases in report sampling from the source corpus. Regarding inter-annotator agreement, the annotations were led by a primary board-certified radiologist with secondary review by additional experts; however, formal IAA metrics such as Cohen’s kappa were not computed during the original process. We will explicitly acknowledge this as a limitation in the revised text and note that future releases will incorporate such statistics. These changes directly address the concern without overstating the current evidence.

revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces a newly annotated benchmark dataset (1,473 expert-reviewed reports, including 186 longitudinal cases), a two-stage structuring framework, and the LUNGUAGESCORE metric as independent contributions. No equations, predictions, or first-principles results are presented that reduce to fitted parameters or prior inputs by construction. The metric is defined directly on structured entity/relation/attribute comparisons with temporal modeling, and the dataset is created via expert annotation rather than derived from any self-referential fit. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained as an empirical benchmark introduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The main additions are the new dataset and metric. The work relies on the domain assumption of reliable expert annotations and on standard NLP structuring techniques, with no new free parameters or invented physical entities.

axioms (1)

domain assumption: Expert-reviewed annotations provide reliable ground truth for clinical entities, relations, attributes, and temporal consistency.
Benchmark construction and LUNGUAGESCORE evaluation depend on the quality and consistency of the 1,473 expert annotations described in the abstract.

invented entities (2)

LUNGUAGE benchmark dataset · no independent evidence
purpose: Provide structured and longitudinal annotations for chest X-ray reports
Newly created collection of 1,473 reports with expert review; no independent evidence outside this work.
LUNGUAGESCORE metric · no independent evidence
purpose: Enable entity-, relation-, and attribute-level comparison with temporal consistency modeling
Newly proposed interpretable metric; no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5780 in / 1401 out tokens · 58446 ms · 2026-05-19T13:07:44.081230+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation... LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "ENTITYGROUPS identify observations that refer to the same underlying clinical finding... TEMPORALGROUPS divide each ENTITYGROUP into distinct diagnostic episodes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.