Building a Benchmark Dataset and Classifiers for Sentence-Level Findings in AP Chest X-rays

Alexandros Karargyris; Anup Pillai; Hassan M. Ahmad; Joy T. Wu; Karthik Sheshadri; Ken C. L. Wong; Mehdi Moradi; Nadeem Ansari; Satyananda Kashyap; Tanveer Syeda-Mahmood

arxiv: 1906.09336 · v1 · pith:P3KY4WELnew · submitted 2019-06-21 · 💻 cs.CV

Tanveer Syeda-Mahmood, Hassan M. Ahmad, Nadeem Ansari, Yaniv Gur, Satyananda Kashyap, Alexandros Karargyris, Mehdi Moradi, Anup Pillai, Karthik Sheshadri, Weiting Wang, Ken C. L. Wong, Joy T. Wu

Pith reviewed 2026-05-25 18:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: chest X-ray · benchmark dataset · sentence-level findings · AP view · deep learning classifiers · crowdsourcing · medical imaging · device placement

The pith

A new benchmark dataset supplies 73 sentence-level descriptors for findings in AP chest X-rays, generated via crowdsourced clinician annotations and learnable by deep learning classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark database of 73 rich sentence-level descriptors that capture findings visible in anterior-posterior chest X-rays, including device placements such as lines and tubes. It obtains these labels through a semi-automated process that aggregates clinician annotations collected via crowdsourcing. Deep learning classifiers are then trained on the resulting labels to demonstrate that higher-granularity descriptions are learnable. This approach targets the gap left by prior datasets whose labels are too coarse for precise interpretation of the most common diagnostic exam. The work therefore supplies both data and models that support more detailed automatic reading of chest radiographs.

Core claim

We present a new chest X-ray benchmark database of 73 rich sentence-level descriptors of findings seen in AP chest X-rays. We describe our method of obtaining these findings through a semi-automated ground truth generation process from crowdsourcing of clinician annotations. We also present results of building classifiers for these findings that show that such higher granularity labels can also be learned through the framework of deep learning classifiers.

What carries the argument

The benchmark database of 73 sentence-level descriptors produced by semi-automated crowdsourcing of clinician annotations, which serves as training targets for deep learning classifiers.

If this is right

Classifiers trained on the 73 descriptors can detect device placements and other AP-specific findings not covered by prior coarse label sets.
The same deep learning framework used for the new labels can be applied to any future expansion of the sentence-level descriptor set.
Hospitals and emergency rooms gain access to training data that supports more granular automatic interpretation of the most common diagnostic exam.
The semi-automated annotation pipeline can be reused to generate labels for additional chest X-ray views or related imaging modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Combining this dataset with existing NIH chest X-ray collections could produce models that handle both PA and AP views within a single system.
The crowdsourcing method may lower the cost of creating detailed labels for other medical imaging tasks where expert time is limited.
Sentence-level descriptors could serve as an intermediate representation that improves downstream report generation from image classifiers.

Load-bearing premise

The semi-automated ground truth generation process from crowdsourcing of clinician annotations produces sufficiently accurate and unbiased labels that can serve as reliable training targets for the classifiers.

What would settle it

A held-out test set of AP chest X-rays where labels produced by the crowdsourcing pipeline disagree with independent expert radiologist review at a rate high enough to degrade classifier performance below usable levels.
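The check described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation protocol: the function name `per_label_disagreement` and the input format (a mapping from image ID to a set of label strings) are assumptions made for the example.

```python
from collections import defaultdict


def per_label_disagreement(pipeline_labels, expert_labels):
    """Return {label: fraction of shared images where the two sources disagree}.

    Both arguments map image_id -> set of label strings (hypothetical format).
    Only images present in both mappings are compared.
    """
    shared = pipeline_labels.keys() & expert_labels.keys()
    if not shared:
        return {}

    all_labels = set()
    disagreements = defaultdict(int)
    for image_id in shared:
        pipeline_set = pipeline_labels[image_id]
        expert_set = expert_labels[image_id]
        all_labels |= pipeline_set | expert_set
        # Symmetric difference: labels asserted by exactly one of the two sources.
        for label in pipeline_set ^ expert_set:
            disagreements[label] += 1

    return {label: disagreements[label] / len(shared) for label in sorted(all_labels)}
```

A per-label rate is more informative here than a single aggregate number, because rare device-placement labels could be unreliable even when the overall disagreement rate looks low.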

Original abstract

Chest X-rays are the most common diagnostic exams in emergency rooms and hospitals. There has been a surge of work on automatic interpretation of chest X-rays using deep learning approaches after the availability of large open source chest X-ray dataset from NIH. However, the labels are not sufficiently rich and descriptive for training classification tools. Further, it does not adequately address the findings seen in Chest X-rays taken in anterior-posterior (AP) view which also depict the placement of devices such as central vascular lines and tubes. In this paper, we present a new chest X-ray benchmark database of 73 rich sentence-level descriptors of findings seen in AP chest X-rays. We describe our method of obtaining these findings through a semi-automated ground truth generation process from crowdsourcing of clinician annotations. We also present results of building classifiers for these findings that show that such higher granularity labels can also be learned through the framework of deep learning classifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper creates a new 73-label sentence-level dataset for AP chest X-rays via clinician crowdsourcing and trains DL classifiers on it, but the abstract supplies no performance numbers or label validation metrics.

The core contribution is a benchmark of 73 sentence-level descriptors for findings in AP chest X-rays, including device placements that the NIH dataset largely skips. The authors built it through a semi-automated crowdsourcing pipeline with clinicians and then trained deep learning classifiers to show these finer labels are learnable. That addresses a real limitation in existing public chest X-ray resources for emergency and hospital use cases where AP views dominate.

Referee Report

1 major / 1 minor

Summary. The manuscript claims to introduce a new benchmark dataset of 73 sentence-level descriptors for findings in AP chest X-rays, constructed via a semi-automated ground truth process based on crowdsourced clinician annotations, and reports that deep learning classifiers can be trained to learn these higher-granularity labels.

Significance. If the labels are shown to be reliable, the dataset would address gaps in existing resources such as the NIH chest X-ray collection by supplying richer, sentence-level annotations that capture device placements and other AP-specific findings. Demonstrating learnability of these labels via standard deep learning frameworks would support expanded use of detailed annotations in medical imaging.

major comments (1)

[ground truth generation section] The section describing the semi-automated ground truth generation process from crowdsourcing of clinician annotations supplies no inter-annotator agreement statistics, expert adjudication rates, or error analysis on the resulting 73 findings. This information is required to substantiate that the labels meet the reliability threshold needed to serve as training targets.

minor comments (1)

[Abstract] The abstract states that classifier results are presented yet supplies no performance numbers, validation splits, or error analysis, making it difficult to evaluate the claim that the 73 labels are learnable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the single major comment below regarding the ground truth generation process.

Referee: [ground truth generation section] The section describing the semi-automated ground truth generation process from crowdsourcing of clinician annotations supplies no inter-annotator agreement statistics, expert adjudication rates, or error analysis on the resulting 73 findings. This information is required to substantiate that the labels meet the reliability threshold needed to serve as training targets.

Authors: We agree that inter-annotator agreement statistics, adjudication rates, and error analysis would help substantiate label reliability. The crowdsourcing was performed via a platform that aggregated clinician annotations without retaining per-annotator identifiers, precluding standard IAA calculations such as Fleiss' kappa. The semi-automated process included automated filtering followed by limited manual review, but detailed per-finding adjudication logs were not retained. In revision we will expand the relevant section to describe all quality-control steps that were applied, report any available adjudication information, and add an explicit limitations paragraph discussing the absence of IAA metrics. This addresses the concern to the extent the original data permit. revision: partial
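For readers unfamiliar with the statistic named in this exchange, the following is a minimal sketch of Fleiss' kappa using only the standard library. It takes, for each item, the number of raters who chose each category. Whether counts of this kind can be recovered from the authors' platform is not stated, and the function name and input layout are illustrative assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    Returns NaN when all ratings fall in one category (kappa is undefined).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)
```

For a multi-label dataset such as this one, a common choice is to compute the statistic separately per finding, treating each as a present/absent decision; that framing is an assumption here, not something the paper reports.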

Circularity Check

0 steps flagged

No circularity: new dataset constructed externally; classifiers are trained outcomes

The paper constructs a new benchmark of 73 sentence-level findings via a semi-automated crowdsourcing process from clinician annotations and then trains deep learning classifiers on those labels. No equations, fitted parameters, or self-citations are present that reduce any claim to prior outputs by construction. The dataset creation step is described as an external data-generation procedure rather than a self-referential definition, and the classifier results are empirical performance numbers on the newly created labels. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the quality of crowdsourced labels and the assumption that semi-automated aggregation yields usable ground truth; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Crowdsourced annotations from multiple clinicians can be aggregated into reliable ground truth labels via semi-automated processing.
Invoked to justify the creation of the 73-descriptor dataset from clinician input.
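
To make the assumption concrete, here is one simple way multiple clinicians' annotations could be aggregated: a majority vote per image. The paper's actual semi-automated procedure is not described on this page, so this is a stand-in technique; `aggregate_by_majority` and its `min_fraction` parameter are hypothetical names.

```python
from collections import Counter


def aggregate_by_majority(annotations, min_fraction=0.5):
    """Aggregate per-clinician label sets into one consensus set per image.

    annotations maps image_id -> list of label sets, one per clinician
    (hypothetical format). A label is kept when strictly more than
    min_fraction of that image's annotators selected it.
    """
    consensus = {}
    for image_id, label_sets in annotations.items():
        # Count each label at most once per annotator.
        votes = Counter(label for labels in label_sets for label in set(labels))
        needed = min_fraction * len(label_sets)
        consensus[image_id] = {label for label, n in votes.items() if n > needed}
    return consensus
```

Majority voting treats all annotators as equally reliable, which is exactly the kind of premise the ledger entry above flags as unverified.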

pith-pipeline@v0.9.0 · 5746 in / 1163 out tokens · 26658 ms · 2026-05-25T18:33:08.206716+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

[1]

INTRODUCTION. Chest X-rays are the most common imaging exams being conducted in emergency rooms. Recently, a number of researchers have begun automated interpretation of chest X-rays, focusing on posterior-anterior (PA) views and limited number of labels of high granularity such as opacity or consolidation [1, 2, 3]. If machines are to assist radiologist...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

LABEL GENERATION FOR AP CHEST X-RAYS. The NIH dataset consists of 112,121 images, with 44,812 images in AP view from 9061 patients. To generate the labeled dataset for AP chest X-rays, we sampled the dataset so that at least one AP chest X-ray image was selected from all patients to obtain a total of 16910 unique images for re-annotation. To rapidly an...

work page
[3]

In addition, tubes and lines have a small footprint in the overall image due to their thin tubular structures

CLASSIFICATION OF CHEST X-RAY FINDINGS. From the names of the labels available from AP chest X-ray reports, we can observe that the labels such as "left picc line with tip at the superior vena cava" and "left picc with tip at the cavoatrial junction" depict very similar appearance of these lines as shown in Figure 2 with the main difference being the p...

work page
[4]

A total of 7942 images were retained corresponding to the 73 labels that had support of at least 50 images in the collection

RESULTS. The experiments were performed with the newly labeled NIH dataset of 73 findings. A total of 7942 images were retained corresponding to the 73 labels that had support of at least 50 images in the collection. A total of 6209 images were used for training, and 1733 were retained for validation and testing. First, we generated a baseline result using ...

work page
[5]

We describe our method of obtaining these findings through a semi-automated ground truth generation process from crowdsourcing of clinical annotations

CONCLUSION. In this paper, we present a new chest X-ray benchmark database of 73 sentence-level findings seen in AP chest X-rays. We describe our method of obtaining these findings through a semi-automated ground truth generation process from crowdsourcing of clinical annotations

work page
[6]

CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al., "CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning," arXiv preprint arXiv:1711.05225, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[7]

ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers, "ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases," in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 3462–3471

work page 2017
[8]

TextRay: Mining Clinical Reports to Gain a Broad Understanding of Chest X-rays

Jonathan Laserson, Christine Dan Lantsman, Michal Cohen-Sfady, Itamar Tamir, Eli Goz, Chen Brestel, Shir Bar, Maya Atar, and Eldad Elnekave, “Textray: Mining clinical reports to gain a broad understanding of chest x-rays,” arXiv preprint arXiv:1806.02121, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

RSNA pneumonia detection machine learning challenge now open

No Author, "RSNA pneumonia detection machine learning challenge now open," http://www.rsna.org/News.aspx?id=24992, 2018

work page 2018
[10]

Learning the correlation between images and disease labels using ambiguous learning

Tanveer Syeda-Mahmood, Ritwik Kumar, and Colin Compas, "Learning the correlation between images and disease labels using ambiguous learning," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 185–193

work page 2015
[11]

Densely connected convolutional networks

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger, "Densely connected convolutional networks," in CVPR, 2017, vol. 1, p. 3

work page 2017
[12]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255

work page 2009
[13]

RUSBoost: A hybrid approach to alleviating class imbalance

Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010

work page 2010
[14]

LogitBoost with errors-in-variables

Joseph Sexton and Petter Laake, "LogitBoost with errors-in-variables," Computational Statistics & Data Analysis, vol. 52, no. 5, pp. 2549–2559, 2008

work page 2008
[15]

Model-shared subspace boosting for multi-label classification

Rong Yan, Jelena Tesic, and John R Smith, "Model-shared subspace boosting for multi-label classification," in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007, pp. 834–843

work page 2007
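
The label-generation and results excerpts listed under [2] and [4] describe two data-preparation steps: sampling AP images so that every patient is represented, and retaining only labels supported by at least 50 images. The sketch below illustrates both under stated assumptions. The function names and input formats are hypothetical. The sampler picks exactly one image per patient, the smallest sample that satisfies "at least one", whereas the excerpt reports 16910 images from 9061 patients, so the authors evidently kept more than one image for some patients.

```python
import random
from collections import Counter, defaultdict


def sample_one_per_patient(records, seed=0):
    """Pick one image per patient from (image_id, patient_id) pairs.

    Returns {patient_id: image_id}. The seed makes the choice reproducible.
    """
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)
    # Sort so the result does not depend on input order.
    return {pid: rng.choice(sorted(ids)) for pid, ids in sorted(by_patient.items())}


def filter_by_support(image_labels, min_support=50):
    """Keep labels that appear on at least min_support images.

    image_labels maps image_id -> set of labels. Returns the filtered
    mapping (images left with no labels are dropped) and the kept label set.
    """
    support = Counter(label for labels in image_labels.values() for label in labels)
    kept = {label for label, n in support.items() if n >= min_support}
    filtered = {img: labels & kept for img, labels in image_labels.items()}
    return {img: labels for img, labels in filtered.items() if labels}, kept
```

The split sizes quoted in the results excerpt are internally consistent: 6209 training images plus 1733 validation and test images gives the 7942 retained images.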
