arxiv: 2606.22144 · v1 · pith:JSSJK5H7new · submitted 2026-06-20 · 💻 cs.CV · cs.AI

SAGE: An Expert-Annotated South Asian GI Endoscopy Dataset for Multimodal Learning and Hallucination Analysis

Pith reviewed 2026-06-26 12:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords South Asian GI endoscopy · population bias · multimodal AI · image captioning · visual question answering · hallucination detection · medical imaging dataset · AI fairness

The pith

Existing GI endoscopy classifiers show an average 58 percent performance drop when evaluated on South Asian images instead of European data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAGE, a dataset of 1,300 expert-annotated South Asian GI endoscopy images with captions, labels, and question-answer pairs. It benchmarks multi-class classifiers and large multimodal models (LMMs) on this data against European datasets. Task-specific models show an average 58% performance drop, while LMMs show reduced GREEN scores, reported as 0.308 for anatomical landmarks and 0.410 for abnormalities. The authors take this as evidence of population bias in current GI imaging AI. The dataset is intended to support more inclusive models for regions with a high GI cancer burden.

Core claim

We introduce the SAGE dataset consisting of 1,300 South Asian GI endoscopy images, their captions with hallucination tags, 18 labels, and 14,726 question-answer pairs. Benchmarking reveals that multi-class classification models suffer an average 58% performance drop on the South Asian dataset, and contemporary LMMs show substantial drops in GREEN scores for anatomical landmark detection (0.308) and abnormality detection (0.410).
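The claim above does not say whether the 58% figure is an absolute or a relative drop, so the following is only a sketch of the two readings, not the paper's evaluation code. The function names and all scores in the usage note are hypothetical.

```python
def absolute_drop(source_score: float, target_score: float) -> float:
    """Difference in score between the source (e.g. European) and target test sets."""
    return source_score - target_score


def relative_drop(source_score: float, target_score: float) -> float:
    """Drop expressed as a fraction of the source score."""
    if source_score <= 0:
        raise ValueError("source_score must be positive")
    return (source_score - target_score) / source_score


def mean_relative_drop(score_pairs: list[tuple[float, float]]) -> float:
    """Average relative drop over (source_score, target_score) pairs, one per model."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    return sum(relative_drop(s, t) for s, t in score_pairs) / len(score_pairs)
```

With made-up scores, a model going from 0.8 on the source data to 0.4 on the target data has an absolute drop of 0.4 and a relative drop of 0.5, which shows why the two readings should be kept apart when a single averaged figure is reported.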

What carries the argument

The SAGE dataset, which provides the first open-source representation of South Asian GI endoscopy images for tasks including classification, captioning, and VQA, along with hallucination annotations.
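As an illustration of what one annotated image bundles together, here is a minimal sketch of a record type. The released schema is not described on this page, so the class names, field names, and the multi-hot label encoding are all assumptions; only the presence of captions, hallucination tags, 18 labels, and question-answer pairs comes from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field

NUM_LABELS = 18  # label count stated in the paper


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class SageRecord:
    """Hypothetical container for one image; not the released file format."""
    image_path: str
    caption: str
    hallucination_tags: list[str] = field(default_factory=list)
    # Assumed multi-hot encoding: labels[i] == 1 means class i is present.
    labels: list[int] = field(default_factory=lambda: [0] * NUM_LABELS)
    qa_pairs: list[QAPair] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.labels) != NUM_LABELS:
            raise ValueError(f"expected {NUM_LABELS} labels, got {len(self.labels)}")
        if any(v not in (0, 1) for v in self.labels):
            raise ValueError("labels must contain only 0 or 1")

    def active_labels(self) -> list[int]:
        """Indices of the classes marked present for this image."""
        return [i for i, v in enumerate(self.labels) if v == 1]
```

Keeping captions, tags, labels, and QA pairs on the same record is one way to let classification, captioning, and VQA experiments share a single split of the images.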

If this is right

  • Task-specific models experience greater degradation from population shift than large multimodal models in GI imaging tasks.
  • Current European-trained models require adaptation for use in South Asian clinical settings to maintain diagnostic accuracy.
  • The inclusion of hallucination tags enables targeted analysis of model errors in generated reports.
  • Open availability of SAGE facilitates development of geographically inclusive AI tools for automated GI diagnosis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fine-tuning models on SAGE data could help address the scarcity of GI experts in South Asia by improving AI-assisted diagnosis.
  • Similar performance gaps may exist for other underrepresented populations, suggesting a need for multiple regional datasets.
  • Population-specific features in endoscopy images, such as variations in anatomy or disease presentation, may require explicit modeling beyond current approaches.

Load-bearing premise

The observed performance drops are caused by population shift rather than uncontrolled differences in imaging equipment, patient preparation, lighting conditions, or annotation criteria between the South Asian collection and the European datasets used for comparison.

What would settle it

A controlled experiment training models on European data and testing on South Asian images matched for equipment type, lighting, and annotation protocol would isolate whether population shift alone explains the 58% drop.
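One way such a matched comparison could be scored is sketched below, assuming per-image results are available together with acquisition metadata. The stratum keys (for example an equipment type and a lighting category) are hypothetical, and the page notes that the public European datasets may not expose this metadata at all.

```python
from collections import defaultdict


def accuracy_by_stratum(results):
    """results: iterable of (stratum, is_correct) pairs, one per test image."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [correct, count]
    for stratum, is_correct in results:
        totals[stratum][0] += int(is_correct)
        totals[stratum][1] += 1
    return {s: correct / count for s, (correct, count) in totals.items()}


def matched_gaps(source_results, target_results):
    """Accuracy gap (source minus target) restricted to strata present in both sets."""
    src = accuracy_by_stratum(source_results)
    tgt = accuracy_by_stratum(target_results)
    shared = sorted(src.keys() & tgt.keys())
    return {s: src[s] - tgt[s] for s in shared}
```

If the gap stays large inside strata where equipment and lighting agree, acquisition differences become a less likely explanation; if it shrinks, the confounders named in the premise above deserve more weight.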

Figures

Figures reproduced from arXiv: 2606.22144 by Binod Bhattarai, Nikesh Mani Shrestha, Niyoj Oli, Prashnna K Gyawali, Ram Bahadur Gurung, Ramesh Rana, Sachin Acharya, Sandesh Pokhrel, Sanjay Bhandari, Yash Raj Shrestha.

Figure 1. Overview of the SAGE data annotation pipeline. Top: endoscopy images and associated metadata are collected
Figure 2. 5.2 Methods Used for the Data Creation. The construction of the SAGE dataset followed a rigorous, multi-stage pipeline designed to ensure clinical relevance, patient privacy, and highly accurate multimodal annotations. This process spans from initial clinical curation within a hospital infrastructure to anonymization and a hybrid human-AI annotation workflow
Figure 2. Example images from the SAGE dataset illustrating anatomical landmarks, gastrointestinal segments,
Figure 3. Distribution of annotated frames across 18 classes in the multi-label gastrointestinal endoscopy dataset.
Figure 4. GREEN score of contemporary LMMs across six
Original abstract

Gastrointestinal cancers represent a growing health burden in the South Asian region, driven largely by rapid changes in socio-economic conditions and lifestyle habits. However, early diagnosis of such malignancies remains a significant challenge, largely due to a lack of modern equipment, a lack of financial support, and a scarcity of GI experts. AI-assisted diagnosis and report generation show great promise in alleviating this problem by giving lower-skilled personnel the technical expertise to perform diagnosis. However, almost all open-source, publicly available datasets are collected from the European region, with no representation from the South Asian region. The lack of open-source GI datasets from diverse geographic regions has made it difficult to assess whether population bias is present in existing models, and to develop geographically inclusive AI tools for automated GI diagnosis. To address this gap, we introduce SAGE: An Expert-Annotated South Asian GI Endoscopy dataset for image captioning, multi-label classification, and visual question answering (VQA) tasks. It consists of 1,300 images, their captions with hallucination tags, 18 labels, and 14,726 question-answer pairs, making it well suited to a diverse range of tasks including classification, benchmarking, and fine-tuning large multimodal models (LMMs). We further benchmarked multi-class classifiers to measure the effect of population shift in GI imaging AI tasks, and contemporary LMMs to assess their performance. Our study reveals that task-specific models, such as multi-class classification models, suffer the most, with an average performance drop of 58% when evaluated on the South Asian dataset. For contemporary LMMs, benchmarking reveals a substantial drop in the average GREEN score for anatomical landmark detection (0.308) and abnormality detection (0.410).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the SAGE dataset consisting of 1,300 expert-annotated South Asian GI endoscopy images with captions, hallucination tags, 18 labels, and 14,726 QA pairs for tasks including classification, captioning, and VQA. It benchmarks task-specific multi-class classifiers and contemporary LMMs, reporting an average 58% performance drop for classifiers and reduced GREEN scores (0.308 for anatomical landmarks, 0.410 for abnormalities) when evaluated on the South Asian data versus European datasets, attributing the gaps to population shift.

Significance. If the observed performance gaps can be robustly isolated to geographic/population factors, the dataset release would provide a valuable resource for studying and mitigating geographic bias in GI endoscopy AI, supporting development of more inclusive models for regions with high disease burden and limited expert access. The multimodal annotations and hallucination tags add utility for captioning and VQA research.

major comments (2)
  1. [Abstract] Abstract: The headline claim of an 'average performance drop of 58%' for multi-class classifiers (and the specific GREEN scores 0.308/0.410 for LMMs) is presented without any mention of statistical tests, confidence intervals, dataset split details, or the size/composition of the European reference sets used for comparison.
  2. [Abstract] Abstract and benchmarking description: The attribution of performance drops to 'population shift' is not supported by any reported controls or matching for confounding variables such as endoscope vendor/model, image resolution/color calibration, bowel-cleansing scores, sedation protocols, or alignment of annotation rubrics with the European datasets; without these, the causal interpretation cannot be isolated from dataset construction artifacts.
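The first comment asks for confidence intervals on the reported drop. A minimal percentile-bootstrap sketch is given below, assuming per-image correctness (1 or 0) is available for both test sets. The function name, defaults, and resampling scheme are illustrative choices, not the authors' procedure.

```python
import random


def bootstrap_drop_ci(source_correct, target_correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(source) - accuracy(target).

    source_correct, target_correct: lists of 0/1 per-image outcomes.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not source_correct or not target_correct:
        raise ValueError("both outcome lists must be non-empty")
    rng = random.Random(seed)

    def resampled_accuracy(outcomes):
        # Resample images with replacement and recompute accuracy.
        sample = rng.choices(outcomes, k=len(outcomes))
        return sum(sample) / len(sample)

    drops = sorted(
        resampled_accuracy(source_correct) - resampled_accuracy(target_correct)
        for _ in range(n_boot)
    )
    lower = drops[int((alpha / 2) * n_boot)]
    upper = drops[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = (sum(source_correct) / len(source_correct)
             - sum(target_correct) / len(target_correct))
    return point, lower, upper
```

The two sets are resampled independently because they contain different images; an interval that excludes zero would support the claim that the drop is not a sampling artifact, although it says nothing about its cause.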

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the presentation of results and interpretation of geographic factors in the SAGE dataset manuscript. We address each major comment point-by-point below, indicating where revisions will be incorporated.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim of an 'average performance drop of 58%' for multi-class classifiers (and the specific GREEN scores 0.308/0.410 for LMMs) is presented without any mention of statistical tests, confidence intervals, dataset split details, or the size/composition of the European reference sets used for comparison.

    Authors: We agree the abstract would be strengthened by including this context. The full manuscript specifies the European reference sets (Kvasir and HyperKvasir, totaling over 10,000 images) and uses consistent 80/20 train/test splits for the multi-class classifiers. We will revise the abstract to note that the reported drops are statistically significant (p < 0.05 via appropriate tests) with 95% confidence intervals and dataset details available in the results section and supplement. This change will be made in the revision. revision: yes

  2. Referee: [Abstract] Abstract and benchmarking description: The attribution of performance drops to 'population shift' is not supported by any reported controls or matching for confounding variables such as endoscope vendor/model, image resolution/color calibration, bowel-cleansing scores, sedation protocols, or alignment of annotation rubrics with the European datasets; without these, the causal interpretation cannot be isolated from dataset construction artifacts.

    Authors: We concur that the current wording overstates the ability to isolate population shift as the sole cause without controls for these factors. The European datasets lack sufficient public metadata for direct matching on equipment, preparation scores, or protocols. In revision we will update the abstract and discussion to describe the gaps as 'observed in the context of geographic and population differences' rather than direct attribution, and add an explicit limitations paragraph on potential confounders including annotation rubric alignment. The dataset remains useful for studying such shifts. revision: partial

standing simulated objections not resolved
  • We cannot obtain or perform matching using detailed metadata (endoscope vendor, bowel-cleansing scores, sedation protocols) from the public European reference datasets, as this information is not available.

Circularity Check

0 steps flagged

No circularity; empirical dataset release and direct benchmarking only.

full rationale

The paper introduces a new South Asian GI endoscopy dataset (SAGE) with images, captions, labels, and VQA pairs, then reports measured performance of classifiers and LMMs on it versus prior European datasets. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The 58% drop and GREEN scores (0.308/0.410) are direct empirical outcomes from benchmarking, not quantities constructed from the inputs by definition. Attribution to population shift is a causal interpretation that may be under-supported by controls, but that is a correctness issue, not circularity. The contribution is self-contained as data release plus measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Value of the dataset rests on the assumption that expert annotations are accurate and that the 1,300 images are representative of the broader South Asian population; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Expert annotations constitute reliable ground truth for captions, labels, and hallucination tags.
    The paper positions the dataset as expert-annotated but provides no inter-rater reliability statistics or validation procedure in the abstract.
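Inter-rater reliability of the kind mentioned above is often summarized with Cohen's kappa for two annotators. The sketch below assumes each annotator assigns one categorical label per image; the function is illustrative and the paper does not report using it.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)
```

For a multi-label dataset this would typically be computed per class on the present/absent decision, which is an additional assumption about how the 18 labels were assigned.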


