SAGE: An Expert-Annotated South Asian GI Endoscopy Dataset for Multimodal Learning and Hallucination Analysis
Pith reviewed 2026-06-26 12:15 UTC · model grok-4.3
The pith
South Asian GI endoscopy images cause existing AI models to drop 58 percent in performance compared to European data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the SAGE dataset consisting of 1,300 South Asian GI endoscopy images, their captions with hallucination tags, 18 labels, and 14,726 question-answer pairs. Benchmarking reveals that multi-class classification models suffer an average 58% performance drop on the South Asian dataset, and contemporary LMMs show substantial drops in GREEN scores for anatomical landmark detection (0.308) and abnormality detection (0.410).
What carries the argument
The SAGE dataset, which provides the first open-source representation of South Asian GI endoscopy images for tasks including classification, captioning, and VQA, along with hallucination annotations.
If this is right
- Task-specific models experience greater degradation from population shift than large multimodal models in GI imaging tasks.
- Current European-trained models require adaptation for use in South Asian clinical settings to maintain diagnostic accuracy.
- The inclusion of hallucination tags enables targeted analysis of model errors in generated reports.
- Open availability of SAGE facilitates development of geographically inclusive AI tools for automated GI diagnosis.
Where Pith is reading between the lines
- Fine-tuning models on SAGE data could help address the scarcity of GI experts in South Asia by improving AI-assisted diagnosis.
- Similar performance gaps may exist for other underrepresented populations, suggesting a need for multiple regional datasets.
- Population-specific features in endoscopy images, such as variations in anatomy or disease presentation, may require explicit modeling beyond current approaches.
Load-bearing premise
The observed performance drops are caused by population shift rather than uncontrolled differences in imaging equipment, patient preparation, lighting conditions, or annotation criteria between the South Asian collection and the European datasets used for comparison.
What would settle it
A controlled experiment training models on European data and testing on South Asian images matched for equipment type, lighting, and annotation protocol would isolate whether population shift alone explains the 58% drop.
Figures
read the original abstract
Gastrointestinal cancers represent a growing health burden in the South Asian region, driven largely by rapid changes in socio-economic conditions & lifestyle habits. However, early diagnosis of such malignancies remains a significant challenge, largely due to a lack of modern equipment, lack of financial support, and a scarcity of GI experts. AI-assisted diagnosis & report generation, show great promise in alleviating this problem by providing low-skill manpower the technical expertise to perform diagnosis. However, almost all open-source, publicly available datasets are predominantly collected from the European region, with no representation from the South Asian region. The lack of open-source GI datasets from diverse geographic regions has made it difficult to assess whether population bias is present in existing models, and to develop geographically inclusive AI tools for automated GI diagnosis. To address this gap, we introduce SAGE: An Expert-Annotated South Asian GI Endoscopy dataset for image captioning, multi-label classification, and visual question answering (VQA) tasks. It consists of 1,300 images, their captions along with hallucination tag, 18 labels and 14,726 question-answer pairs making it well-suited for diverse range of tasks including classification, benchmarking, and fine-tuning large multimodal models (LMMs). We further conducted benchmarking of multi-class classifiers on the effect of population shift in GI imaging AI tasks, and contemporary LMMs on their performance. Our study reveals that task-specific models, such as multi-class classification models, suffer the most, with an average performance drop of 58% when evaluated on the South Asian dataset. For contemporary LMMs, benchmarking reveals a substantial drop in the average GREEN score for anatomical landmark detection (0.308) and abnormality detection (0.410).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the SAGE dataset consisting of 1,300 expert-annotated South Asian GI endoscopy images with captions, hallucination tags, 18 labels, and 14,726 QA pairs for tasks including classification, captioning, and VQA. It benchmarks task-specific multi-class classifiers and contemporary LMMs, reporting an average 58% performance drop for classifiers and reduced GREEN scores (0.308 for anatomical landmarks, 0.410 for abnormalities) when evaluated on the South Asian data versus European datasets, attributing the gaps to population shift.
Significance. If the observed performance gaps can be robustly isolated to geographic/population factors, the dataset release would provide a valuable resource for studying and mitigating geographic bias in GI endoscopy AI, supporting development of more inclusive models for regions with high disease burden and limited expert access. The multimodal annotations and hallucination tags add utility for captioning and VQA research.
major comments (2)
- [Abstract] Abstract: The headline claim of an 'average performance drop of 58%' for multi-class classifiers (and the specific GREEN scores 0.308/0.410 for LMMs) is presented without any mention of statistical tests, confidence intervals, dataset split details, or the size/composition of the European reference sets used for comparison.
- [Abstract] Abstract and benchmarking description: The attribution of performance drops to 'population shift' is not supported by any reported controls or matching for confounding variables such as endoscope vendor/model, image resolution/color calibration, bowel-cleansing scores, sedation protocols, or alignment of annotation rubrics with the European datasets; without these, the causal interpretation cannot be isolated from dataset construction artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the presentation of results and interpretation of geographic factors in the SAGE dataset manuscript. We address each major comment point-by-point below, indicating where revisions will be incorporated.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline claim of an 'average performance drop of 58%' for multi-class classifiers (and the specific GREEN scores 0.308/0.410 for LMMs) is presented without any mention of statistical tests, confidence intervals, dataset split details, or the size/composition of the European reference sets used for comparison.
Authors: We agree the abstract would be strengthened by including this context. The full manuscript specifies the European reference sets (Kvasir and HyperKvasir, totaling over 10,000 images) and uses consistent 80/20 train/test splits for the multi-class classifiers. We will revise the abstract to note that the reported drops are statistically significant (p < 0.05 via appropriate tests) with 95% confidence intervals and dataset details available in the results section and supplement. This change will be made in the revision. revision: yes
-
Referee: [Abstract] Abstract and benchmarking description: The attribution of performance drops to 'population shift' is not supported by any reported controls or matching for confounding variables such as endoscope vendor/model, image resolution/color calibration, bowel-cleansing scores, sedation protocols, or alignment of annotation rubrics with the European datasets; without these, the causal interpretation cannot be isolated from dataset construction artifacts.
Authors: We concur that the current wording overstates the ability to isolate population shift as the sole cause without controls for these factors. The European datasets lack sufficient public metadata for direct matching on equipment, preparation scores, or protocols. In revision we will update the abstract and discussion to describe the gaps as 'observed in the context of geographic and population differences' rather than direct attribution, and add an explicit limitations paragraph on potential confounders including annotation rubric alignment. The dataset remains useful for studying such shifts. revision: partial
- We cannot obtain or perform matching using detailed metadata (endoscope vendor, bowel-cleansing scores, sedation protocols) from the public European reference datasets, as this information is not available.
Circularity Check
No circularity; empirical dataset release and direct benchmarking only.
full rationale
The paper introduces a new South Asian GI endoscopy dataset (SAGE) with images, captions, labels, and VQA pairs, then reports measured performance of classifiers and LMMs on it versus prior European datasets. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The 58% drop and GREEN scores (0.308/0.410) are direct empirical outcomes from benchmarking, not quantities constructed from the inputs by definition. Attribution to population shift is a causal interpretation that may be under-supported by controls, but that is a correctness issue, not circularity. The contribution is self-contained as data release plus measurement.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert annotations constitute reliable ground truth for captions, labels, and hallucination tags.
Reference graph
Works this paper leans on
-
[1]
Proceedings of the 8th ACM on Multimedia Systems Conference , pages =
KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection , author =. Proceedings of the 8th ACM on Multimedia Systems Conference , series =. 2017 , isbn =. doi:10.1145/3083187.3083212 , acmid =
-
[2]
Scientific Data , number =
Borgli, Hanna and others , doi =. Scientific Data , number =
-
[3]
Jha, Debesh and others , title =. 2023 , isbn =. doi:10.1007/978-3-031-47679-2_10 , booktitle =
-
[4]
MultiMedia Modeling , year =
Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy , author =. MultiMedia Modeling , year =
-
[5]
Proceedings of the First International Workshop on Vision-Language Models for Biomedical Applications (VLM4Bio '24) , year=
Kvasir-VQA: A Text-Image Pair GI Tract Dataset , author=. Proceedings of the First International Workshop on Vision-Language Models for Biomedical Applications (VLM4Bio '24) , year=
-
[6]
2025 , eprint=
PolypDB: A Curated Multi-Center Dataset for Development of AI Algorithms in Colonoscopy , author=. 2025 , eprint=
2025
-
[7]
Biffi, Carlo and others , journal =. 2024 , publisher =. doi:10.1038/s41597-024-03359-0 , url =
-
[8]
\ Bernal Del Nozal\ , Jorge and Sanchez, \ F. Javier\ and Fernando Vilari \ n o. Towards Automatic Polyp Detection with a Polyp Appearance Model. Pattern Recognition. 2012. doi:10.1016/j.patcog.2012.03.002
-
[9]
Jorge Bernal and S \'a nchez, F. Javier and Gloria Fern \'a ndez-Esparrach and Debora Gil and Cristina Rodr \'i guez and Fernando Vilari \ n o. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. 2015. doi:10.1016/j.compmedimag.2015.02.007
-
[10]
GastroNet-5M: A Multicenter Dataset for Developing Foundation Models in Gastrointestinal Endoscopy , journal =. 2026 , issn =. doi:https://doi.org/10.1053/j.gastro.2025.07.030 , url =
-
[11]
2024 , publisher=
Global Cancer Observatory: Cancer Today , author=. 2024 , publisher=
2024
-
[12]
proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025 , year =
Khanal, Bidur AND others , title =. proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025 , year =
2025
-
[13]
Data Engineering in Medical Imaging , year=
Gautam, Sushant and Riegler, Michael and others , title=. Data Engineering in Medical Imaging , year=
-
[14]
Gastroenterology , volume =
Endoscopic Screening in Asian Countries Is Associated With Reduced Gastric Cancer Mortality: A Meta-analysis and Systematic Review , author =. Gastroenterology , volume =. 2018 , month = aug, doi =
2018
-
[15]
The New England Journal of Medicine , volume =
Colorectal-Cancer Incidence and Mortality with Screening Flexible Sigmoidoscopy , author =. The New England Journal of Medicine , volume =. 2012 , month = jun, doi =
2012
-
[16]
Anirvan, Prajna and Meher, Dinesh and Singh, Shivaram P. , journal =. 2020 , volume =. doi:10.5005/jp-journals-10018-1322 , pmid =
-
[17]
Scientific data3(1), 1–9 (2016) https://doi.org/10.1038/sdata.2016.18
Wilkinson, Mark D. and others , journal =. The. 2016 , volume =. doi:10.1038/sdata.2016.18 , url =
-
[18]
2024 , eprint=
GPT-4 Technical Report , author=. 2024 , eprint=
2024
-
[19]
Giannantonio , title =
Cristina M. Giannantonio , title =. Organizational Research Methods , volume =. 2010 , doi =
2010
-
[20]
GitHub repository , howpublished =
Santiago Castro , title =. GitHub repository , howpublished =. 2017 , publisher =
2017
-
[21]
and others , title=
van Doorn, Sascha C. and others , title=. Official journal of the American College of Gastroenterology | ACG , year=
-
[22]
Garc. Intrapapillary capillary loop classification in magnification endoscopy: open dataset and baseline methodology , journal=. 2020 , month=. doi:10.1007/s11548-020-02127-w , url=
-
[23]
Chandrasinghe, P. C. and Ediriweera, D. S. and Hewavisenthi, J. and Kumarage, S. K. and Fernando, F. R. and Deen, K. I. , title =. BMC Research Notes , year =. doi:10.1186/s13104-017-2869-1 , pmid =
-
[24]
Chaudhari, and Jean-Benoit Delbrouck
Ostmeier, Sophie and Xu, Justin and Chen, Zhihong and Varma, Maya and Blankemeier, Louis and Bluethgen, Christian and Michalson, Arne Edward and Moseley, Michael and Langlotz, Curtis and Chaudhari, Akshay S and Delbrouck, Jean-Benoit. GREEN : Generative Radiology Report Evaluation and Error Notation. Findings of the Association for Computational Linguisti...
-
[25]
Oncology Reviews , volume =
Changing Colorectal Cancer Trends in Asians: Epidemiology and Risk Factors , author =. Oncology Reviews , volume =. 2023 , doi =
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.