Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework

Nanaka Hosokawa; Ryo Takahashi; Tomoya Kitano; Yukihiro Iida; Chisako Muramatsu; Tatsuro Hayashi; Yuta Seino; Xiangrong Zhou; Takeshi Hara; Akitoshi Katsumata; Hiroshi Fujita

arxiv: 2510.02001 · v6 · submitted 2025-10-02 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-18 10:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: jaw cysts · dental panoramic radiographs · vision-language models · GPT · self-correction loop · structured output · radiological findings · hallucination reduction


The pith

A self-correction loop with structured output improves GPT-VLM accuracy on jaw cyst findings in dental X-rays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Self-correction Loop with Structured Output (SLSO) framework to make GPT-based vision-language models produce more reliable radiological reports for jaw cysts seen on panoramic radiographs. The method runs a ten-step pipeline that generates structured descriptions, extracts tooth numbers, checks consistency, and regenerates outputs up to five times when inconsistencies appear. It is evaluated against the standard Chain-of-Thought prompting approach on seven items that include lesion transparency, borders, root resorption, tooth movement, and tooth numbering. The loop produces higher accuracy on several items and forces explicit statements of negative findings while reducing hallucinations. The study demonstrates that external validation steps can tighten the outputs of current VLMs for this clinical task.

Core claim

The SLSO framework acts as an external validation mechanism for GPT outputs through a 10-step integrated processing methodology that incorporates image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. When tested on dental panoramic radiographs containing jaw cysts, the framework improved output accuracy over the conventional Chain-of-Thought method for multiple evaluation items, with the largest gains in tooth number identification, tooth movement detection, and root resorption assessment. In successful cases the loop produced consistently structured outputs after at most five regenerations, enforced explicit negative finding descriptions, and suppressed hallucinations.

What carries the argument

The Self-correction Loop with Structured Output (SLSO) framework, which performs consistency checking on generated fields and triggers up to five iterative regenerations to enforce accurate structured outputs from the VLM.
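
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical stand-in for the GPT call, the record fields `finding_text` and `tooth_numbers` are invented for the example, two-digit FDI-style tooth numbers are assumed, and the only consistency rule shown is that the tooth numbers in the free text match the structured field. The cap of five regenerations comes from the paper; counting the first pass separately is an interpretation.

```python
import re
from typing import Callable, Optional

MAX_REGENERATIONS = 5  # cap stated in the paper

# Assumption: teeth are written as two-digit FDI-style numbers (11-48).
TOOTH_RE = re.compile(r"\b([1-4][1-8])\b")


def extract_tooth_numbers(text: str) -> set[int]:
    """Collect every standalone two-digit tooth number found in free text."""
    return {int(match) for match in TOOTH_RE.findall(text)}


def is_consistent(record: dict) -> bool:
    """Hypothetical rule: free-text tooth numbers must equal the structured field."""
    return extract_tooth_numbers(record["finding_text"]) == set(record["tooth_numbers"])


def run_self_correction_loop(
    generate: Callable[[int], dict],
    max_regenerations: int = MAX_REGENERATIONS,
) -> tuple[Optional[dict], int]:
    """Return (record, regenerations_used); record is None if the cap is hit."""
    # Attempt 0 is the first pass; attempts 1..max_regenerations are regenerations.
    for attempt in range(max_regenerations + 1):
        record = generate(attempt)
        if is_consistent(record):
            return record, attempt
    return None, max_regenerations
```

When the cap is reached without a consistent record, the sketch returns `None` rather than the last attempt, so a caller cannot mistake an unvalidated output for a validated one. Whether the paper's pipeline behaves this way is not stated.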

If this is right

The framework achieves consistently structured outputs after at most five regenerations in successful cases.
It enforces explicit negative finding descriptions in the generated reports.
It suppresses hallucinations compared with standard Chain-of-Thought prompting.
Accurate identification of lesions that span multiple teeth remains limited.
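
One way to make "explicit negative findings" checkable is to require every evaluation item to be present and non-blank in the structured record, so that "no root resorption" must be written out instead of left unsaid. The sketch below assumes this. The seven item names follow the abstract, but the snake_case keys, the function name, and the rule itself are illustrative and are not taken from the paper.

```python
# The seven evaluation items named in the abstract, as hypothetical record keys.
REQUIRED_ITEMS = (
    "transparency",
    "internal_structure",
    "borders",
    "root_resorption",
    "tooth_movement",
    "relationship_with_other_structures",
    "tooth_number",
)


def unstated_items(record: dict) -> list[str]:
    """List items that are absent, None, or a blank string in the record.

    A negative finding counts as stated only if it is written out
    (for example "absent"); omitting the item does not count.
    """
    missing = []
    for item in REQUIRED_ITEMS:
        value = record.get(item)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(item)
    return missing
```

A non-empty result could serve as one trigger for regeneration, alongside whatever other consistency checks the pipeline applies.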

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency-checking loop could be applied to other dental or medical imaging tasks where VLMs produce variable outputs.
Clinical deployment would still require validation on diverse patient populations and scanner types beyond the preliminary dataset.
Combining the loop with larger or fine-tuned VLMs might further reduce the number of regeneration steps needed.

Load-bearing premise

That consistency checking combined with up to five iterative regenerations can reliably enforce accurate structured outputs and suppress hallucinations for jaw cyst findings without introducing new errors.

What would settle it

A test on a larger set of panoramic radiographs containing extensive jaw cysts that measures whether SLSO accuracy gains hold or whether new errors appear after the five-regeneration limit.

Original abstract

Vision-language models (VLMs) such as GPT (Generative Pre-Trained Transformer) have shown potential for medical image interpretation; however, challenges remain in generating reliable radiological findings in clinical practice, as exemplified by dental pathologies. This study proposes a Self-correction Loop with Structured Output (SLSO) framework as an integrated processing methodology to enhance the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs. Dental panoramic radiographs with jaw cysts were used to implement a 10-step integrated processing framework incorporating image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. The framework functioned as an external validation mechanism for GPT outputs. Performance was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment. In successful cases, consistently structured outputs were achieved after up to five regenerations. The framework enforced explicit negative finding descriptions and suppressed hallucinations, although accurate identification of extensive lesions spanning multiple teeth remained limited. This investigation established the feasibility of the proposed integrated processing methodology and provided a foundation for future validation studies with larger, more diverse datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SLSO adds a structured self-correction loop to GPT for jaw cyst findings on panoramic radiographs and claims gains over CoT, but supplies no numbers or protocol details to support the accuracy improvements.


Two things matter about this paper. First, it presents a 10-step SLSO framework that layers structured output generation, tooth number extraction, consistency checking, and up to five iterative regenerations on top of GPT for dental panoramic radiographs. Second, it claims this produces better results than plain Chain-of-Thought prompting on seven evaluation items, especially tooth number, tooth movement, and root resorption. The improvements are described only in qualitative terms, with no supporting data.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a preliminary study proposing a Self-correction Loop with Structured Output (SLSO) framework as a 10-step integrated processing methodology to improve the reliability of GPT-based vision-language model outputs for generating radiological findings on jaw cysts in dental panoramic radiographs. The approach incorporates image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration (up to five times). Performance is compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The authors claim that SLSO improved output accuracy on multiple items, with the largest gains in tooth number identification, tooth movement detection, and root resorption assessment, while enforcing explicit negative findings and suppressing hallucinations in successful cases, though limitations persist for extensive lesions spanning multiple teeth.

Significance. If substantiated, the SLSO framework offers a practical external validation mechanism for reducing hallucinations and enforcing structured outputs in VLM-based medical image interpretation, which is a recognized challenge in dental radiology. The explicit comparison to CoT and focus on clinically relevant items (e.g., root resorption and tooth movement) provide a foundation for future work on self-correcting AI systems. However, the preliminary status and absence of quantitative metrics limit its current significance; larger-scale validation with objective expert agreement rates would be needed to establish broader impact.

major comments (2)

[Abstract] Abstract: The central claim that 'The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment' is unsupported by any quantitative metrics, case counts, per-item accuracy percentages, deltas, statistical tests, or success/failure rates for the consistency loop. This is load-bearing for the paper's main contribution.
[Abstract] Abstract and framework description: No details are provided on the evaluation protocol, including how accuracy was measured (e.g., by expert radiologist agreement), the total number of radiographs, or the precise implementation of consistency checking and hallucination suppression. Without these, it is not possible to verify that observed improvements result from the SLSO mechanism rather than prompt engineering or selective reporting.
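
The quantitative reporting requested in these comments could take roughly the following shape. This is a hypothetical sketch: it assumes each radiograph has an expert reference record and scores each item by exact match, neither of which the paper describes, and the function names are invented for illustration.

```python
from typing import Sequence


def per_item_accuracy(
    predictions: Sequence[dict],
    references: Sequence[dict],
    items: Sequence[str],
) -> dict[str, float]:
    """Fraction of cases in which each item exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one case is required")
    return {
        item: sum(p.get(item) == r[item] for p, r in zip(predictions, references))
        / len(references)
        for item in items
    }


def accuracy_delta(method_a: dict[str, float], method_b: dict[str, float]) -> dict[str, float]:
    """Per-item difference in accuracy, method_a minus method_b."""
    return {item: method_a[item] - method_b[item] for item in method_a}
```

Reporting such per-item figures alongside the number of radiographs would let readers see where the two methods differ and by how much.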

minor comments (2)

[Abstract] The title refers to a 'Two-Stage Self-Correction Loop' while the abstract describes a '10-step integrated processing framework'; clarifying the relationship between these descriptions would improve precision.
[Abstract] The abstract states that 'accurate identification of extensive lesions spanning multiple teeth remained limited' but does not quantify this limitation or discuss implications for the framework's scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our preliminary study. We agree that the current version lacks sufficient quantitative support and methodological transparency to substantiate the claims, and we will make revisions to address these points directly.


Referee: [Abstract] Abstract: The central claim that 'The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth movement detection, and root resorption assessment' is unsupported by any quantitative metrics, case counts, per-item accuracy percentages, deltas, statistical tests, or success/failure rates for the consistency loop. This is load-bearing for the paper's main contribution.

Authors: We agree that the abstract claim is not supported by quantitative data. This is a limitation of the preliminary study, where improvements were observed qualitatively through case-by-case review of outputs rather than formal accuracy percentages or statistical analysis. In the revision we will tone down the abstract language to state that SLSO 'produced more consistent and structured outputs with fewer apparent errors' in the evaluated cases, remove the specific claim of 'improved output accuracy,' and add a results subsection that reports the exact number of radiographs processed along with qualitative per-item observations. We will also explicitly note that no statistical tests were applied due to the small exploratory sample. revision: yes
Referee: [Abstract] Abstract and framework description: No details are provided on the evaluation protocol, including how accuracy was measured (e.g., by expert radiologist agreement), the total number of radiographs, or the precise implementation of consistency checking and hallucination suppression. Without these, it is not possible to verify that observed improvements result from the SLSO mechanism rather than prompt engineering or selective reporting.

Authors: We acknowledge the absence of these details. The evaluation consisted of internal review by the authors comparing generated findings against visible radiographic features, without independent blinded expert radiologist scoring. We will expand the Methods and Results sections to specify: the total number of panoramic radiographs used, the step-by-step criteria applied during consistency checking and regeneration (up to five iterations), and concrete examples of how negative findings were enforced and hallucinations were identified and corrected. This added transparency will help distinguish the SLSO loop from standard prompting effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of SLSO vs CoT on fixed image set


The paper describes a 10-step SLSO processing pipeline applied to dental panoramic radiographs and reports qualitative improvements over Chain-of-Thought prompting across seven evaluation items. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described framework. The central claim rests on observed output differences after iterative regeneration rather than any reduction of the result to its own inputs by construction. This is a standard empirical feasibility study whose validity depends on external validation data and metrics, not internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper relies on the general capabilities of GPT-based VLMs for image interpretation and introduces the SLSO framework as a new external validation mechanism without additional fitted parameters or external benchmarks in the abstract.

axioms (1)

domain assumption: GPT-based VLMs can produce usable structured radiological findings when guided by multi-step consistency checks and regeneration.
Core premise enabling the self-correction loop to function as an external validator.

invented entities (1)

SLSO framework (no independent evidence)
purpose: To enforce structured outputs, consistency, and iterative correction for VLM-generated jaw cyst findings.
Newly proposed 10-step integrated processing methodology.

pith-pipeline@v0.9.0 · 5850 in / 1321 out tokens · 41737 ms · 2026-05-18T10:30:10.461101+00:00 · methodology


