MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Chun-Xiao Jin; Cui Huang; Gui-Song Xia; James Kit Hon Tsoi; Jia-Min Wu; Meng-Xun Li; Wen-Hui Deng; Yue Han; Zhi-Xing Wu

arxiv: 2604.14866 · v1 · submitted 2026-04-16 · 💻 cs.CV · cs.AI

Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords dentistry · vision-language models · clinical image annotation · intraoral photography · visual question answering · image captioning · multi-label classification

The pith

A new annotated dataset of dental images reveals that state-of-the-art vision-language models struggle with fine-grained understanding of intraoral scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a large-scale dataset of dental images with a semi-structured annotation scheme to support vision-language models in clinical dentistry. It shows that even advanced models only reach moderate accuracy on tasks like visual question answering and produce inconsistent descriptions in image captioning. By curating 60,669 images and annotating 2,588 of them, the work creates benchmarks that highlight the need for better fine-grained analysis in oral health imaging. This resource is released publicly to enable further development of such models.

Core claim

MetaDent supplies a dentistry image dataset, a semi-structured annotation framework capturing hierarchical clinical details through high-level summaries and point-by-point abnormality descriptions, and LLM-derived benchmarks including 15K VQA pairs and an 18-class multi-label classification set. Evaluations confirm that state-of-the-art VLMs achieve only moderate accuracy and generate incomplete captions for intraoral photographs.

What carries the argument

The semi-structured annotation framework that combines high-level image summaries with free-text point-by-point descriptions of abnormalities, enabling scalable and task-agnostic representations for clinical evaluation.

Load-bearing premise

The semi-structured annotation framework and subsequent LLM-driven benchmark creation reliably preserve clinically nuanced dental details without semantic drift or selection bias.

What would settle it

Human experts re-labeling a subset of the images and finding substantial differences from the provided annotations, or new VLMs achieving high accuracy and complete captions on the derived tasks.

Figures

Figures reproduced from arXiv: 2604.14866 by Chun-Xiao Jin, Cui Huang, Gui-Song Xia, James Kit Hon Tsoi, Jia-Min Wu, Meng-Xun Li, Wen-Hui Deng, Yue Han, Zhi-Xing Wu.

**Figure 1.** Data processing pipeline. (A) Composition and distribution of the dataset across 3 sources, with the Internet-scraped subset (Data Source 3) being the largest. Image features were extracted using DINOv3 (Siméoni et al. 2025) and projected into 2D using principal component analysis; darker, larger dots represent labeled samples, while lighter and smaller dots indicate the remaining images in the collection. …

**Figure 2.** Dataset statistics. (A) Overall image area distribution: Image size (in log10 pixels²) shows that most images fall within the 5.0 to 6.0 range. (B) Aspect ratio by data source: DS3 exhibits the widest distribution of aspect ratios, followed by DS1, whereas DS2 displays a more uniform aspect ratio profile. (C) Image content composition: Approximately 80% of images contain human subjects. Of the entire datas…

**Figure 3.** Performance of 5 vision-language models on MetaDent across 3 tasks. (A) Visual question answering (VQA) accuracy on multiple-choice questions (MCQ). (B) VQA accuracy on true/false questions (TFQ). (C) F1 for the 18-class multilabel classification task. (D) Exact Match accuracy for multilabel predictions. (E) Image captioning semantic similarity reported as BERTScore-F1 against reference captions. (F) Conte…

**Figure 4.** Category-level performance for the multilabel classification task. (A) Exact Match accuracy across 18 categories for 5 models. (B) Distribution of samples per category for the classification task. (C) Names of the 18 categories (see Appendix …

Original abstract

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MetaDent adds a useful dental image dataset with meta-annotations and VLM benchmarks, but the LLM label conversion step needs quantitative validation numbers to be fully convincing.

The paper's main point is a new collection of dental images paired with a semi-structured labeling method that gets turned into VQA and classification benchmarks. They pulled 60,669 images from clinics, public sources, and the web, then annotated 2,588 of them with high-level summaries plus free-text notes on specific abnormalities. LLMs convert those into roughly 15K VQA pairs and an 18-class multi-label set, which they say humans checked for fidelity before running evaluations on current VLMs. The results show even strong models only reach moderate accuracy and give incomplete captions on intraoral scenes.

This is a straightforward resource paper that targets a clear gap in dentistry-specific vision-language data. Releasing the full dataset, annotations, and tools is the practical part that matters most for the subfield. It lets other groups test and improve models without starting from scratch on curation. The evaluation also makes a reasonable case that fine-grained dental details remain hard for off-the-shelf VLMs.

The soft spot is the validation of the LLM-generated benchmarks. The abstract notes human review and error analysis were done to confirm the labels stay faithful, yet it gives no numbers on agreement rates, exact-match percentages, or per-task error breakdowns. Without those, it's hard to separate model shortcomings from possible label noise or selection effects in the annotated subset. Details on how the 2,588 images were chosen from the larger pool and on inter-annotator consistency are also thin.

This paper is for researchers working on medical VLMs who need domain-specific resources, especially in dentistry or similar narrow clinical areas. A reader building or testing models on intraoral photos would get direct value from the data release and the baseline numbers. It deserves a serious referee because the dataset itself is new and the core claim about model limitations is testable once the validation metrics are added. I would send it for review with a request for quantitative checks on the label conversion and clearer annotation protocols.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaDent, a dataset of 60,669 dental images (with 2,588 meta-annotated via high-level summaries plus free-text abnormality points) from clinical, public, and web sources. LLMs are used to derive ~15K VQA pairs and an 18-class multi-label classification benchmark from these annotations, with human review and error analysis claimed to confirm fidelity. State-of-the-art VLMs are then benchmarked on VQA, classification, and captioning tasks, with the central finding that even advanced models achieve only moderate accuracy and generate inconsistent or incomplete descriptions of intraoral scenes.

Significance. If the LLM-converted benchmarks are shown to be faithful clinical proxies, MetaDent would provide a valuable, publicly released resource for advancing VLMs in the underexplored domain of intraoral photography. The semi-structured annotation approach and empirical demonstration of current model limitations on fine-grained dental understanding offer a useful baseline and could accelerate development of clinically relevant vision-language systems. The release of the dataset, annotations, and tools supports reproducibility.

major comments (2)

[Abstract] Abstract: The assertion that 'human review and error analysis' justifies that the LLM-driven conversion 'reliably preserves fidelity and semantic accuracy' is unsupported by any quantitative metrics (e.g., exact-match rate, Cohen's kappa, or per-task error breakdown on a held-out validation subset). This is load-bearing for the headline claim that moderate VLM accuracy reflects model limitations rather than label noise or selection bias in the 2,588-image annotated subset.
[Annotation and benchmark generation] Section describing the annotation and benchmark generation process: No details are provided on the exact annotation protocols for the 2,588 images, inter-rater agreement statistics, data splits, or the specific human validation procedure and error analysis results. Without these, it is impossible to quantify potential semantic drift or selection effects in the derived 15K VQA pairs and 18-class labels.

minor comments (2)

The abstract and methods would benefit from explicit listing of the VQA question templates and classification label definitions to aid reproducibility.
Consider including additional baselines (e.g., non-VLM classifiers or simpler vision-only models) in the classification and captioning evaluations to better contextualize the reported moderate accuracies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, which highlights important areas for improving the clarity and rigor of our presentation of the MetaDent dataset and benchmarks. We agree that the current manuscript would benefit from expanded details on the annotation and validation processes to better support our claims. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: The assertion that 'human review and error analysis' justifies that the LLM-driven conversion 'reliably preserves fidelity and semantic accuracy' is unsupported by any quantitative metrics (e.g., exact-match rate, Cohen's kappa, or per-task error breakdown on a held-out validation subset). This is load-bearing for the headline claim that moderate VLM accuracy reflects model limitations rather than label noise or selection bias in the 2,588-image annotated subset.

Authors: We agree that the abstract's reference to human review and error analysis would be more convincing with explicit quantitative support. In the revised manuscript, we will update the abstract to reference specific validation metrics and add a new subsection detailing the human review process. This will include exact-match rates between LLM-generated VQA pairs/labels and human corrections, a per-task error breakdown, and any applicable agreement statistics from the 2,588-image subset. These additions will strengthen the argument that observed VLM limitations reflect genuine model shortcomings rather than annotation issues. revision: yes
Referee: [Annotation and benchmark generation] Section describing the annotation and benchmark generation process: No details are provided on the exact annotation protocols for the 2,588 images, inter-rater agreement statistics, data splits, or the specific human validation procedure and error analysis results. Without these, it is impossible to quantify potential semantic drift or selection effects in the derived 15K VQA pairs and 18-class labels.

Authors: We acknowledge that the current manuscript omits these methodological specifics, which are necessary for full reproducibility and assessment of potential biases. We will substantially expand the annotation and benchmark generation section to include: (1) the exact protocols and guidelines used for meta-annotation of the 2,588 images (high-level summaries plus free-text abnormality points); (2) data split details for the annotated subset; (3) the full human validation procedure, including how reviewers interacted with LLM outputs; and (4) results from the error analysis, with quantitative metrics where available (e.g., agreement rates and error categorizations). If inter-rater statistics are limited due to the single-reviewer design for efficiency, we will explicitly note this and justify the approach. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical dataset construction and benchmarking

The paper presents an empirical workflow of image collection (60,669 dental images), semi-structured meta-annotation on a 2,588-image subset, LLM-assisted conversion to ~15K VQA pairs and 18-class labels, human validation, and VLM evaluation on captioning/VQA/classification tasks. No equations, derivations, fitted parameters, or predictions appear anywhere in the described chain. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim (VLMs achieve only moderate accuracy on intraoral scenes) is supported by direct benchmarking results rather than reducing to any input by construction. This matches the default expectation for non-circular empirical dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical modeling; the annotation scheme and LLM use are methodological choices rather than axioms or free parameters. No invented entities.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

[1]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Data Collection and Filtering. Candidate Mining from COYO-700M. To retrieve intraoral clinical photographs from the large-scale web dataset COYO-700M, we fine-tuned a binary image classifier from pre-trained ViT-L/16 weights (Dosovitskiy et al. 2020). Positive samples were initially composed of DS1, DS2, MOD (Rashid et al. 2024), and Oral D...

work page 2020
[2]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Evaluation Metric We evaluate VQA performance using the classification accuracy metric, computed as the proportion of questions answered correctly. Because the VQA benchmarks include different question formats, we calculate accuracy separately for multiple-choice questions and for binary true/false questions. In each case, accuracy is defined as: 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦...
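
As a rough illustration of the per-format accuracy described above (proportion of questions answered correctly, computed separately for multiple-choice and true/false questions), here is a minimal sketch. The record layout and the key names `format`, `predicted`, and `answer` are assumptions made for this example and are not taken from the released tools.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return {question_format: accuracy} for a list of VQA records.

    Each record is a dict with hypothetical keys (not from the paper):
      "format"    -> "MCQ" or "TFQ"
      "predicted" -> the model's answer
      "answer"    -> the reference answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["format"]] += 1
        if rec["predicted"] == rec["answer"]:
            correct[rec["format"]] += 1
    # Accuracy = correctly answered questions / all questions of that format.
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


# Toy records, invented purely for illustration.
records = [
    {"format": "MCQ", "predicted": "B", "answer": "B"},
    {"format": "MCQ", "predicted": "A", "answer": "C"},
    {"format": "TFQ", "predicted": True, "answer": True},
    {"format": "TFQ", "predicted": False, "answer": False},
]
scores = accuracy_by_format(records)
```

Keeping the two formats separate matters because chance level differs: a random guesser is expected to do better on true/false questions than on multiple-choice questions with more than two options.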

work page
[3]

The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions

Semantic Similarity (BERTScore): The semantic similarity between each model-generated caption and its reference caption was measured using BERTScore. The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions. BERTScore computes token-level semantic similarity using pre-trained contextual embeddings, such as RoBERTa-large (Liu...
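
To make the token-level matching idea concrete, the sketch below applies greedy cosine matching to hand-made toy vectors. It is not the paper's evaluation code: a real BERTScore run takes contextual embeddings from a model such as RoBERTa-large and may add IDF weighting or baseline rescaling, none of which is reproduced here. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors.

    Precision: each candidate token is matched to its most similar reference token.
    Recall:    each reference token is matched to its most similar candidate token.
    Assumes precision + recall > 0 (true for the toy data below).
    """
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-D "embeddings" (assumption: stand-ins for contextual token embeddings).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
```

In this toy case every candidate token has an exact counterpart in the reference, so precision is 1, while the third reference token is only partially covered, which pulls recall and F1 below 1.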

work page 2019
[4]

I'm sorry, but I can't perform image analysis

Clinical Content Accuracy (Abnormality Extraction): To evaluate whether the generated captions correctly describe the abnormal findings in the images, GPT-OSS was used to automatically extract the list of abnormalities mentioned in each VLM-generated caption. This yields a set of predicted findings for the image, extracted from the caption text. We then ...
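
The excerpt is cut off before it states how the predicted findings are compared with the reference findings. One plausible reading, offered only as an assumption, is a set-overlap precision and recall after light text normalization. The sketch below shows that reading; the function name and the exact-match rule are hypothetical, and the paper may use a softer, LLM-based matching instead.

```python
def finding_precision_recall(predicted, reference):
    """Set-overlap precision and recall between two lists of finding names.

    Assumption: findings count as matching only if they are identical after
    lower-casing and stripping whitespace.
    """
    pred = {f.strip().lower() for f in predicted}
    ref = {f.strip().lower() for f in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Invented example lists, for illustration only.
precision, recall = finding_precision_recall(
    ["Dental caries", "gingival recession", "calculus"],
    ["dental caries", "calculus", "tooth discoloration", "plaque"],
)
```

Here two of the three predicted findings appear in the reference (precision 2/3), and two of the four reference findings are recovered (recall 1/2).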

work page
[5]

Appendix Figures 1–3 present the Precision, Recall, and Macro F1 metrics for each class across datasets and models

Definitions for Multi-Label Classification and Analysis of Results. The detailed definition for 18-class multi-label classification is presented in Appendix Table 1. Appendix Figures 1–3 present the Precision, Recall, and Macro F1 metrics for each class across datasets and models. Several consistent patterns emerge across all datasets and VLMs:

work page
[6]

In the F1 heatmaps (Appendix Figure 3), all models exhibit notably lower F1 scores for C2, C3, C13, and C14. Examination of the Precision–Recall plots (Appendix Figures 1–2) reveals a common pattern: precision remains relatively high while recall is low, indicating frequent missed detections
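
The high-precision, low-recall pattern can be reproduced on a toy example. The following sketch computes per-class precision, recall, and F1 for multi-label predictions, plus a per-image exact-match rate. It assumes labels are given as sets of class indices and uses exact match in the subset-accuracy sense, which may differ from the per-category Exact Match shown in the figures.

```python
def per_class_prf(y_true, y_pred, n_classes):
    """Per-class (precision, recall, F1); y_true and y_pred are lists of sets of class indices."""
    results = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((precision, recall, f1))
    return results


def exact_match(y_true, y_pred):
    """Share of images whose predicted label set equals the reference label set."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy data with 3 classes and 4 images (invented for illustration).
y_true = [{0, 1}, {1}, {0}, {1, 2}]
y_pred = [{0}, {1}, {0, 2}, {2}]
scores = per_class_prf(y_true, y_pred, n_classes=3)
em = exact_match(y_true, y_pred)
```

In this toy data, class 1 is under-called (precision 1.0, recall 1/3) and class 2 is over-called (precision 0.5, recall 1.0), mirroring the two error modes in miniature.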

work page
[7]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

In contrast, C4 shows high recall but low precision in most scenarios, a pattern characteristic of false positives. These two complementary error modes, under-calling (C2/C3/C13/C14) and over-calling (C4), are consistently observed across MetaDent and all three subsets, reflecting class-specific visual ambiguities that current general-purpose VLM priors...

work page
[8]

C4 was predicted in approximately 73.3% of images, whereas only 11% actually contained C4

work page
[9]

C5 was predicted in about 37% of images, while only 4% were truly labeled C5

work page
[10]

A manual inspection of DS2 revealed two plausible contributing factors:

For C14, the most prevalent class in DS2, 44.5% of images were truly positive, yet fewer than 9% were predicted positive. A manual inspection of DS2 revealed two plausible contributing factors:

work page
[11]

DS2 images exhibit a noticeable reddish hue, which amplifies soft-tissue redness and mucosal highlights

Image tone shift. DS2 images exhibit a noticeable reddish hue, which amplifies soft-tissue redness and mucosal highlights. VLMs may erroneously associate these features with C4/C5-like patterns, leading to false positives

work page
[12]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Demographic differences. Variations in patient populations and imaging conditions across datasets may have introduced additional distributional shifts that hinder model generalization. Specifically, DS2 is mostly teenagers with mixed dentition, which differentiates it from the other two datasets. Finally, the widespread under-detection of C14 across dat...

work page
[13]

Is there visible calculus in the image?

Errors for Data Conversion We systematically analyzed the source of errors during secondary dataset generation (see Appendix Figure 4), and detailed definitions for each type of error are presented below. Appendix Table 2. Definition of errors for secondary dataset generation. Name Definition Annotation Error Errors arising from incorrect or ambiguous lab...

work page
[14]

Under-reasoning: The LLM fails to draw a valid conclusion that is supported by the evidence in the meta-label

work page
[15]

Examples:

Over-reasoning: The LLM generates a proposition that the provided evidence is insufficient to support and that happens to be wrong for the specific image. Examples:

work page
[16]

An image of periodontal surgery

The meta-label clearly mentioned “An image of periodontal surgery” and “bleeding of the gingiva”, but the LLM failed to categorize this image into “Oral wound”

work page
[17]

The small chalky white spots observed on the enamel surfaces are most consistent with which condition?

On an image with a white spot lesion, the model raises a question: “The small chalky white spots observed on the enamel surfaces are most consistent with which condition?”, and gives the answer as “Incipient dental caries (white-spot lesion)”. However, a white spot on the enamel surface alone does not support this proposition. Translation Error The origin...

work page
[18]

uncertain

Prompt. For each specific task in this study, a fixed prompt template was used consistently across all images to ensure procedural consistency and to minimize prompt-induced variability. For the generation of VQA pairs and multi-label classification labels, the temperature was set to 0 in order to enforce deterministic outputs and maximize reproducibilit...

work page
[19]

All questions must have exactly one correct answer, and the correct answer should be randomly distributed among the options across questions

work page
[22]

Do not include any question regarding the uncertain content (marked with low_confidence is True in the input text)

work page
[24]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...

work page
[25]

If there is limited abnormal information, you may design questions based on point 6

Single-choice questions and true/false questions must test different knowledge points (they cannot be the same or similar). If there is limited abnormal information, you may design questions based on point 6

work page
[26]

11" refers to

The output must be in valid JSON array, without any extra content. Note: When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar." For ranges, e.g., #12-#23 means "Upper Right Later...

work page
[27]

The question must focus on the region described in the text or using the region inferred from the description. For example, an image showing only the upper jaw should not have questions about the lower jaw (or the answer should be "Unknown" if the question is about the lower jaw, or the question should be about the visibility of the lower jaw)

work page
[28]

Any abnormalities or diseases not mentioned in the text (but within the described region) are assumed absent; constructing questions based on such deduced non-existent abnormalities is acceptable

work page
[29]

Any VQA content relying on such uncertain observations must be revised or removed

If any item in the diagnostic text is marked with low_confidence: true, it must not be used to generate or support any question or answer. Any VQA content relying on such uncertain observations must be revised or removed
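
A pre-filter that enforces the `low_confidence` rule before any question is generated could look roughly like the sketch below. The item structure (a list of dicts with a `text` field and an optional `low_confidence` flag) is an assumption for illustration; the actual meta-label schema is not shown in this excerpt.

```python
def usable_items(diagnostic_items):
    """Keep only items that are not flagged as uncertain.

    Assumption: each item is a dict; a missing "low_confidence" key is
    treated as a confident observation.
    """
    return [item for item in diagnostic_items if not item.get("low_confidence", False)]


# Invented example items, not taken from the dataset.
items = [
    {"text": "Calculus on the lingual surface of the lower anterior teeth", "low_confidence": False},
    {"text": "Possible incipient caries on 16", "low_confidence": True},
    {"text": "Gingival recession at 31"},
]
kept = usable_items(items)
```

Filtering up front means the uncertain observation never reaches the question generator, so nothing needs to be revised or removed afterwards for that reason.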

work page
[30]

If you are unsure, it is better to mark the VQA as valid

Be conservative in your judgement and revisions. If you are unsure, it is better to mark the VQA as valid. If the VQA content is largely correct with only minor issues, make minimal necessary changes to ensure accuracy and alignment with the text

work page
[31]

11" refers to

When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar." Interpret tooth ranges as inclusive sequences: ranges spanning opposite quadrants (e.g., 12–22 or 22–12) include all teeth cros...
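
The FDI convention for permanent teeth can be written down compactly: the first digit is the quadrant (1 upper right, 2 upper left, 3 lower left, 4 lower right) and the second digit counts teeth outward from the midline. The sketch below decodes a code and expands an inclusive range within one arch. The helper names are hypothetical, primary teeth (quadrants 5 to 8) are not handled, and returning ranges in arch order is a choice made for this example.

```python
# Permanent teeth in arch order, from the patient's right to the patient's left.
UPPER_ARCH = [18, 17, 16, 15, 14, 13, 12, 11, 21, 22, 23, 24, 25, 26, 27, 28]
LOWER_ARCH = [48, 47, 46, 45, 44, 43, 42, 41, 31, 32, 33, 34, 35, 36, 37, 38]

QUADRANTS = {1: "Upper Right", 2: "Upper Left", 3: "Lower Left", 4: "Lower Right"}
TEETH = {
    1: "Central Incisor", 2: "Lateral Incisor", 3: "Canine",
    4: "First Premolar", 5: "Second Premolar",
    6: "First Molar", 7: "Second Molar", 8: "Third Molar",
}


def _code(value):
    """Accept 11, "11", or "#11" and return the integer FDI code."""
    return int(str(value).lstrip("#"))


def tooth_name(value):
    """Decode an FDI code into quadrant plus tooth type."""
    quadrant, position = divmod(_code(value), 10)
    return f"{QUADRANTS[quadrant]} {TEETH[position]}"


def expand_range(start, end):
    """Inclusive list of teeth between two codes in the same arch, in arch order."""
    a, b = _code(start), _code(end)
    for arch in (UPPER_ARCH, LOWER_ARCH):
        if a in arch and b in arch:
            i, j = sorted((arch.index(a), arch.index(b)))
            return arch[i:j + 1]
    raise ValueError("start and end must be permanent teeth in the same arch")


name_11 = tooth_name("11")
name_18 = tooth_name("#18")
teeth_12_23 = expand_range("#12", "#23")
teeth_22_12 = expand_range(22, 12)
```

Ranges that cross the midline, such as 12 to 23, pass through both central incisors (11 and 21), which is why a plain numeric range over the codes would give the wrong teeth.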

work page
[32]

Do not include any treatment suggestions, or basic dental knowledge

Generate questions only for the content of the image. Do not include any treatment suggestions, or basic dental knowledge

work page
[33]

Be professional about the wording, carefully review before answering to avoid incorrect dental terminology or descriptions

work page
[34]

based on the description

The questions must be concise and precise, avoiding vague or ambiguous wording. No need to mention "based on the description" or "The description mentions"

work page
[35]

invalid": true/false,

All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...

work page
[36]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Examples of common failure modes. To complement the quantitative evaluation, we conducted a qualitative analysis to illustrate common failure modes of current VLMs in intraoral image understanding. While some AI-generated descriptions demonstrate correct recognit...

work page
[37]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Advantages of the Labeling Scheme Our annotation protocol was designed to reflect the nuanced reasoning and descriptive richness inherent in clinical dental practice. Rather than constraining annotators to a fixed set of categorical labels, we instructed annotators to describe intraoral findings using natural, unstructured language as they would in a clin...

work page
[38]

The potential impacts of the VLMs can be outlined across three key dimensions: clinical care, public health, and the evolution of AI technologies, as detailed below:

Application of Dental-specific VLMs. The clinical image vision-language data developed in this project can serve as a foundational infrastructure for advancing AI models in dentistry. The potential impacts of the VLMs can be outlined across three key dimensions: clinical care, public health, and the evolution of AI technologies, as detailed below:

work page
[39]

Enhance clinical workflow: VLMs can be integrated into digital clinical platforms in dental clinics or hospitals. Such integration may enable automatic semantic interpretation and structured documentation of intraoral images, reducing clinicians’ manual burden in composing electronic health records (EHRs). This improves both the efficiency and standardiz...

work page
[40]

Promote public health: In primary care settings or resource-constrained regions, dental-specific VLMs can function as intelligent preliminary screening tools. They can assist non-dental healthcare providers, such as general practitioners, community health workers, or public health personnel, in the initial identification and risk assessment of common ...

work page
[41]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Cross-modal representation learning: The fine-grained vision–language data pairs will significantly accelerate research in cross-modal representation learning within dentistry. On one hand, they enable semantic-level intelligent retrieval from dental image databases, thereby enhancing research and educational efficiency. On the other hand, these struc...

work page
[42]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Cross-LLM Validation for Potential Benchmark Bias. The detailed Precision and Recall results for the original multi-label Classification (CLS) and Image Captioning (CAP) tasks are summarized in Appendix Table 10. For the CLS task, all models exhibit moderate precision and recall, with clear performance variation across data sources. Overall, GPT-4o achiev...

work page
[43]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Statistical Analysis. Annotation Consistency Assessment. To assess the consistency between the two annotators, we evaluated inter-rater reliability using Cohen’s Kappa. Each image–label combination was treated as an independent sample. We first randomly selected 1...
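
For readers unfamiliar with the statistic, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below treats each image–label combination as one binary sample, as the excerpt describes. The toy ratings are invented, and the function is a plain illustration, not the authors' analysis script.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical ratings.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e the chance agreement implied by each rater's label frequencies.
    Undefined (division by zero) when p_e equals 1.
    """
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels)
    return (observed - expected) / (1 - expected)


# Each position is one image-label combination: 1 = label assigned, 0 = not assigned.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the raters agree on 8 of 10 samples (observed agreement 0.80), chance agreement is 0.52, and kappa is therefore about 0.58.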

work page
[44]

Null hypothesis (𝐻0): All models have equal probabilities of correct prediction

work page
[45]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Alternative hypothesis (𝐻1): At least one model has a different probability of correct prediction. For the VQA task, Cochran’s Q test revealed highly significant differences among the five models (Q = 225.99, p = 9.66e-48), indicating that the VLMs do not share the same probability of generating correct answers. To further investigate the pairwise diffe...
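
Cochran's Q is computed from a binary table with one row per question and one column per model (1 if the model answered correctly). A minimal sketch of the statistic follows, using an invented table with three models. Under the null hypothesis, Q is approximately chi-square distributed with k − 1 degrees of freedom (4 for five models), so the p-value can be taken from a chi-square survival function such as `scipy.stats.chi2.sf`, which is deliberately left out here to keep the example dependency-free.

```python
def cochrans_q(outcomes):
    """Cochran's Q statistic and its degrees of freedom.

    outcomes: list of rows, one per question; each row holds k binary entries,
    where 1 means that model answered the question correctly.
    Undefined when every row is all 0s or all 1s (zero denominator).
    """
    k = len(outcomes[0])
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    grand_total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    return numerator / denominator, k - 1


# Toy table: 6 questions (rows) x 3 models (columns), invented for illustration.
table = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]
q_stat, dof = cochrans_q(table)
```

For this table the column totals are 5, 4, and 1, which gives Q = 5.2 with 2 degrees of freedom. Because the same questions are answered by every model, a paired test of this kind is more appropriate than comparing independent proportions.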

work page 2009
[46]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv: 1907.11692. http://arxiv.org/abs/1907.11692. Rashid J, Qaisar BS, Faheem M, Akram A, Amin R ul, Hamid M. 2024. Mouth and oral disease classification using InceptionResNetV2 method. Multimedia Tools and Applications. 83(11):33903–33921. Sajid S. 2023. Oral Diseases. Kaggle. [acc...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[1] [1]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Data Collection and Filtering Candidate Mining from COYO-700M To retrieve intraoral clinical photographs from the large-scale web dataset COYO-700M, we fine- tuned a binary image classifier starting from ViT -L/16 (Dosovitskiy et al. 2020) from pre-trained weights. Positive samples were initially composed of DS1, DS2, MOD (Rashid et al. 2024) , and Oral D...

work page 2020

[2] [2]

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Evaluation Metric We evaluate VQA performance using the classification accuracy metric, computed as the proportion of questions answered correctly. Because the VQA benchmarks include different question formats, we calculate accuracy separately for multiple-choice questions and for binary true/false questions. In each case, accuracy is defined as: 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦...

work page

[3] [3]

The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions

Semantic Similarity (BERTScore): The semantic similarity between each model-generated and reference captions was measured using BERTScore. The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions. BERTScore computes token-level semantic similarity using pre-trained contextual embeddings, such as RoBERTa-large (Liu...


[4] [4]

I'm sorry, but I can't perform image analysis

Clinical Content Accuracy (Abnormality Extraction): To evaluate whether the generated captions correctly describe the abnormal findings in the images, GPT-OSS was used to automatically extract the list of abnormalities mentioned in each VLM-generated caption. This yields a set of predicted findings for the image, extracted from the caption text. We then ...
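Once a set of predicted findings has been extracted from a caption, it can be scored against a reference set. The sketch below assumes exact string matching between finding names and uses a hypothetical function name `set_precision_recall`; the matching procedure actually used in the paper is not visible in this excerpt and may differ.

```python
def set_precision_recall(predicted, reference):
    """Precision and recall of a predicted finding set against a reference set.

    Exact string matching is an assumption of this sketch.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Toy finding sets, not data from the paper
predicted_findings = {"dental caries", "calculus", "gingivitis"}
reference_findings = {"dental caries", "calculus", "tooth wear", "plaque"}
p, r = set_precision_recall(predicted_findings, reference_findings)
```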

[5] [5]

Definitions for Multi-Label Classification and Analysis of Results

The detailed definition for 18-class multi-label classification is presented in Appendix Table 1. Appendix Figures 1–3 present the Precision, Recall, and Macro F1 metrics for each class across datasets and models. Several consistent patterns emerge across all datasets and VLMs:


[6] [6]

In the F1 heatmaps (Appendix Figure 3), all models exhibit notably lower F1 scores for C2, C3, C13, and C14. Examination of the Precision–Recall plots (Appendix Figures 1–2) reveals a common pattern: precision remains relatively high while recall is low, indicating frequent missed detections

[7] [7]

In contrast, C4 shows high recall but low precision in most scenarios, a pattern that indicates frequent false positives. These two complementary error modes, under-calling (C2/C3/C13/C14) and over-calling (C4), are consistently observed across MetaDent and all three subsets, reflecting class-specific visual ambiguities that current general-purpose VLM priors...
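The two error modes can be read directly from per-class precision and recall. The following sketch computes both from binary multi-label matrices; the toy labels are invented to reproduce the qualitative pattern (one under-called class, one over-called class) and are not the paper's data. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, class_names):
    """Per-class precision, recall, and F1 for binary multi-label data.

    y_true and y_pred are lists of rows; each row holds one 0/1 entry per class.
    """
    metrics = {}
    for j, name in enumerate(class_names):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] and p[j])
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t[j] and p[j])
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] and not p[j])
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Toy labels (assumed): column 0 plays the role of an under-called class,
# column 1 the role of an over-called class.
classes = ["C14", "C4"]
y_true = [[1, 0], [1, 0], [1, 0], [1, 1], [0, 0], [0, 0]]
y_pred = [[1, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 0]]
m = per_class_metrics(y_true, y_pred, classes)
```

Here the first class has perfect precision but low recall (missed detections), while the second has perfect recall but low precision (false positives).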


[8] [8]

C4 was predicted in approximately 73.3% of images, whereas only 11% actually contained C4


[9] [9]

C5 was predicted in about 37% of images, while only 4% were truly labeled C5

[10] [10]

For C14, the most prevalent class in DS2, 44.5% of images were truly positive, yet fewer than 9% were predicted positive. A manual inspection of DS2 revealed two plausible contributing factors:

[11] [11]

Image tone shift. DS2 images exhibit a noticeable reddish hue, which amplifies soft-tissue redness and mucosal highlights. VLMs may erroneously associate these features with C4/C5-like patterns, leading to false positives.

[12] [12]

Demographic differences. Variations in patient populations and imaging conditions across datasets may have introduced additional distributional shifts that hinder model generalization. Specifically, DS2 consists mostly of teenagers with mixed dentition, which differentiates it from the other two datasets. Finally, the widespread under-detection of C14 across dat...


[13] [13]

Is there visible calculus in the image?

Errors for Data Conversion

We systematically analyzed the sources of errors during secondary dataset generation (see Appendix Figure 4); detailed definitions for each type of error are presented below.

Appendix Table 2. Definition of errors for secondary dataset generation.

Name: Annotation Error
Definition: Errors arising from incorrect or ambiguous lab...


[14] [14]

Under-reasoning: The LLM fails to draw a valid conclusion that is supported by the evidence in the meta-label

[15] [15]

Over-reasoning: The LLM generates a proposition that the provided evidence is insufficient to support and that happens to be wrong for the specific image. Examples:

[16] [16]

The meta-label clearly mentions “An image of periodontal surgery” and “bleeding of the gingiva”, but the LLM failed to categorize this image as “Oral wound”.

[17] [17]

On an image with a white spot lesion, the model raises the question “The small chalky white spots observed on the enamel surfaces are most consistent with which condition?” and gives the answer “Incipient dental caries (white-spot lesion)”. However, a white spot on the enamel surface alone does not support this proposition.

Translation Error The origin...


[18] [18]

uncertain

Prompt

For each specific task in this study, a fixed prompt template was used across all images to ensure procedural consistency and to minimize prompt-induced variability. For the generation of VQA pairs and multi-label classification labels, the temperature was set to 0 in order to enforce deterministic outputs and maximize reproducibilit...


[19] [19]

All questions must have exactly one correct answer, and the correct answer should be randomly distributed among the options across questions


[20] [22]

Do not include any question regarding the uncertain content (marked with low_confidence is True in the input text)

[21] [24]

All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...

[22] [25]

Single-choice questions and true/false questions must test different knowledge points (they cannot be the same or similar). If there is limited abnormal information, you may design questions based on point 6

[23] [26]

The output must be in valid JSON array, without any extra content. ``` Note: When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar.". For ranges, e.g., #12-#23 means "Upper Right Later...


[24] [27]

The question must focus on the region described in the text or using the region inferred from the description. For example, an image showing only the upper jaw should not have questions about the lower jaw (or the answer should be "Unknown" if the question is about the lower jaw, or the question should be about the visibility of the lower jaw)


[25] [28]

Any abnormalities/disease not mentioned in the text (but within the described region) are assumed absent, constructing questions based on such deduced non-existent abnormalities is acceptable

[26] [29]

If any item in the diagnostic text is marked with low_confidence: true, it must not be used to generate or support any question or answer. Any VQA content relying on such uncertain observations must be revised or removed

[27] [30]

Be conservative in your judgement and revisions. If you are unsure, it is better to mark the VQA as valid. If the VQA content is largely correct with only minor issues, make minimal necessary changes to ensure accuracy and alignment with the text

[28] [31]

When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar.". Interpret tooth ranges as inclusive sequences: ranges spanning opposite quadrants (e.g., 12–22 or 22–12) include all teeth cros...

[29] [32]

Generate questions only for the content of the image. Do not include any treatment suggestions, or basic dental knowledge


[30] [33]

Be professional about the wording, carefully review before answering to avoid incorrect dental terminology or descriptions

[31] [34]

The questions must be concise and precise, avoiding vague or ambiguous wording. No need to mention "based on the description" or "The description mentions"


[32] [35]

invalid": true/false,

All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...
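The FDI range convention stated in the prompt notes (tooth ranges are inclusive sequences, and a range such as 12–22 or 22–12 crosses the midline) can be made concrete with a short sketch. It assumes permanent dentition only and returns teeth in arch order from the patient's right to left; the constant and function names are hypothetical and are not part of the prompts.

```python
# Permanent teeth in arch order (patient's right to left), FDI notation.
UPPER_ARCH = [18, 17, 16, 15, 14, 13, 12, 11, 21, 22, 23, 24, 25, 26, 27, 28]
LOWER_ARCH = [48, 47, 46, 45, 44, 43, 42, 41, 31, 32, 33, 34, 35, 36, 37, 38]

def expand_fdi_range(start, end):
    """Expand an inclusive FDI tooth range within a single arch.

    The order of the endpoints does not matter: 12-22 and 22-12 give
    the same sequence, which crosses the midline.
    """
    for arch in (UPPER_ARCH, LOWER_ARCH):
        if start in arch and end in arch:
            i, j = sorted((arch.index(start), arch.index(end)))
            return arch[i:j + 1]
    raise ValueError(f"Teeth {start} and {end} are not on the same arch")

across_midline = expand_fdi_range(12, 22)
reversed_range = expand_fdi_range(22, 12)
wider_range = expand_fdi_range(12, 23)
```

Listing each arch explicitly avoids arithmetic on FDI codes, which are quadrant-plus-position labels rather than consecutive integers.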

[33] [36]

Examples of common failure modes

To complement the quantitative evaluation, we conducted a qualitative analysis to illustrate common failure modes of current VLMs in intraoral image understanding. While some AI-generated descriptions demonstrate correct recognit...

[34] [37]

Advantages of the Labeling Scheme

Our annotation protocol was designed to reflect the nuanced reasoning and descriptive richness inherent in clinical dental practice. Rather than constraining annotators to a fixed set of categorical labels, we instructed annotators to describe intraoral findings using natural, unstructured language as they would in a clin...

[35] [38]

Application of Dental-specific VLMs

The clinical image vision-language data developed in this project can serve as a foundational infrastructure for advancing AI models in dentistry. The potential impacts of the VLMs can be outlined across three key dimensions (clinical care, public health, and the evolution of AI technologies), as detailed below:


[36] [39]

Enhance clinical workflow: VLMs can be integrated into digital clinical platforms in dental clinics or hospitals. Such integration may enable automatic semantic interpretation and structured documentation of intraoral images, reducing clinicians’ manual burden in composing electronic health records (EHRs). This improves both the efficiency and standardiz...


[37] [40]

Promote public health: In primary care settings or resource-constrained regions, dental-specific VLMs can function as intelligent preliminary screening tools. They can assist non-dental healthcare providers (such as general practitioners, community health workers, or public health personnel) in the initial identification and risk assessment of common ...

[38] [41]

Cross-modal representation learning: The fine-grained vision–language data pairs will significantly accelerate research in cross-modal representation learning within dentistry. On one hand, they enable semantic-level intelligent retrieval from dental image databases, thereby enhancing research and educational efficiency. On the other hand, these struc...

[39] [42]

Cross-LLM Validation for Potential Benchmark Bias

The detailed Precision and Recall results for the original multi-label classification (CLS) and image captioning (CAP) tasks are summarized in Appendix Table 10. For the CLS task, all models exhibit moderate precision and recall, with clear performance variation across data sources. Overall, GPT-4o achiev...

[40] [43]

Statistical Analysis

Annotation Consistency Assessment

To assess the consistency between the two annotators, we evaluated inter-rater reliability using Cohen’s Kappa. Each image–label combination was treated as an independent sample. We first randomly selected 1...
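Cohen's Kappa for two annotators, with each image–label combination treated as one binary sample, can be sketched as follows. The toy ratings and the function name `cohens_kappa` are assumptions for illustration and do not reproduce the paper's annotations.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same samples.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's marginals.
    """
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_e = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Toy ratings (assumed): 1 = label present, 0 = label absent
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the raters agree on 8 of 10 samples and chance agreement is 0.5, giving a kappa of 0.6.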


[41] [44]

Null hypothesis (𝐻0): All models have equal probabilities of correct prediction

[42] [45]

Alternative hypothesis (𝐻1): At least one model has a different probability of correct prediction. For the VQA task, Cochran’s Q test revealed highly significant differences among the five models (Q = 225.99, p = 9.66e-48), indicating that the VLMs do not share the same probability of generating correct answers. To further investigate the pairwise diffe...
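The Cochran's Q statistic used for this comparison can be computed from a binary correctness matrix with one row per question and one column per model. The sketch below uses the standard formula and an invented 6 × 3 matrix; it does not reproduce the reported Q = 225.99, and the function name `cochrans_q` is hypothetical.

```python
def cochrans_q(outcomes):
    """Cochran's Q statistic for k related binary samples.

    `outcomes` is a list of rows (one per question); each row holds k
    entries in {0, 1}, one per model (1 = answered correctly).
    Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2)),
    with column totals C_j, row totals R_i, and grand total N.
    """
    k = len(outcomes[0])
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    n = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator

# Toy correctness matrix (assumed): 6 questions, 3 models
toy_outcomes = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
]
q_stat = cochrans_q(toy_outcomes)
```

Under the null hypothesis, Q approximately follows a chi-square distribution with k − 1 degrees of freedom, so the p-value is the upper tail probability of that distribution at the observed Q.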

[43] [46]

RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692. http://arxiv.org/abs/1907.11692.
Rashid J, Qaisar BS, Faheem M, Akram A, Amin R ul, Hamid M. 2024. Mouth and oral disease classification using InceptionResNetV2 method. Multimedia Tools and Applications. 83(11):33903–33921.
Sajid S. 2023. Oral Diseases. Kaggle. [acc...
