MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry
Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3
The pith
A new annotated dataset of dental images reveals that state-of-the-art vision-language models struggle with fine-grained understanding of intraoral scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaDent supplies a dentistry image dataset, a semi-structured annotation framework capturing hierarchical clinical details through high-level summaries and point-by-point abnormality descriptions, and LLM-derived benchmarks including 15K VQA pairs and an 18-class multi-label classification set. Evaluations confirm that state-of-the-art VLMs achieve only moderate accuracy and generate incomplete captions for intraoral photographs.
What carries the argument
The semi-structured annotation framework that combines high-level image summaries with free-text point-by-point descriptions of abnormalities, enabling scalable and task-agnostic representations for clinical evaluation.
Load-bearing premise
The semi-structured annotation framework and subsequent LLM-driven benchmark creation reliably preserve clinically nuanced dental details without semantic drift or selection bias.
What would settle it
Human experts re-labeling a subset of the images and finding substantial differences from the provided annotations, or new VLMs achieving high accuracy and complete captions on the derived tasks.
Original abstract
Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MetaDent, a dataset of 60,669 dental images (with 2,588 meta-annotated via high-level summaries plus free-text abnormality points) from clinical, public, and web sources. LLMs are used to derive ~15K VQA pairs and an 18-class multi-label classification benchmark from these annotations, with human review and error analysis claimed to confirm fidelity. State-of-the-art VLMs are then benchmarked on VQA, classification, and captioning tasks, with the central finding that even advanced models achieve only moderate accuracy and generate inconsistent or incomplete descriptions of intraoral scenes.
Significance. If the LLM-converted benchmarks are shown to be faithful clinical proxies, MetaDent would provide a valuable, publicly released resource for advancing VLMs in the underexplored domain of intraoral photography. The semi-structured annotation approach and empirical demonstration of current model limitations on fine-grained dental understanding offer a useful baseline and could accelerate development of clinically relevant vision-language systems. The release of the dataset, annotations, and tools supports reproducibility.
major comments (2)
- [Abstract] Abstract: The assertion that 'human review and error analysis' justifies that the LLM-driven conversion 'reliably preserves fidelity and semantic accuracy' is unsupported by any quantitative metrics (e.g., exact-match rate, Cohen's kappa, or per-task error breakdown on a held-out validation subset). This is load-bearing for the headline claim that moderate VLM accuracy reflects model limitations rather than label noise or selection bias in the 2,588-image annotated subset.
- [Annotation and benchmark generation] Section describing the annotation and benchmark generation process: No details are provided on the exact annotation protocols for the 2,588 images, inter-rater agreement statistics, data splits, or the specific human validation procedure and error analysis results. Without these, it is impossible to quantify potential semantic drift or selection effects in the derived 15K VQA pairs and 18-class labels.
minor comments (2)
- The abstract and methods would benefit from explicit listing of the VQA question templates and classification label definitions to aid reproducibility.
- Consider including additional baselines (e.g., non-VLM classifiers or simpler vision-only models) in the classification and captioning evaluations to better contextualize the reported moderate accuracies.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review, which highlights important areas for improving the clarity and rigor of our presentation of the MetaDent dataset and benchmarks. We agree that the current manuscript would benefit from expanded details on the annotation and validation processes to better support our claims. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract: The assertion that 'human review and error analysis' justifies that the LLM-driven conversion 'reliably preserves fidelity and semantic accuracy' is unsupported by any quantitative metrics (e.g., exact-match rate, Cohen's kappa, or per-task error breakdown on a held-out validation subset). This is load-bearing for the headline claim that moderate VLM accuracy reflects model limitations rather than label noise or selection bias in the 2,588-image annotated subset.
Authors: We agree that the abstract's reference to human review and error analysis would be more convincing with explicit quantitative support. In the revised manuscript, we will update the abstract to reference specific validation metrics and add a new subsection detailing the human review process. This will include exact-match rates between LLM-generated VQA pairs/labels and human corrections, a per-task error breakdown, and any applicable agreement statistics from the 2,588-image subset. These additions will strengthen the argument that observed VLM limitations reflect genuine model shortcomings rather than annotation issues. revision: yes
- Referee: [Annotation and benchmark generation] Section describing the annotation and benchmark generation process: No details are provided on the exact annotation protocols for the 2,588 images, inter-rater agreement statistics, data splits, or the specific human validation procedure and error analysis results. Without these, it is impossible to quantify potential semantic drift or selection effects in the derived 15K VQA pairs and 18-class labels.
Authors: We acknowledge that the current manuscript omits these methodological specifics, which are necessary for full reproducibility and assessment of potential biases. We will substantially expand the annotation and benchmark generation section to include: (1) the exact protocols and guidelines used for meta-annotation of the 2,588 images (high-level summaries plus free-text abnormality points); (2) data split details for the annotated subset; (3) the full human validation procedure, including how reviewers interacted with LLM outputs; and (4) results from the error analysis, with quantitative metrics where available (e.g., agreement rates and error categorizations). If inter-rater statistics are limited due to the single-reviewer design for efficiency, we will explicitly note this and justify the approach. revision: yes
Circularity Check
No significant circularity: purely empirical dataset construction and benchmarking
full rationale
The paper presents an empirical workflow of image collection (60,669 dental images), semi-structured meta-annotation on a 2,588-image subset, LLM-assisted conversion to ~15K VQA pairs and 18-class labels, human validation, and VLM evaluation on captioning/VQA/classification tasks. No equations, derivations, fitted parameters, or predictions appear anywhere in the described chain. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim (VLMs achieve only moderate accuracy on intraoral scenes) is supported by direct benchmarking results rather than reducing to any input by construction. This matches the default expectation for non-circular empirical dataset papers.
Reference graph
Works this paper leans on
-
[1]
Data Collection and Filtering: Candidate Mining from COYO-700M. To retrieve intraoral clinical photographs from the large-scale web dataset COYO-700M, we fine-tuned a binary image classifier initialized from pre-trained ViT-L/16 weights (Dosovitskiy et al. 2020). Positive samples were initially composed of DS1, DS2, MOD (Rashid et al. 2024), and Oral D...
-
[2]
Evaluation Metric. We evaluate VQA performance using classification accuracy, computed as the proportion of questions answered correctly. Because the VQA benchmarks include different question formats, we calculate accuracy separately for multiple-choice questions and for binary true/false questions. In each case, accuracy is defined as: 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦...
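The per-format accuracy described in this snippet can be sketched as follows. This is a minimal illustration, assuming each evaluated question is a record with the hypothetical fields `format`, `predicted`, and `answer`; these names and the toy records are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return {question_format: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["format"]] += 1
        if record["predicted"] == record["answer"]:
            correct[record["format"]] += 1
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


# Hypothetical toy records, not data from the benchmark.
records = [
    {"format": "multiple_choice", "predicted": "B", "answer": "B"},
    {"format": "multiple_choice", "predicted": "A", "answer": "C"},
    {"format": "true_false", "predicted": "True", "answer": "True"},
    {"format": "true_false", "predicted": "False", "answer": "False"},
]
scores = accuracy_by_format(records)
```

Keeping the two formats separate matters because chance-level accuracy differs between them: a true/false question has a 50% guessing baseline, while a multiple-choice question with more options has a lower one.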
-
[3]
Semantic Similarity (BERTScore): The semantic similarity between each model-generated caption and its reference caption was measured using BERTScore. The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions. BERTScore computes token-level semantic similarity using pre-trained contextual embeddings, such as RoBERTa-large (Liu...
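The token-level matching behind BERTScore can be sketched in plain Python. This is a simplified illustration, assuming token embeddings are already available as lists of floats; the toy 2-D vectors below are hypothetical stand-ins for contextual embeddings, and the sketch omits the optional IDF weighting and baseline rescaling of the full metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_score(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy embeddings: two candidate tokens, one reference token.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f = greedy_match_score(candidate, reference)
```

In this toy case the candidate contains one token unrelated to the reference, so precision drops to 0.5 while recall stays at 1.0, which mirrors how an over-long caption is penalized.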
-
[4]
I'm sorry, but I can't perform image analysis
Clinical Content Accuracy (Abnormality Extraction): To evaluate whether the generated captions correctly describe the abnormal findings in the images, GPT-OSS was used to automatically extract the list of abnormalities mentioned in each VLM-generated caption. This yields a set of predicted findings for the image, extracted from the caption text. We then ...
-
[5]
Definitions for Multi-Label Classification and Analysis of Results. The detailed definitions for the 18-class multi-label classification are presented in Appendix Table 1. Appendix Figures 1–3 present the Precision, Recall, and Macro F1 metrics for each class across datasets and models. Several consistent patterns emerge across all datasets and VLMs:
-
[6]
In the F1 heatmaps (Appendix Figure 3), all models exhibit notably lower F1 scores for C2, C3, C13, and C14. Examination of the Precision–Recall plots (Appendix Figures 1–2) reveals a common pattern: precision remains relatively high while recall is low, indicating frequent missed detections
-
[7]
In contrast, C4 shows high recall but low precision in most scenarios, a pattern characteristic of false positives. These two complementary error modes, under-calling (C2/C3/C13/C14) and over-calling (C4), are consistently observed across MetaDent and all three subsets, reflecting class-specific visual ambiguities that current general-purpose VLM priors...
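The two error modes described above can be made concrete with a per-class precision/recall/F1 computation. This is a minimal sketch, assuming ground truth and predictions are given as one set of class names per image; the four toy images are hypothetical and only the class names C4 and C14 are borrowed from the text.

```python
def per_class_metrics(y_true, y_pred, classes):
    """y_true and y_pred are lists of label sets, one set per image."""
    metrics = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Hypothetical toy data: C14 is under-called, C4 is over-called.
y_true = [{"C14"}, {"C14"}, {"C4"}, set()]
y_pred = [{"C14", "C4"}, {"C4"}, {"C4"}, {"C4"}]
m = per_class_metrics(y_true, y_pred, ["C4", "C14"])
macro = macro_f1(m)
```

Here C14 ends up with precision 1.0 and recall 0.5 (missed detections), while C4 has recall 1.0 and precision 0.25 (false positives), the same qualitative signatures the snippet attributes to the real benchmark classes.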
-
[8]
C4 was predicted in approximately 73.3% of images, whereas only 11% actually contained C4
-
[9]
C5 was predicted in about 37% of images, while only 4% were truly labeled C5
-
[10]
For C14, the most prevalent class in DS2, 44.5% of images were truly positive, yet fewer than 9% were predicted positive. A manual inspection of DS2 revealed two plausible contributing factors:
-
[11]
Image tone shift. DS2 images exhibit a noticeable reddish hue, which amplifies soft-tissue redness and mucosal highlights. VLMs may erroneously associate these features with C4/C5-like patterns, leading to false positives.
-
[12]
Demographic differences. Variations in patient populations and imaging conditions across datasets may have introduced additional distributional shifts that hinder model generalization. Specifically, DS2 consists mostly of teenagers with mixed dentition, which differentiates it from the other two datasets. Finally, the widespread under-detection of C14 across dat...
-
[13]
Is there visible calculus in the image?
Errors for Data Conversion. We systematically analyzed the sources of errors during secondary dataset generation (see Appendix Figure 4); detailed definitions for each type of error are presented below.
Appendix Table 2. Definition of errors for secondary dataset generation.
Name: Annotation Error. Definition: Errors arising from incorrect or ambiguous lab...
-
[14]
Under-reasoning: The LLM fails to draw a valid conclusion that is supported by the evidence in the meta-label
-
[16]
The meta-label clearly mentioned “An image of periodontal surgery” and “bleeding of the gingiva”, but the LLM failed to categorize this image as “Oral wound”.
-
[17]
On an image with a white spot lesion, the model raises the question “The small chalky white spots observed on the enamel surfaces are most consistent with which condition?” and gives the answer “Incipient dental caries (white-spot lesion)”. However, a white spot on the enamel surface alone does not support this proposition.
Translation Error: The origin...
-
[18]
Prompt. For each specific task in this study, a fixed prompt template was used consistently across all images to ensure procedural consistency and to minimize prompt-induced variability. For the generation of VQA pairs and multi-label classification labels, the temperature was set to 0 in order to enforce deterministic outputs and maximize reproducibilit...
-
[19]
All questions must have exactly one correct answer, and the correct answer should be randomly distributed among the options across questions
-
[22]
Do not include any question regarding the uncertain content (marked with low_confidence is True in the input text)
-
[24]
All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...
-
[25]
Single-choice questions and true/false questions must test different knowledge points (they cannot be the same or similar). If there is limited abnormal information, you may design questions based on point 6
-
[26]
The output must be in valid JSON array, without any extra content. Note: When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar.". For ranges, e.g., #12-#23 means "Upper Right Later...
-
[27]
The question must focus on the region described in the text or using the region inferred from the description. For example, an image showing only the upper jaw should not have questions about the lower jaw (or the answer should be "Unknown" if the question is about the lower jaw, or the question should be about the visibility of the lower jaw)
-
[28]
Any abnormalities/disease not mentioned in the text (but within the described region) are assumed absent; constructing questions based on such deduced non-existent abnormalities is acceptable
-
[29]
If any item in the diagnostic text is marked with low_confidence: true, it must not be used to generate or support any question or answer. Any VQA content relying on such uncertain observations must be revised or removed
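The low-confidence rule can be checked mechanically before or after LLM review. The sketch below is illustrative only: it assumes each meta-label finding carries an `id` and a `low_confidence` flag, and that each VQA item lists the findings it relies on in a hypothetical `supporting_ids` field. Only the `low_confidence` name appears in the source; the rest of the schema and the toy data are assumptions.

```python
def split_vqa_by_support(vqa_items, findings):
    """Separate VQA items that rely on low-confidence findings from the rest."""
    low_ids = {f["id"] for f in findings if f.get("low_confidence", False)}
    kept, flagged = [], []
    for item in vqa_items:
        if low_ids & set(item["supporting_ids"]):
            flagged.append(item)  # must be revised or removed
        else:
            kept.append(item)
    return kept, flagged


# Hypothetical toy meta-label and VQA items.
findings = [
    {"id": 1, "text": "calculus on lower anterior teeth", "low_confidence": False},
    {"id": 2, "text": "possible caries on 16", "low_confidence": True},
]
vqa_items = [
    {"question": "Is there visible calculus in the image?", "answer": "True", "supporting_ids": [1]},
    {"question": "Is there caries on tooth 16?", "answer": "True", "supporting_ids": [2]},
]
kept, flagged = split_vqa_by_support(vqa_items, findings)
```

A deterministic filter of this kind would only complement the prompt-based check, since it depends on the link between questions and findings being recorded explicitly.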
-
[30]
Be conservative in your judgement and revisions. If you are unsure, it is better to mark the VQA as valid. If the VQA content is largely correct with only minor issues, make minimal necessary changes to ensure accuracy and alignment with the text
-
[31]
When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar.". Interpret tooth ranges as inclusive sequences: ranges spanning opposite quadrants (e.g., 12–22 or 22–12) include all teeth cros...
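The inclusive range rule for FDI notation can be sketched as a lookup along the dental arch. This is a minimal illustration for permanent teeth only, assuming a range is expanded as the contiguous run of teeth between its two endpoints within the same arch (so that 12–22 crosses the midline); the function names are hypothetical and the truncated part of the prompt may specify further cases.

```python
import re

# Permanent teeth in arch order, from the patient's right to the patient's left.
UPPER = [10 + i for i in range(8, 0, -1)] + [20 + i for i in range(1, 9)]  # 18..11, 21..28
LOWER = [40 + i for i in range(8, 0, -1)] + [30 + i for i in range(1, 9)]  # 48..41, 31..38


def expand_fdi_range(start, end):
    """Return every tooth between two FDI codes (inclusive) along one arch."""
    for arch in (UPPER, LOWER):
        if start in arch and end in arch:
            i, j = sorted((arch.index(start), arch.index(end)))
            return arch[i:j + 1]
    raise ValueError(f"{start} and {end} are not valid teeth in the same arch")


def parse_fdi_range(text):
    """Parse strings such as '#12-#23' or '12–22'; the '#' symbol is optional."""
    first, second = re.findall(r"\d{2}", text)
    return expand_fdi_range(int(first), int(second))


teeth = parse_fdi_range("#12-#23")
```

Sorting the two indices makes the expansion independent of the order in which the endpoints are written, so 12–22 and 22–12 yield the same teeth.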
-
[32]
Generate questions only for the content of the image. Do not include any treatment suggestions, or basic dental knowledge
-
[33]
Be professional about the wording, carefully review before answering to avoid incorrect dental terminology or descriptions
-
[34]
The questions must be concise and precise, avoiding vague or ambiguous wording. No need to mention "based on the description" or "The description mentions"
-
[35]
All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...
-
[36]
Examples of common failure modes. To complement the quantitative evaluation, we conducted a qualitative analysis to illustrate common failure modes of current VLMs in intraoral image understanding. While some AI-generated descriptions demonstrate correct recognit...
-
[37]
Advantages of the Labeling Scheme. Our annotation protocol was designed to reflect the nuanced reasoning and descriptive richness inherent in clinical dental practice. Rather than constraining annotators to a fixed set of categorical labels, we instructed annotators to describe intraoral findings using natural, unstructured language as they would in a clin...
-
[38]
Application of Dental-specific VLMs. The clinical image vision-language data developed in this project can serve as a foundational infrastructure for advancing AI models in dentistry. The potential impacts of the VLMs can be outlined across three key dimensions (clinical care, public health, and the evolution of AI technologies), as detailed below:
-
[39]
Enhance clinical workflow: VLMs can be integrated into digital clinical platforms in dental clinics or hospitals. Such integration may enable automatic semantic interpretation and structured documentation of intraoral images, reducing clinicians’ manual burden in composing electronic health records (EHRs). This improves both the efficiency and standardiz...
-
[40]
Promote public health: In primary care settings or resource-constrained regions, dental-specific VLMs can function as an intelligent preliminary screening tool. They can assist non-dental healthcare providers, such as general practitioners, community health workers, or public health personnel, in the initial identification and risk assessment of common ...
-
[41]
Cross-modal representation learning: The fine-grained vision–language data pairs will significantly accelerate research in cross-modal representation learning within dentistry. On one hand, they enable semantic-level intelligent retrieval from dental image databases, thereby enhancing research and educational efficiency. On the other hand, these struc...
-
[42]
Cross-LLM Validation for Potential Benchmark Bias. The detailed Precision and Recall results for the original multi-label Classification (CLS) and Image Captioning (CAP) tasks are summarized in Appendix Table 10. For the CLS task, all models exhibit moderate precision and recall, with clear performance variation across data sources. Overall, GPT-4o achiev...
-
[43]
Statistical Analysis: Annotation Consistency Assessment. To assess the consistency between the two annotators, we evaluated inter-rater reliability using Cohen’s Kappa. Each image–label combination was treated as an independent sample. We first randomly selected 1...
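Cohen's Kappa for two annotators can be computed directly from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below assumes each image–label combination is encoded as 1 (label present) or 0 (label absent) for each annotator; the ten toy samples are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of samples where both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings for ten image–label combinations.
annotator_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In the toy data the annotators agree on 8 of 10 samples, but half of that agreement is expected by chance, so kappa is 0.6 rather than 0.8.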
-
[44]
Null hypothesis (𝐻0): All models have equal probabilities of correct prediction
-
[45]
Alternative hypothesis (𝐻1): At least one model has a different probability of correct prediction. For the VQA task, Cochran’s Q test revealed highly significant differences among the five models (Q = 225.99, p = 9.66e-48), indicating that the VLMs do not share the same probability of generating correct answers. To further investigate the pairwise diffe...
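The Cochran's Q statistic used for this comparison can be computed from a binary question-by-model correctness matrix. This is a minimal sketch with a hypothetical toy matrix (four questions, three models), not the paper's data; under the null hypothesis, Q is compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of models.

```python
def cochrans_q(outcomes):
    """outcomes[i][j] is 1 if model j answered question i correctly, else 0."""
    k = len(outcomes[0])
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]  # per model
    row_totals = [sum(row) for row in outcomes]  # per question
    n = sum(row_totals)
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when all models agree on every question")
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom


# Hypothetical correctness matrix: rows are questions, columns are models.
outcomes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
q = cochrans_q(outcomes)
```

Questions that every model gets right (or wrong) contribute nothing to the denominator, which reflects that the test draws its evidence only from questions where the models disagree.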
-
[46]
RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692. http://arxiv.org/abs/1907.11692.
Rashid J, Qaisar BS, Faheem M, Akram A, Amin R ul, Hamid M. 2024. Mouth and oral disease classification using InceptionResNetV2 method. Multimedia Tools and Applications. 83(11):33903–33921.
Sajid S. 2023. Oral Diseases. Kaggle. [acc...