
arxiv: 2604.14866 · v1 · submitted 2026-04-16 · 💻 cs.CV · cs.AI

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords dentistry · vision-language models · clinical image annotation · intraoral photography · visual question answering · image captioning · multi-label classification

The pith

A new annotated dataset of dental images reveals that state-of-the-art vision-language models struggle with fine-grained understanding of intraoral scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a large-scale dataset of dental images with a semi-structured annotation scheme to support vision-language models in clinical dentistry. It shows that even advanced models only reach moderate accuracy on tasks like visual question answering and produce inconsistent descriptions in image captioning. By curating 60,669 images and annotating 2,588 of them, the work creates benchmarks that highlight the need for better fine-grained analysis in oral health imaging. This resource is released publicly to enable further development of such models.

Core claim

MetaDent supplies a dentistry image dataset, a semi-structured annotation framework capturing hierarchical clinical details through high-level summaries and point-by-point abnormality descriptions, and LLM-derived benchmarks including 15K VQA pairs and an 18-class multi-label classification set. Evaluations confirm that state-of-the-art VLMs achieve only moderate accuracy and generate incomplete captions for intraoral photographs.

What carries the argument

The semi-structured annotation framework that combines high-level image summaries with free-text point-by-point descriptions of abnormalities, enabling scalable and task-agnostic representations for clinical evaluation.
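
One way to picture this two-level scheme is as a small record type. The sketch below is illustrative only: the class and field names (`MetaLabel`, `Finding`, `summary`, `findings`) are hypothetical, and the `low_confidence` flag is borrowed from the prompt excerpts quoted further down the page, with its storage format assumed rather than taken from the released data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    """One free-text abnormality point (hypothetical field names)."""
    description: str
    # Flag name appears in the paper's prompt excerpts; its exact format is assumed.
    low_confidence: bool = False


@dataclass
class MetaLabel:
    """High-level image summary plus point-by-point findings for one image."""
    image_id: str
    summary: str
    findings: List[Finding] = field(default_factory=list)

    def usable_findings(self) -> List[Finding]:
        """Return only the findings that are not marked as uncertain."""
        return [f for f in self.findings if not f.low_confidence]


# Invented example record, not taken from the dataset.
label = MetaLabel(
    image_id="example-0001",
    summary="Frontal intraoral photograph of the anterior teeth.",
    findings=[
        Finding("Visible calculus on the lower anterior teeth."),
        Finding("Possible white spot on the enamel surface.", low_confidence=True),
    ],
)
print([f.description for f in label.usable_findings()])
```

Keeping the findings as free text while attaching only a minimal flag is what lets the same record feed VQA generation, classification labels, and caption references without re-annotation.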

Load-bearing premise

The semi-structured annotation framework and subsequent LLM-driven benchmark creation reliably preserve clinically nuanced dental details without semantic drift or selection bias.

What would settle it

Human experts re-labeling a subset of the images and finding substantial differences from the provided annotations, or new VLMs achieving high accuracy and complete captions on the derived tasks.
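
Such a re-labeling check could be quantified with Cohen's kappa, which the paper's appendix also names for its annotator consistency assessment. The following is a minimal sketch, assuming each image-label combination is reduced to a binary present/absent decision; the function name and the toy vectors are invented for illustration and are not the authors' data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data: 1 = abnormality present, 0 = absent (hypothetical values).
provided = [1, 0, 1, 1, 0, 0, 1, 0]
relabeled = [1, 0, 1, 0, 0, 0, 1, 1]
kappa = cohen_kappa(provided, relabeled)
print(round(kappa, 3))  # 0.5
```

In the toy data the raters agree on 6 of 8 items (0.75) against a chance level of 0.5, which gives kappa = 0.5. What counts as a "substantial difference" is a threshold the reader would have to choose; the source does not state one.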

Figures

Figures reproduced from arXiv: 2604.14866 by Chun-Xiao Jin, Cui Huang, Gui-Song Xia, James Kit Hon Tsoi, Jia-Min Wu, Meng-Xun Li, Wen-Hui Deng, Yue Han, Zhi-Xing Wu.

Figure 1: Data processing pipeline. (A) Composition and distribution of the dataset across 3 sources, with the Internet-scraped subset (Data Source 3) being the largest. Image features were extracted using DINOv3 (Siméoni et al. 2025) and projected into 2D using principal component analysis; darker, larger dots represent labeled samples, while lighter and smaller dots indicate the remaining images in the collection. …
Figure 2: Dataset statistics. (A) Overall image area distribution: Image size (in log10 pixels²) shows that most images fall within the 5.0 to 6.0 range. (B) Aspect ratio by data source: DS3 exhibits the widest distribution of aspect ratios, followed by DS1, whereas DS2 displays a more uniform aspect ratio profile. (C) Image content composition: Approximately 80% of images contain human subjects. Of the entire datas…
Figure 3: Performance of 5 vision-language models on MetaDent across 3 tasks. (A) Visual question answering (VQA) accuracy on multiple-choice questions (MCQ). (B) VQA accuracy on true/false questions (TFQ). (C) F1 for the 18-class multilabel classification task. (D) Exact Match accuracy for multilabel predictions. (E) Image captioning semantic similarity reported as BERTScore-F1 against reference captions. (F) Conte…
Figure 4: Category-level performance for the multilabel classification task. (A) Exact Match accuracy across 18 categories for 5 models. (B) Distribution of samples per category for the classification task. (C) Names of the 18 categories (see Appendix …
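
The two multilabel scores named in the Figure 3 caption, F1 and Exact Match, can be computed as in the sketch below. This is not the authors' evaluation code: the caption does not say how F1 is averaged, so macro averaging over classes is an assumption here, and the function names and three-class toy data are hypothetical.

```python
from typing import List, Sequence, Set


def exact_match(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Fraction of images whose predicted label set equals the true label set."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]],
             classes: List[str]) -> float:
    """Unweighted mean of per-class F1 (averaging scheme assumed, see lead-in)."""
    scores = []
    for c in classes:
        tp = sum(c in t and c in p for t, p in zip(y_true, y_pred))
        fp = sum(c not in t and c in p for t, p in zip(y_true, y_pred))
        fn = sum(c in t and c not in p for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted positives scores 0 in this sketch.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three of the class codes (values invented).
classes = ["C1", "C2", "C4"]
truth = [{"C1"}, {"C2", "C4"}, set()]
preds = [{"C1"}, {"C4"}, {"C4"}]

em = exact_match(truth, preds)
f1 = macro_f1(truth, preds, classes)
print(round(em, 3), round(f1, 3))  # 0.333 0.556
```

Exact Match is the stricter of the two: the second image gets no credit for its correct C4 prediction because C2 was missed, while macro F1 still rewards the partial hit.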

Original abstract

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaDent, a dataset of 60,669 dental images (with 2,588 meta-annotated via high-level summaries plus free-text abnormality points) from clinical, public, and web sources. LLMs are used to derive ~15K VQA pairs and an 18-class multi-label classification benchmark from these annotations, with human review and error analysis claimed to confirm fidelity. State-of-the-art VLMs are then benchmarked on VQA, classification, and captioning tasks, with the central finding that even advanced models achieve only moderate accuracy and generate inconsistent or incomplete descriptions of intraoral scenes.

Significance. If the LLM-converted benchmarks are shown to be faithful clinical proxies, MetaDent would provide a valuable, publicly released resource for advancing VLMs in the underexplored domain of intraoral photography. The semi-structured annotation approach and empirical demonstration of current model limitations on fine-grained dental understanding offer a useful baseline and could accelerate development of clinically relevant vision-language systems. The release of the dataset, annotations, and tools supports reproducibility.

major comments (2)
  1. [Abstract] Abstract: The assertion that 'human review and error analysis' justifies that the LLM-driven conversion 'reliably preserves fidelity and semantic accuracy' is unsupported by any quantitative metrics (e.g., exact-match rate, Cohen's kappa, or per-task error breakdown on a held-out validation subset). This is load-bearing for the headline claim that moderate VLM accuracy reflects model limitations rather than label noise or selection bias in the 2,588-image annotated subset.
  2. [Annotation and benchmark generation] Section describing the annotation and benchmark generation process: No details are provided on the exact annotation protocols for the 2,588 images, inter-rater agreement statistics, data splits, or the specific human validation procedure and error analysis results. Without these, it is impossible to quantify potential semantic drift or selection effects in the derived 15K VQA pairs and 18-class labels.
minor comments (2)
  1. The abstract and methods would benefit from explicit listing of the VQA question templates and classification label definitions to aid reproducibility.
  2. Consider including additional baselines (e.g., non-VLM classifiers or simpler vision-only models) in the classification and captioning evaluations to better contextualize the reported moderate accuracies.
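
The quantitative support requested in the first major comment could take roughly the following form. This is a minimal sketch under stated assumptions: the record layout (`task`, `llm`, `human`, `error_type`) and all values are hypothetical, while the error-type names follow the categories defined in the paper's appendix excerpts.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, TypedDict


class ReviewRecord(TypedDict):
    task: str                  # e.g. "vqa" or "cls"
    llm: str                   # answer or label produced by the LLM conversion
    human: str                 # answer or label after human review
    error_type: Optional[str]  # reviewer-assigned cause when the two differ


def exact_match_rate(records: List[ReviewRecord]) -> float:
    """Share of LLM-derived items that the human reviewer left unchanged."""
    return sum(r["llm"] == r["human"] for r in records) / len(records)


def error_breakdown(records: List[ReviewRecord]) -> Dict[str, Dict[str, int]]:
    """Count reviewer-assigned error types per task, for changed items only."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for r in records:
        if r["llm"] != r["human"]:
            counts[r["task"]][r["error_type"] or "Unspecified"] += 1
    return {task: dict(c) for task, c in counts.items()}


# Invented held-out review sample.
sample: List[ReviewRecord] = [
    {"task": "vqa", "llm": "B", "human": "B", "error_type": None},
    {"task": "vqa", "llm": "A", "human": "C", "error_type": "Over-reasoning"},
    {"task": "vqa", "llm": "True", "human": "True", "error_type": None},
    {"task": "cls", "llm": "C4", "human": "C4", "error_type": None},
    {"task": "cls", "llm": "none", "human": "Oral wound", "error_type": "Under-reasoning"},
]

rate = exact_match_rate(sample)
breakdown = error_breakdown(sample)
print(rate, breakdown)
```

Reporting both numbers matters for the referee's point: the match rate bounds how much label noise the benchmarks can contain, and the breakdown shows whether that noise comes from the annotations or from the LLM conversion step.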

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, which highlights important areas for improving the clarity and rigor of our presentation of the MetaDent dataset and benchmarks. We agree that the current manuscript would benefit from expanded details on the annotation and validation processes to better support our claims. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract: The assertion that 'human review and error analysis' justifies that the LLM-driven conversion 'reliably preserves fidelity and semantic accuracy' is unsupported by any quantitative metrics (e.g., exact-match rate, Cohen's kappa, or per-task error breakdown on a held-out validation subset). This is load-bearing for the headline claim that moderate VLM accuracy reflects model limitations rather than label noise or selection bias in the 2,588-image annotated subset.

    Authors: We agree that the abstract's reference to human review and error analysis would be more convincing with explicit quantitative support. In the revised manuscript, we will update the abstract to reference specific validation metrics and add a new subsection detailing the human review process. This will include exact-match rates between LLM-generated VQA pairs/labels and human corrections, a per-task error breakdown, and any applicable agreement statistics from the 2,588-image subset. These additions will strengthen the argument that observed VLM limitations reflect genuine model shortcomings rather than annotation issues. revision: yes

  2. Referee: [Annotation and benchmark generation] Section describing the annotation and benchmark generation process: No details are provided on the exact annotation protocols for the 2,588 images, inter-rater agreement statistics, data splits, or the specific human validation procedure and error analysis results. Without these, it is impossible to quantify potential semantic drift or selection effects in the derived 15K VQA pairs and 18-class labels.

    Authors: We acknowledge that the current manuscript omits these methodological specifics, which are necessary for full reproducibility and assessment of potential biases. We will substantially expand the annotation and benchmark generation section to include: (1) the exact protocols and guidelines used for meta-annotation of the 2,588 images (high-level summaries plus free-text abnormality points); (2) data split details for the annotated subset; (3) the full human validation procedure, including how reviewers interacted with LLM outputs; and (4) results from the error analysis, with quantitative metrics where available (e.g., agreement rates and error categorizations). If inter-rater statistics are limited due to the single-reviewer design for efficiency, we will explicitly note this and justify the approach. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical dataset construction and benchmarking

full rationale

The paper presents an empirical workflow of image collection (60,669 dental images), semi-structured meta-annotation on a 2,588-image subset, LLM-assisted conversion to ~15K VQA pairs and 18-class labels, human validation, and VLM evaluation on captioning/VQA/classification tasks. No equations, derivations, fitted parameters, or predictions appear anywhere in the described chain. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim (VLMs achieve only moderate accuracy on intraoral scenes) is supported by direct benchmarking results rather than reducing to any input by construction. This matches the default expectation for non-circular empirical dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical modeling; the annotation scheme and LLM use are methodological choices rather than axioms or free parameters. No invented entities.

pith-pipeline@v0.9.0 · 5624 in / 1039 out tokens · 35733 ms · 2026-05-10T11:51:50.247154+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Data Collection and Filtering. Candidate Mining from COYO-700M. To retrieve intraoral clinical photographs from the large-scale web dataset COYO-700M, we fine-tuned a binary image classifier initialized from pre-trained ViT-L/16 weights (Dosovitskiy et al. 2020). Positive samples were initially composed of DS1, DS2, MOD (Rashid et al. 2024), and Oral D...

  2. [2]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Evaluation Metric. We evaluate VQA performance using the classification accuracy metric, computed as the proportion of questions answered correctly. Because the VQA benchmarks include different question formats, we calculate accuracy separately for multiple-choice questions and for binary true/false questions. In each case, accuracy is defined as: 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦...

  3. [3]

    The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions

    Semantic Similarity (BERTScore): The semantic similarity between each model-generated caption and its reference caption was measured using BERTScore. The reference captions are generated by GPT-OSS to serve as high-quality ground truth descriptions. BERTScore computes token-level semantic similarity using pre-trained contextual embeddings, such as RoBERTa-large (Liu...

  4. [4]

    I'm sorry, but I can't perform image analysis

    Clinical Content Accuracy (Abnormality Extraction): To evaluate whether the generated captions correctly describe the abnormal findings in the images, GPT-OSS was used to automatically extract the list of abnormalities mentioned in each VLM-generated caption. This yields a set of predicted findings for the image, extracted from the caption text. We then ...

  5. [5]

    Appendix Figures 1–3 present the Precision, Recall, and Macro F1 metrics for each class across datasets and models

    Definitions for Multi-Label Classification and Analysis of Results. The detailed definition for 18-class multi-label classification is presented in Appendix Table 1. Appendix Figures 1–3 present the Precision, Recall, and Macro F1 metrics for each class across datasets and models. Several consistent patterns emerge across all datasets and VLMs:

  6. [6]

    In the F1 heatmaps (Appendix Figure 3), all models exhibit notably lower F1 scores for C2, C3, C13, and C14. Examination of the Precision–Recall plots (Appendix Figures 1–2) reveals a common pattern: precision remains relatively high while recall is low, indicating frequent missed detections

  7. [7]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    In contrast, C4 shows high recall but low precision in most scenarios, a pattern characteristic of false positives. These two complementary error modes, under-calling (C2/C3/C13/C14) and over-calling (C4), are consistently observed across MetaDent and all three subsets, reflecting class-specific visual ambiguities that current general-purpose VLM priors...

  8. [8]

    C4 was predicted in approximately 73.3% of images, whereas only 11% actually contained C4

  9. [9]

    C5 was predicted in about 37% of images, while only 4% were truly labeled C5

  10. [10]

    A manual inspection of DS2 revealed two plausible contributing factors:

    For C14, the most prevalent class in DS2, 44.5% of images were truly positive, yet fewer than 9% were predicted positive. A manual inspection of DS2 revealed two plausible contributing factors:

  11. [11]

    DS2 images exhibit a noticeable reddish hue, which amplifies soft-tissue redness and mucosal highlights

    Image tone shift. DS2 images exhibit a noticeable reddish hue, which amplifies soft-tissue redness and mucosal highlights. VLMs may erroneously associate these features with C4/C5-like patterns, leading to false positives.

  12. [12]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Demographic differences. Variations in patient populations and imaging conditions across datasets may have introduced additional distributional shifts that hinder model generalization. Specifically, DS2 is mostly teenagers with mixed dentition, which differentiates it from the other two datasets. Finally, the widespread under-detection of C14 across dat...

  13. [13]

    Is there visible calculus in the image?

    Errors for Data Conversion We systematically analyzed the source of errors during secondary dataset generation (see Appendix Figure 4), and detailed definitions for each type of error are presented below. Appendix Table 2. Definition of errors for secondary dataset generation. Name Definition Annotation Error Errors arising from incorrect or ambiguous lab...

  14. [14]

    Under-reasoning: The LLM fails to draw a valid conclusion that is supported by the evidence in the meta-label

  15. [15]

    Examples:

    Over-reasoning: The LLM generates a proposition that the provided evidence does not sufficiently support and that happens to be wrong for the specific image. Examples:

  16. [16]

    An image of periodontal surgery

    The meta-label clearly mentioned “An image of periodontal surgery” and “bleeding of the gingiva”, but the LLM failed to categorize this image into “Oral wound”.

  17. [17]

    The small chalky white spots observed on the enamel surfaces are most consistent with which condition?

    On an image with a white spot lesion, the model raises the question “The small chalky white spots observed on the enamel surfaces are most consistent with which condition?” and gives the answer “Incipient dental caries (white-spot lesion)”. However, a white spot on the enamel surface alone does not support this proposition. Translation Error The origin...

  18. [18]

    uncertain

    Prompt. For each specific task in this study, a fixed prompt template was used consistently across all images to ensure procedural consistency and to minimize prompt-induced variability. For the generation of VQA pairs and multi-label classification labels, the temperature was set to 0 in order to enforce deterministic outputs and maximize reproducibilit...

  19. [19]

    All questions must have exactly one correct answer, and the correct answer should be randomly distributed among the options across questions

  20. [22]

    Do not include any question regarding the uncertain content (marked with low_confidence is True in the input text)

  21. [24]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...

  22. [25]

    If there is limited abnormal information, you may design questions based on point 6

    Single-choice questions and true/false questions must test different knowledge points (they cannot be the same or similar). If there is limited abnormal information, you may design questions based on point 6

  23. [26]

    11" refers to

    The output must be a valid JSON array, without any extra content. Note: When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar". For ranges, e.g., #12-#23 means "Upper Right Later...

  24. [27]

    The question must focus on the region described in the text or using the region inferred from the description. For example, an image showing only the upper jaw should not have questions about the lower jaw (or the answer should be "Unknown" if the question is about the lower jaw, or the question should be about the visibility of the lower jaw)

  25. [28]

    Any abnormalities or diseases not mentioned in the text (but within the described region) are assumed absent; constructing questions based on such deduced non-existent abnormalities is acceptable

  26. [29]

    Any VQA content relying on such uncertain observations must be revised or removed

    If any item in the diagnostic text is marked with low_confidence: true, it must not be used to generate or support any question or answer. Any VQA content relying on such uncertain observations must be revised or removed

  27. [30]

    If you are unsure, it is better to mark the VQA as valid

    Be conservative in your judgement and revisions. If you are unsure, it is better to mark the VQA as valid. If the VQA content is largely correct with only minor issues, make minimal necessary changes to ensure accuracy and alignment with the text

  28. [31]

    11" refers to

    When describing tooth positions, the FDI notation is used by default. Sometimes the # symbol is omitted in the description. For example, "11" refers to "Upper Right Central Incisor," and "18" refers to "Upper Right Third Molar". Interpret tooth ranges as inclusive sequences: ranges spanning opposite quadrants (e.g., 12–22 or 22–12) include all teeth cros...

  29. [32]

    Do not include any treatment suggestions, or basic dental knowledge

    Generate questions only for the content of the image. Do not include any treatment suggestions, or basic dental knowledge

  30. [33]

    Be professional about the wording, carefully review before answering to avoid incorrect dental terminology or descriptions

  31. [34]

    based on the description

    The questions must be concise and precise, avoiding vague or ambiguous wording. No need to mention "based on the description" or "The description mentions"

  32. [35]

    invalid": true/false,

    All questions should be generated strictly within the region described in the given text. Any findings not mentioned in the description should be considered normal within that region and may be used to construct questions and answers. For non-existent abnormality questions, you may choose from conditions such as dental caries, non-carious tooth defects, ...

  33. [36]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Examples of common failure modes. To complement the quantitative evaluation, we conducted a qualitative analysis to illustrate common failure modes of current VLMs in intraoral image understanding. While some AI-generated descriptions demonstrate correct recognit...

  34. [37]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Advantages of the Labeling Scheme. Our annotation protocol was designed to reflect the nuanced reasoning and descriptive richness inherent in clinical dental practice. Rather than constraining annotators to a fixed set of categorical labels, we instructed annotators to describe intraoral findings using natural, unstructured language as they would in a clin...

  35. [38]

    The potential impacts of the VLMs can be outlined across three key dimensions (clinical care, public health, and the evolution of AI technologies), as detailed below:

    Application of Dental-specific VLMs. The clinical image vision-language data developed in this project can serve as a foundational infrastructure for advancing AI models in dentistry. The potential impacts of the VLMs can be outlined across three key dimensions (clinical care, public health, and the evolution of AI technologies), as detailed below:

  36. [39]

    Enhance clinical workflow: VLMs can be integrated into digital clinical platforms in dental clinics or hospitals. Such integration may enable automatic semantic interpretation and structured documentation of intraoral images, reducing clinicians’ manual burden in composing electronic health records (EHRs). This improves both the efficiency and standardiz...

  37. [40]

    Promote public health: In primary care settings or resource-constrained regions, dental-specific VLMs can function as an intelligent preliminary screening tool. They can assist non-dental healthcare providers (such as general practitioners, community health workers, or public health personnel) in the initial identification and risk assessment of common ...

  38. [41]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Cross-modal representation learning: The fine-grained vision–language data pairs will significantly accelerate research in cross-modal representation learning within dentistry. On one hand, they enable semantic-level intelligent retrieval from dental image databases, thereby enhancing research and educational efficiency. On the other hand, these struc...

  39. [42]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Cross-LLM Validation for Potential Benchmark Bias. The detailed Precision and Recall results for the original multi-label classification (CLS) and image captioning (CAP) tasks are summarized in Appendix Table 10. For the CLS task, all models exhibit moderate precision and recall, with clear performance variation across data sources. Overall, GPT-4o achiev...

  40. [43]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Statistical Analysis. Annotation Consistency Assessment. To assess the consistency between the two annotators, we evaluated inter-rater reliability using Cohen’s Kappa. Each image–label combination was treated as an independent sample. We first randomly selected 1...

  41. [44]

    Null hypothesis (𝐻0): All models have equal probabilities of correct prediction

  42. [45]

    MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

    Alternative hypothesis (𝐻1): At least one model has a different probability of correct prediction. For the VQA task, Cochran’s Q test revealed highly significant differences among the five models (Q = 225.99, p = 9.66e-48), indicating that the VLMs do not share the same probability of generating correct answers. To further investigate the pairwise diffe...

  43. [46]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692. http://arxiv.org/abs/1907.11692.
    Rashid J, Qaisar BS, Faheem M, Akram A, Amin R ul, Hamid M. 2024. Mouth and oral disease classification using InceptionResNetV2 method. Multimedia Tools and Applications. 83(11):33903–33921.
    Sajid S. 2023. Oral Diseases. Kaggle. [acc...
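
The Cochran's Q statistic quoted in the statistical-analysis excerpt above compares several models on the same set of binary outcomes. The following is a minimal sketch of the standard formula, assuming a questions-by-models table of 0/1 correctness; the function name and the toy table are hypothetical and do not reproduce the paper's data.

```python
from typing import List, Tuple


def cochrans_q(table: List[List[int]]) -> Tuple[float, int]:
    """Cochran's Q for a rows-by-columns table of 0/1 outcomes.

    Rows are items (e.g. questions), columns are treatments (e.g. models).
    Returns (Q, degrees_of_freedom). Under the null hypothesis that all
    columns share the same success probability, Q is approximately
    chi-square distributed with k - 1 degrees of freedom.
    """
    k = len(table[0])
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("need a rectangular table with at least two columns")
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    # Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2))
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when every row is all 0s or all 1s")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1


# Toy table: 6 questions (rows) answered by 3 models (columns), invented values.
outcomes = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
]
q_stat, dof = cochrans_q(outcomes)
print(q_stat, dof)  # 6.5 2
```

Rows on which all models agree (all correct or all wrong) contribute nothing to the denominator beyond cancelling out, so the test is driven entirely by questions where the models disagree. A p-value would come from the chi-square survival function at `q_stat` with `dof` degrees of freedom, which this sketch leaves to a statistics library.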