pith. machine review for the scientific record.

arxiv: 2604.19323 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.CV


Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset


Pith reviewed 2026-05-10 02:17 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords concept bottleneck models · dermoscopy · melanoma · rough set theory · dataset inconsistency · 7-point checklist · interpretability · accuracy ceiling

The pith

Inconsistent concept profiles in Derm7pt impose a 92.1% accuracy ceiling on hard concept bottleneck models for melanoma detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies rough set theory to the Derm7pt dataset to measure how often identical combinations of seven dermoscopic criteria receive conflicting benign or malignant labels. It identifies 50 inconsistent profiles out of 305 unique ones, affecting 306 images or 30.3% of the total. This inconsistency creates an unavoidable limit for any concept bottleneck model that routes predictions through hard concept assignments, capping accuracy at 92.1% regardless of the neural network used. The authors also remove the conflicting images to form a fully consistent subset of 705 images and report baseline CBM performance on it. This work shows that concept-level data quality directly constrains the reliability of interpretable medical AI systems.

Core claim

Among the 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy, for CBMs that operate exclusively with hard concepts. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set under symmetric filtering, with a concept accuracy of 0.70.

What carries the argument

Rough-set analysis of the 305 concept profiles to detect inconsistencies where the same seven-criteria combination maps to both benign and malignant diagnoses.
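The indiscernibility step is mechanical: group images by their concept vector and flag any group whose label set is not a singleton. A minimal sketch (the function name and the 3-concept toy profiles are illustrative, not the authors' code or the real 7-criteria table):

```python
from collections import defaultdict

def find_inconsistent_profiles(concepts, labels):
    """Group images by their concept profile and return the profiles
    whose images carry more than one diagnosis label."""
    groups = defaultdict(list)
    for profile, label in zip(concepts, labels):
        groups[tuple(profile)].append(label)
    # A profile is inconsistent when its label set has size > 1.
    return {p: ls for p, ls in groups.items() if len(set(ls)) > 1}

# Toy example: four images, two sharing a profile with conflicting labels.
concepts = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = malignant
conflicts = find_inconsistent_profiles(concepts, labels)
# → {(1, 0, 0): [0, 1]}
```

On Derm7pt this grouping is exact and parameter-free, which is why the resulting counts (50 inconsistent profiles, 306 images) do not depend on any model choice.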

If this is right

  • CBMs using hard concepts on the full Derm7pt cannot exceed 92.1% accuracy due to the conflicting profiles.
  • Symmetric removal of boundary images produces a 705-image consistent subset with no accuracy ceiling.
  • On the filtered set, EfficientNet-B5 reaches 0.90 label accuracy and 0.70 concept accuracy under symmetric filtering.
  • Asymmetric filtering allows EfficientNet-B7 to reach 0.82 label F1 and 0.70 concept accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other medical imaging datasets that rely on multi-criteria concept labels may contain similar hidden inconsistencies that limit interpretable models.
  • Annotation protocols could be improved by focusing extra review on the dermoscopic features that most frequently generate boundary conflicts.
  • Replacing hard concept assignments with soft or probabilistic ones might allow models to surpass the 92.1% ceiling on the original data.
  • The cleaned Derm7pt+ subset provides a reproducible testbed for comparing concept-based methods without data-quality confounds.
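The third bullet above can be made concrete with a toy pair. Two images whose concept probabilities threshold to the same hard vector force any hard-concept classifier to give both the same label; a head reading the soft scores need not. All numbers here are hypothetical:

```python
# Hypothetical per-concept probabilities for two images.
soft_a = [0.55, 0.10, 0.05]
soft_b = [0.95, 0.40, 0.30]

# Hard bottleneck: threshold at 0.5. Both collapse to [1, 0, 0],
# so any function of the hard vector must classify them identically.
hard_a = [int(s > 0.5) for s in soft_a]
hard_b = [int(s > 0.5) for s in soft_b]
assert hard_a == hard_b == [1, 0, 0]

# A linear head on the *soft* scores can still separate the pair.
score_a = sum(soft_a) - 1.0  # ≈ -0.30 → benign
score_b = sum(soft_b) - 1.0  # ≈  0.65 → malignant
pred_a, pred_b = score_a > 0, score_b > 0
```

This is exactly the degree of freedom that hard argmax bottlenecks discard, and why soft bottlenecks are not bound by the 92.1% ceiling on the unfiltered data.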

Load-bearing premise

The concept annotations supplied with Derm7pt are treated as accurate ground truth, making observed inconsistencies an intrinsic property of the clinical data.

What would settle it

Re-annotating the 306 images that belong to inconsistent profiles and finding zero remaining conflicts would show the inconsistencies were annotation artifacts rather than fixed properties of the dataset.

Figures

Figures reproduced from arXiv: 2604.19323 by Gonzalo Nápoles, Isel Grau, Yamisleydi Salgueiro.

Figure 1
Figure 1: Rough set partition of the Derm7pt dataset. (a) Stacked bar chart showing the consistent (blue) and inconsistent (red) split at the concept-profile level and the image level. (b) Histogram of the conflict ratio γk across all 50 inconsistent profiles, with kernel density estimate. The dashed vertical line marks maximum ambiguity γk = 0.5.
Figure 2
Figure 2: Melanoma prevalence per concept value with 95% Wilson confidence intervals, where p̂ is the estimated melanoma proportion for each concept value. Color encodes risk tier: high (p̂ > 0.40), moderate (0.20 < p̂ ≤ 0.40), low (p̂ ≤ 0.20). The dotted vertical line marks the dataset-level melanoma prevalence (24.9%). Right-axis labels group concept values by their parent attribute in the set of dermoscopic concepts.
Figure 3
Figure 3: Joint melanoma rate heatmaps. Each cell shows the estimated probability P̂(melanoma) and the sample count n for images that carry both concept values. Cells with n < 3 are omitted. (a) Blue-whitish veil × dots and globules. (b) Pigment network × streaks. The color scale is shared across both panels.
Figure 4
Figure 4: Inconsistent pair from profile P1 (γ1 = 0.50). Despite identical concept descriptions, the left image is labeled as seborrheic keratosis (d = 0), whereas the right image is labeled as melanoma (d = 1).
Figure 5
Figure 5: Filtering strategy comparison. (a) Theoretical accuracy ceiling for each configuration. The dashed line marks the majority-class baseline (75.1%). (b) Post-filter dataset composition by class.
Figure 6
Figure 6: Architecture of our hard CBM with stop-gradient. An input image x is passed through a backbone φθ to produce a feature vector z. Separate concept heads {gc} predict logits for each dermoscopic concept. These logits are converted into a single hard, discrete concept vector v̂ via argmax and one-hot encoding. To prevent label leakage, the gradient flow is detached at the bottleneck.
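The 95% Wilson score intervals used in Figure 2 follow the standard closed form (Wilson, 1927); this is a generic sketch, not the authors' code, and `wilson_interval` is an illustrative name:

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n.

    z = 1.96 gives the usual 95% level. Unlike the Wald interval,
    the bounds never fall outside [0, 1], even for k = 0 or k = n.
    """
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# e.g. 5 melanomas observed among 20 lesions carrying a concept value
lo, hi = wilson_interval(5, 20)
```

The Wilson interval is the natural choice here because many concept values in Derm7pt have small counts, where Wald intervals are badly miscalibrated.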
Original abstract

Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper applies rough-set theory to the Derm7pt dataset to quantify concept-level inconsistencies arising from the 7 dermoscopic criteria of the 7-point checklist. It reports that 50 of 305 unique concept profiles (16.4%) are inconsistent, spanning 306 images (30.3% of the data) and imposing a 92.1% theoretical accuracy ceiling on any hard-concept CBM, independent of backbone or training. The authors introduce symmetric and asymmetric filtering to produce the consistent Derm7pt+ subset (705 images), then evaluate hard CBMs across 19 architectures from EfficientNet, DenseNet, ResNet, and Wide ResNet families, with EfficientNet-B5 and B7 achieving the strongest label and concept metrics on the filtered data.

Significance. If the inconsistency counts and derived ceiling hold, the work identifies a concrete, data-intrinsic limit on CBM performance for this clinically relevant benchmark and supplies a cleaned, reproducible subset together with baselines across 19 architectures. The exhaustive enumeration of profiles, direct application of indiscernibility, and provision of parameter-free accuracy bounds are strengths that support falsifiable claims about interpretability bottlenecks in medical imaging.

minor comments (2)
  1. The abstract and results sections refer to 'symmetric removal of all boundary-region images' and 'symmetric filtering' without an explicit definition or pseudocode for how boundary regions are identified from the rough-set lower/upper approximations; this notation should be clarified with a short example or equation in the methods.
  2. Table or supplementary material listing the 50 inconsistent profiles (or at least the distribution of conflict severity) would strengthen verifiability of the 306-image count and 92.1% ceiling; currently the numbers are stated but not broken down by profile.
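On minor comment 1, one way to pin the symmetric strategy down is operational: keep exactly the images whose profile is consistent (the rough-set positive region), dropping every image in a conflicting profile regardless of its label. This is a sketch under that reading of the abstract, not the authors' implementation:

```python
from collections import defaultdict

def symmetric_filter(concepts, labels):
    """Return indices of images kept under symmetric filtering:
    every image whose concept profile maps to conflicting labels
    is removed, leaving only the rough-set positive region."""
    by_profile = defaultdict(list)
    for i, profile in enumerate(concepts):
        by_profile[tuple(profile)].append(i)
    keep = []
    for idxs in by_profile.values():
        if len({labels[i] for i in idxs}) == 1:  # consistent profile
            keep.extend(idxs)
    return sorted(keep)

# Toy table: profile (1, 0) is conflicting, so both of its images go.
keep = symmetric_filter([(1, 0), (1, 0), (0, 1)], [0, 1, 0])
# → [2]
```

An asymmetric variant would presumably retain some images from conflicting profiles (e.g. the majority-label ones), which is consistent with Figure 5 showing different post-filter compositions; the paper's exact rule should still be stated explicitly, as the comment requests.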

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, as well as the recommendation for minor revision. The assessment correctly identifies the core contributions: the rough-set enumeration of inconsistent concept profiles in Derm7pt, the resulting 92.1% accuracy ceiling for hard CBMs, and the provision of the consistent Derm7pt+ subset together with baselines across 19 architectures.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper computes inconsistency counts and the accuracy ceiling by direct application of standard rough-set indiscernibility to the raw Derm7pt annotation table: identical 7-concept vectors are grouped, label conflicts within each group are tallied, and the ceiling is obtained from the per-group majority-vote bound. These steps use only the supplied data and the definition of inconsistency; no parameters are fitted, no equations are self-referential, and the claimed independence from backbone or training strategy follows logically from the hard-concept CBM premise. No load-bearing premise rests on self-citation chains or imported uniqueness theorems. The derivation is therefore self-contained and does not lean on external benchmarks.
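The per-group majority-vote bound described in the rationale is a one-liner once profiles are grouped. A sketch on toy data (not the Derm7pt table): with 705 + 306 = 1011 images, a 92.1% ceiling implies roughly 80 minority-label images that no hard-concept classifier can recover, which is consistent with the reported figures.

```python
from collections import Counter, defaultdict

def accuracy_ceiling(concepts, labels):
    """Upper bound on accuracy for any classifier that sees only the
    hard concept profile: within each profile, at best the majority
    label is predicted, so minority-label images are always missed."""
    by_profile = defaultdict(list)
    for profile, label in zip(concepts, labels):
        by_profile[tuple(profile)].append(label)
    correct = sum(max(Counter(ls).values()) for ls in by_profile.values())
    return correct / len(labels)

# Profile (0,) holds labels [0, 0, 1]: majority vote gets 2 of 3.
# Profile (1,) is consistent: 1 of 1. Ceiling = 3/4.
ceiling = accuracy_ceiling([(0,), (0,), (0,), (1,)], [0, 0, 1, 1])
# → 0.75
```

Because the bound involves no fitted quantities, it holds for every backbone and training regime, exactly as the circularity check notes.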

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the standard rough-set framework for detecting inconsistent decision rules and on the assumption that the seven dermoscopic criteria fully capture the clinically relevant concepts.

axioms (1)
  • standard math — Rough set theory identifies inconsistent objects by comparing lower and upper approximations of decision classes.
    Invoked to locate the 50 conflicting profiles among the 305 unique combinations.

pith-pipeline@v0.9.0 · 5655 in / 1404 out tokens · 58536 ms · 2026-05-10T02:17:04.626020+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    & Doshi-Velez, F

    Havasi, M., Parbhoo, S. & Doshi-Velez, F. Addressing leakage in concept bottleneck models. InAdvances in Neural Information Processing Systems, vol. 35, 23386–23397 (2022)

  2. [2]

    Promises and pitfalls of black-box concept learning models.arXiv preprint arXiv:2106.13314, 2021

    Mahinpei, A., Clark, J., Lage, I., Doshi-Velez, F. & Pan, W. Promises and pitfalls of black-box concept learning models. InICML 2021 Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI, DOI: 10.48550/arXiv.2106.13314 (2021)

  3. [3]

    & Lee, N

    Shin, S., Jo, Y ., Ahn, S. & Lee, N. A closer look at the intervention procedure of concept bottleneck models. InProceedings of the 40th International Conference on Machine Learning, vol. 202 ofProceedings of Machine Learning Research, 31504–31520, DOI: 10.48550/arXiv.2302.14260 (PMLR, 2023)

  4. [4]

    InAdvances in Neural Information Processing Systems, vol

    Espinosa Zarlenga, M.et al.Concept embedding models: Beyond the accuracy-explainability trade-off. InAdvances in Neural Information Processing Systems, vol. 35, 21400–21413, DOI: 10.48550/arXiv.2209.09056 (2022)

  5. [5]

    Post-hoc concept bottleneck models

    Yuksekgonul, M., Wang, M. & Zou, J. Post-hoc concept bottleneck models. InThe Eleventh International Conference on Learning Representations, DOI: 10.48550/arXiv.2205.15480 (2023)

  6. [6]

    Oikarinen, T., Das, S., Nguyen, L. M. & Weng, T.-W. Label-free concept bottleneck models. InThe Eleventh International Conference on Learning Representations, DOI: 10.48550/arXiv.2304.06129 (2023)

  7. [7]

    & Brox, T

    Schrodi, S., Schur, J., Argus, M. & Brox, T. Concept bottleneck models without predefined concepts.Transactions on Mach. Learn. Res.(2025). 9.Ciravegna, G.et al.Logic explained networks.Artif. Intell.314, 103822, DOI: 10.1016/j.artint.2022.103822 (2023)

  8. [8]

    Esteva, A.et al.Dermatologist-level classification of skin cancer with deep neural networks.Nature542, 115–118, DOI: 10.1038/nature21056 (2017)

  9. [9]

    Hauser, K.et al.Explainable artificial intelligence in skin cancer recognition: A systematic review.Eur. J. Cancer167, 54–69, DOI: 10.1016/j.ejca.2022.02.025 (2022)

  10. [10]

    O’Connor, and Kevin McGuinness

    Lucieri, A.et al.On interpretability of deep learning based skin lesion classifiers using concept activation vectors. In2020 International Joint Conference on Neural Networks (IJCNN), 1–10, DOI: 10.1109/IJCNN48605.2020.9206946 (2020)

  11. [11]

    Patrício, C., Neves, J. C. & Teixeira, L. F. Coherent concept-based explanations in medical image and its application to skin lesion diagnosis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 3798–3807, DOI: 10.1109/CVPRW59228.2023.10208381 (2023)

  12. [12]

    Patrício, C., Teixeira, L. F. & Neves, J. C. A two-step concept-based approach for enhanced interpretability and trust in skin lesion diagnosis.Comput. Struct. Biotechnol. J.28, 71–79, DOI: 10.1016/j.csbj.2025.02.013 (2025)

  13. [13]

    Commun.16, 4739, DOI: 10.1038/s41467-025-59532-5 (2025)

    Chanda, T.et al.Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: Eye-tracking study.Nat. Commun.16, 4739, DOI: 10.1038/s41467-025-59532-5 (2025)

  14. [14]

    , author Ferreira, P.M

    Mendonça, T., Ferreira, P. M., Marques, J. S., Marçal, A. R. S. & Rozeira, J. PH2 – a dermoscopic image database for research and benchmarking. In2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 5437–5440, DOI: 10.1109/EMBC.2013.6610779 (2013). 13/14

  15. [15]

    Scientific Data5(1), 180161 (Aug 2018)

    Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.Sci. Data5, 180161, DOI: 10.1038/sdata.2018.161 (2018)

  16. [16]

    Data11, 641, DOI: 10.1038/ s41597-024-03387-w (2024)

    Hernández-Pérez, C.et al.BCN20000: Dermoscopic lesions in the wild.Sci. Data11, 641, DOI: 10.1038/ s41597-024-03387-w (2024). 19.Combalia, M.et al.BCN20000: Dermoscopic lesions in the wild, DOI: 10.48550/arXiv.1908.02288 (2019)

  17. [17]

    Rotemberg, V .et al.A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci. Data8, 34, DOI: 10.1038/s41597-021-00815-z (2021)

  18. [18]

    P., Gencoglan, G

    Yilmaz, A., Yasar, S. P., Gencoglan, G. & Temelkuran, B. DERM12345: A large, multisource dermatoscopic skin lesion dataset with 40 subclasses.Sci. Data11, 1302, DOI: 10.1038/s41597-024-04104-3 (2024)

  19. [19]

    IEEE Journal of Biomedical and Health Informatics23(2), 538–546 (Mar 2019)

    Kawahara, J., Daneshvar, S., Argenziano, G. & Hamarneh, G. Seven-point checklist and skin lesion classification using multitask multimodal neural nets.IEEE J. Biomed. Heal. Informatics23, 538–546, DOI: 10.1109/JBHI.2018.2824327 (2019)

  20. [20]

    A., Afify, Y

    Saeed, M. A., Afify, Y . M., Badr, N. L. & Helal, N. A. Multimodal deep learning ensemble framework for skin cancer detection.Sci. Reports15, 45660, DOI: 10.1038/s41598-025-30534-z (2025). 24.Pawlak, Z. Rough sets.Int. J. Comput. Inf. Sci.11, 341–356, DOI: 10.1007/BF01001956 (1982)

  21. [21]

    Theory and Decision Library D (Springer, Dordrecht, 1991)

    Pawlak, Z.Rough Sets: Theoretical Aspects of Reasoning about Data. Theory and Decision Library D (Springer, Dordrecht, 1991)

  22. [22]

    & Sola, S

    Massone, C., Hofman-Wellenhof, R., Chiodi, S. & Sola, S. Dermoscopic criteria, histopathological correlates and genetic findings of thin melanoma on non-volar skin.Genes12, 1288 (2021)

  23. [23]

    Rodríguez-Lomba, E.et al.Concordance analysis of dermoscopic features between five observers in a sample of 200 dermoscopic images.Anais Brasileiros de Dermatol.97, 382–384 (2022)

  24. [24]

    Wilson, E. B. Probable inference, the law of succession, and statistical inference.J. Am. Stat. Assoc.22, 209–212, DOI: 10.1080/01621459.1927.10502953 (1927)

  25. [25]

    & Salgueiro, Y

    Nápoles, G., Grau, I., Jastrzebska, A. & Salgueiro, Y . Presumably correct decision sets.Pattern Recognit.141, 1–35, DOI: 10.1016/j.patcog.2023.109640 (2023)

  26. [26]

    & Chen, H

    Hou, J., Xu, J. & Chen, H. Concept-attention whitening for interpretable skin lesion diagnosis. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol. 15010 ofLecture Notes in Computer Science, 109–119, DOI: 10.1007/978-3-031-72117-5_11 (Springer, 2024)

  27. [27]

    Methods Programs Biomed.215, 106620, DOI: 10.1016/j.cmpb.2022.106620 (2022)

    Lucieri, A.et al.ExAID: A multimodal explanation framework for computer-aided diagnosis of skin lesions.Comput. Methods Programs Biomed.215, 106620, DOI: 10.1016/j.cmpb.2022.106620 (2022)

  28. [28]

    & Chen, H

    Bie, Y ., Luo, L. & Chen, H. MICA: Towards explainable skin lesion diagnosis via multi-level image-concept alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, 837–845, DOI: 10.1609/aaai.v38i2.27842 (2024)

  29. [29]

    Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Chaudhuri, K. & Salakhutdinov, R. (eds.)Proceedings of the 36th International Conference on Machine Learning, vol. 97 ofProceedings of Machine Learning Research, 6105–6114 (PMLR, 2019)

  30. [30]

    & Weinberger, K

    Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2017)

  31. [31]

    Wide Residual Networks

    He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)(2016). 36.Zagoruyko, S. & Komodakis, N. Wide residual networks.CoRRabs/1605.07146(2016). 14/14