Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Katarzyna Bozek; Kim Ouan; Noémie Moreau

arxiv: 2603.15269 · v2 · pith:CTLIRTANnew · submitted 2026-03-16 · 💻 cs.CV

Pith reviewed 2026-05-21 10:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords self-supervised learning · DINO · corneal nerves · tortuosity grading · confocal microscopy · medical image analysis · transfer learning · segmentation-free classification

The pith

Self-supervised features from ImageNet improve corneal nerve tortuosity grading to 84.25% accuracy without segmentation maps

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that self-supervised representations learned by DINO on ImageNet photographs can be transferred and fine-tuned for grading the tortuosity of corneal nerve fibers in in vivo confocal microscopy images. This approach outperforms existing methods that depend on segmentation maps, reaching 84.25 percent accuracy and 77.97 percent sensitivity. A sympathetic reader would care because creating segmentation maps is costly and time-intensive, so skipping them could make disease indication analysis more practical in clinical settings. The model achieves this by focusing on morphological elements like nerve shapes directly from the images. This suggests a path for applying general self-supervised models to domain-specific medical tasks with limited labeled data.

Core claim

Self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84.25%) and sensitivity (77.97%). The fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.
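
To pin down the two headline metrics, the following is a minimal sketch of how accuracy and sensitivity could be computed from predicted tortuosity grades. It assumes that sensitivity means recall averaged equally over grades (macro-averaged); this page does not say which definition the paper uses. The function names and toy labels are illustrative only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of images whose predicted grade equals the reference grade."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_sensitivity(y_true, y_pred):
    """Per-grade recall averaged over grades (ASSUMED definition of sensitivity)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)  # number of reference images per grade
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per grade
    return sum(hits[grade] / support[grade] for grade in support) / len(support)
```

With imbalanced grades the two numbers can diverge noticeably, which is one reason a report would normally state both, along with the averaging scheme.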

What carries the argument

The fine-tuned DINO self-supervised model that classifies tortuosity grades directly from raw in vivo confocal microscopy images by leveraging transferred ImageNet features.

If this is right

Tortuosity grading can proceed without the need for expensive segmentation maps of nerve fibers.
The method achieves higher accuracy and sensitivity than prior segmentation-dependent approaches.
General self-supervised features from natural images can be adapted to specialized medical imaging domains through fine-tuning.
The classifier attends to important morphological features relevant to disease indication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This transfer learning strategy might apply to other medical imaging tasks where annotation for segmentation is burdensome.
Testing on diverse patient populations could reveal how well the features generalize beyond the training dataset.
Combining this with other self-supervised advancements could further boost performance on small medical datasets.

Load-bearing premise

Self-supervised features learned on natural ImageNet photographs transfer meaningfully to in vivo confocal microscopy images of corneal nerves after fine-tuning on the specific dataset.

What would settle it

A drop in accuracy or sensitivity when the model is evaluated on an external, unseen set of confocal microscopy images collected under different conditions or from different patients.

Original abstract

The tortuosity of corneal nerve fibers are used as indication for different diseases. Current state-of-the-art methods for grading the tortuosity heavily rely on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84,25%) and sensitivity (77,97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DINO fine-tuning works for corneal nerve tortuosity grading without segmentation, but the evaluation lacks key details on data handling and generalizability.

The main thing to know is that this paper shows DINO pretrained on ImageNet can be fine-tuned for grading tortuosity in corneal nerves from confocal microscopy, hitting 84.25% accuracy and 77.97% sensitivity without needing segmentation maps, which beats the current methods that depend on them. They do a good job demonstrating the transferability here.

By fine-tuning the self-supervised features, the model seems to latch onto the morphological details that matter for the grading task. That's helpful because it skips the costly step of creating those segmentation maps, which is a real bottleneck in this kind of medical analysis.

What works well is the straightforward application: no fancy new tricks, just careful adaptation of an existing model. It makes the case that older self-supervised approaches like DINO still have value in specialized domains even after newer ones came along.

The weaker part is the experimental reporting. The performance claims are there, but details on how many images were used, whether splits were done at the patient level to prevent leakage, or how it compares to other non-segmentation baselines are missing or thin. With the big shift from everyday photos to these detailed nerve images, and typical small sizes in such datasets, it's easy to get inflated numbers from overfitting. More on the validation would make the generalizability clearer.

This is really for folks in medical image analysis focused on eye imaging or similar low-data scenarios. Someone looking for ways to cut annotation costs might pick up some ideas from it. I'd say send it to peer review. The results are interesting enough for experts to check the setup and see if the gains are robust.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that self-supervised DINO features pretrained on ImageNet transfer to in vivo confocal microscopy images of corneal nerves for tortuosity grading. After fine-tuning, the approach achieves 84.25% accuracy and 77.97% sensitivity, outperforming state-of-the-art segmentation-based methods by focusing on key morphological elements without requiring segmentation maps.

Significance. If the reported performance gains are validated with appropriate controls for generalizability, the work would demonstrate the value of reusing earlier self-supervised models for domain-shifted medical imaging tasks and could reduce reliance on expensive manual segmentations in corneal nerve analysis pipelines.

major comments (2)

[Abstract] The central performance claims (84.25% accuracy, 77.97% sensitivity) are stated without any disclosure of dataset size, number of patients/subjects, train/test split protocol (image-level vs. patient-level), or statistical significance testing, rendering it impossible to assess whether the numbers support the transfer and improvement assertions.
[Results] The manuscript provides no baseline comparisons, ablation studies, or external validation cohort to substantiate the claim of improvement over prior segmentation-dependent SOTA methods; without these, the reported gains cannot be distinguished from potential overfitting on a small domain-shifted dataset.
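
The image-level versus patient-level distinction raised in the first major comment can be illustrated with a short sketch. It assumes each image carries a patient identifier. The record layout, function name, and default test fraction are hypothetical and do not describe the authors' protocol.

```python
import random
from collections import defaultdict


def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split (image_id, patient_id) records so no patient appears in both sets.

    Returns (train_image_ids, test_image_ids).
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)
    if len(by_patient) < 2:
        raise ValueError("need at least two patients to form a split")

    # Shuffle patients (not images) so all images of a patient stay together.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    n_test = min(n_test, len(patients) - 1)  # keep at least one training patient
    test_patients = set(patients[:n_test])

    train = [img for p in patients if p not in test_patients for img in by_patient[p]]
    test = [img for p in patients if p in test_patients for img in by_patient[p]]
    return train, test
```

An image-level shuffle would instead let near-duplicate scans of the same eye land on both sides of the split, which is the leakage the referee is asking the authors to rule out.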

minor comments (1)

[Abstract] The decimal notation '84,25%' and '77,97%' may confuse readers; standardize to period notation or clarify regional convention in the text.
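
When such figures are collected programmatically, the comma-decimal notation can be normalized with a few lines. This is a small illustrative helper, not part of the paper; it assumes the input has no thousands separators, and the function name is made up for this example.

```python
def parse_percent(text):
    """Parse a percentage such as '84,25%' or '84.25 %' into a float (84.25).

    Assumes a single decimal separator (comma or period) and no
    thousands separators.
    """
    cleaned = text.strip().rstrip("%").strip().replace(",", ".")
    return float(cleaned)
```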

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has identified key areas where additional transparency and supporting analyses will strengthen the manuscript. We address each major comment below and outline the revisions planned for the next version.

Referee: [Abstract] The central performance claims (84.25% accuracy, 77.97% sensitivity) are stated without any disclosure of dataset size, number of patients/subjects, train/test split protocol (image-level vs. patient-level), or statistical significance testing, rendering it impossible to assess whether the numbers support the transfer and improvement assertions.

Authors: We agree that the abstract would benefit from these contextual details to allow proper evaluation of the claims. In the revised manuscript we will expand the abstract to state the dataset size, number of patients, the patient-level train/test split protocol used to avoid leakage, and the statistical testing performed (bootstrap confidence intervals). These additions directly address the concern and will be incorporated. revision: yes

Referee: [Results] The manuscript provides no baseline comparisons, ablation studies, or external validation cohort to substantiate the claim of improvement over prior segmentation-dependent SOTA methods; without these, the reported gains cannot be distinguished from potential overfitting on a small domain-shifted dataset.

Authors: The manuscript already reports direct numerical comparisons against prior segmentation-based SOTA methods in the results, with the proposed approach showing higher accuracy and sensitivity. To further substantiate the gains and mitigate overfitting concerns we will add ablation studies on feature extraction and fine-tuning choices. An external multi-center cohort is not available in the current study; we will therefore add a limitations paragraph discussing the patient-level cross-validation protocol employed and the need for future external validation. These revisions will be made. revision: partial
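
The bootstrap confidence intervals mentioned in the first response could look roughly like the percentile bootstrap below. This is a sketch under stated assumptions, not the authors' procedure: it resamples individual images, uses 2000 resamples, and reports a 95% interval. With several images per patient, resampling whole patients would be the more defensible choice.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    Resamples images with replacement (ASSUMPTION: images are independent).
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)

    # Accuracy of each resampled test set, sorted for percentile lookup.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))

    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lower_idx], stats[upper_idx]
```

A wide interval on a small test set would signal that a gain of a few percentage points over prior methods may not be distinguishable from sampling noise.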

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning results contain no derivation chain or self-referential reduction

The paper reports an empirical machine-learning experiment: a DINO model pretrained on ImageNet is fine-tuned on in-vivo confocal microscopy images to grade corneal-nerve tortuosity, achieving 84.25% accuracy and 77.97% sensitivity without segmentation maps. No equations, first-principles derivations, or predictions appear in the provided text. The performance numbers are direct experimental outputs of the fine-tuning procedure rather than quantities that reduce by construction to fitted parameters, self-citations, or ansatzes. The transfer assumption from natural images to the medical domain is an empirical claim subject to external validation, not a definitional or self-referential step. Consequently the derivation chain is empty and the result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the untested transferability of ImageNet self-supervised features to this medical domain and on the assumption that the fine-tuning procedure generalizes beyond the authors' dataset.

axioms (1)

domain assumption: Self-supervised features pretrained on ImageNet photographs are transferable to in vivo confocal microscopy images of corneal nerves.
Invoked implicitly as the foundation for using DINO without segmentation maps and for claiming improvement over prior methods.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

After careful fine-tuning, DINO improves upon the state-of-the-art in terms of accuracy (84,25%) and sensitivity (77,97%).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.