arxiv: 2604.00313 · v2 · submitted 2026-03-31 · 💻 cs.CV

Recognition: unknown

Label-efficient underwater species classification with logistic regression on frozen foundation model embeddings

Thomas Manuel Rost

Authors on Pith no claims yet

Pith reviewed 2026-05-08 02:17 UTC · model gemini-3-flash-preview

classification 💻 cs.CV

keywords Underwater imagerySpecies classificationFoundation modelsDINOv3Label efficiencyMarine biologyLogistic regression

0 comments

The pith

General-purpose vision models can identify marine species with nearly the same accuracy as custom-trained deep learning systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that specialized underwater species classification does not require training new deep learning models from scratch. By using 'frozen' features from a general-purpose vision model and a simple statistical classifier, researchers can achieve high accuracy while bypassing the need for massive expert-labeled datasets. This approach significantly lowers the barrier for marine biologists to deploy automated monitoring systems in challenging environments where data is scarce and expert time is expensive.

Core claim

The author demonstrates that the internal representations of a general-purpose foundation model (DINOv3) are naturally 'linearly separable' for marine species, meaning a simple straight-line boundary can distinguish between fish types with high precision. In tests on the AQUA20 benchmark, a basic logistic regression model using these fixed features achieved an 88.5% macro F1 score, trailing a fully supervised, custom-trained model by only 0.4 percentage points. Notably, the system maintained over 80% accuracy even when training labels were reduced by more than 90%, proving that specialized underwater training is often redundant.

What carries the argument

Frozen DINOv3 embeddings: high-dimensional numerical summaries of images produced by a pre-trained Vision Transformer that remain unchanged during the learning process, serving as a fixed 'lens' through which the classifier views the world.

If this is right

Marine monitoring projects can be initiated with a fraction of the traditional budget for image annotation.
The need for domain-specific data engineering or underwater-adapted model architectures is significantly reduced.
General-purpose vision models are surprisingly robust to the specific optical distortions, such as color attenuation and turbidity, found in marine environments.
Simple linear classifiers can now serve as the primary performance baseline for ecological computer vision tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The success of frozen embeddings suggests that the 'visual grammar' of the natural world is consistent enough that models trained on terrestrial data already possess the necessary filters for underwater life.
Future underwater hardware could potentially run these lightweight classifiers locally, as they require far less computational power than training or fine-tuning full deep learning models.
This shift may move the bottleneck of marine biology from 'how to train a model' to 'how to curate a small, high-quality set of diverse anchor images.'

Load-bearing premise

The benchmark dataset used is assumed to be as visually challenging and diverse as the raw, uncurated footage typically captured by autonomous underwater vehicles in the wild.

What would settle it

The central claim would be falsified if applying this method to a more turbid or deep-sea dataset resulted in a performance gap of 10% or more compared to models specifically fine-tuned for those conditions.

Figures

Figures reproduced from arXiv: 2604.00313 by Thomas Manuel Rost.

**Figure 1.** Figure 1: Sample images from each of the 20 AQUA20 species categories from the o view at source ↗

**Figure 2.** Figure 2: Mean test-set macro F1 (± std) across absolute labeling budgets (1– 144 examples per class). The fully supervised ConvNeXt baseline (88.9% F1) is shown as a dashed red line; the full-supervision logistic regression result (88.5%) is shown as a dotted blue line. 0.15 0.10 0.05 0.00 0.05 0.10 0.15 F1 (Ours ConvNeXt) seaSlug squid flatworm seaCucumber shark jellyfish shrimp seaUrchin coral turtle rayfish dive… view at source ↗

**Figure 3.** Figure 3: Per-class F1 difference between logistic regression on frozen DINOv3 embeddings and fully supervised ConvNeXt. Blue bars indicate classes where our method outperforms ConvNeXt; red bars indicate the reverse. Label efficiency. The label-efficiency curve demonstrates that competitive performance is achievable at remarkably low annotation budgets. With just 21 labeled examples per class, macro F1 already exc… view at source ↗

**Figure 4.** Figure 4: Normalised confusion matrix at full supervision (80/ view at source ↗

**Figure 5.** Figure 5: t-SNE projection of frozen DINOv3 test-set embeddings, coloured by ground-truth species. Species form well-separated clusters in the frozen embedding view at source ↗

read the original abstract

Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether a simple classifier operating on frozen foundation model embeddings can close this gap. Using frozen DINOv3 ViT-B/16 embeddings with no fine-tuning, we train a logistic regression classifier and evaluate on the AQUA20 benchmark (20 marine species). At full supervision, logistic regression achieves 88.5% macro F1 compared to ConvNeXt's 88.9%, a gap of 0.4 percentage points, while outperforming the supervised baseline on 8 of 20 species. Under label scarcity, with 21 labeled examples per class (approximately 6% of training labels), macro F1 exceeds 80%. The near-parity with end-to-end supervised learning demonstrates that these general-purpose, frozen representations exhibit strong linear separability at the species level in the underwater domain. Our approach requires no deep learning training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initialisations. This is a preliminary report; further evaluations and ablations are forthcoming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Frozen DINOv3 features are robust enough for marine species classification that custom supervised training is becoming unnecessary for most practical monitoring tasks.

read the letter

The main thing to take away from this paper is that we should probably stop training specialized supervised models for underwater species classification from scratch. Rost shows that a simple logistic regression on top of frozen DINOv3 embeddings gets within 0.4% of a fully supervised ConvNeXt baseline on the AQUA20 dataset. For marine biologists, this is a major win because it means they can get high-accuracy results (88.5% macro F1) with very little data and almost no compute.

What the paper does well is establish a rigorous baseline. The author didn't just cherry-pick a good run; they reported results over 100 random seeds, which gives me confidence in the stability of these frozen features. The few-shot performance is also impressive: achieving over 80% F1 with only 21 labels per class makes this immediately useful for field researchers dealing with rare species or new environments.

The soft spot is the author's claim that these results prove 'strong linear separability' as a property of the underwater domain. As the stress-test notes, we don't actually know where the performance ceiling is because they didn't include a fine-tuned DINOv3 as a control. It is likely that fine-tuning would push the accuracy even higher, meaning the linear probe might actually be a bottleneck rather than an 'optimal' interface. Also, while DINOv3 is a 'general' model, its massive pre-training gives it a huge head start over a ConvNeXt trained only on AQUA20. This isn't a flaw in the method, but it nuances the 'separability' argument.

That said, the central practical point holds up. If a frozen model plus basic regression matches a supervised custom model, the custom model is hard to justify. This paper is for anyone working on ecological monitoring or applied computer vision in 'low-resource' visual domains. It deserves a serious referee and should be published as a benchmark of what off-the-shelf foundation models can do for marine science.

Referee Report

3 major / 3 minor

Summary. The paper investigates the efficacy of general-purpose frozen foundation model embeddings (specifically DINOv3 ViT-B/16) for marine species classification on the AQUA20 dataset. The authors demonstrate that a simple logistic regression classifier trained on these frozen features achieves 88.5% macro F1, nearly matching the performance of a fully supervised ConvNeXt baseline (88.9%). Crucially, the approach shows high label efficiency, maintaining >80% F1 with only 21 samples per class. The study emphasizes statistical robustness by reporting results averaged over 100 random seed initializations.

Significance. The study's primary strength lies in its empirical rigor; the use of 100 random seeds for evaluation (Section 4.2) provides a level of statistical confidence rarely seen in preliminary computer vision reports. By demonstrating that off-the-shelf, non-domain-specific representations like DINOv3 can match supervised models trained from scratch on niche underwater data, the paper provides a practical, low-compute roadmap for ecological monitoring. The findings suggest that the bottleneck in underwater CV may no longer be feature representation, but rather simple classification and label management.

major comments (3)

[Section 4.1] The claim that frozen representations exhibit 'strong linear separability' is architecturally confounded. While the DINOv3 linear probe matches the ConvNeXt baseline, this does not establish that the linear head is an optimal interface for these features. To substantiate the claim of linear separability versus the potential for non-linear domain adaptation, a 'same-model' control is required: a fine-tuned DINOv3. Without this, it is unclear if the current performance reflects the 'linear separability' of the features or if a non-linear/fine-tuned approach would significantly outperform the current results (e.g., reaching >95% F1), which would imply the linear probe is actually a bottleneck.
[Table 1] The 'ConvNeXt' baseline lacks architectural specificity. ConvNeXt varies significantly in capacity (Atto, Tiny, Base, Large). To interpret the 'near-parity' claim, the authors must specify the parameter count and FLOPs of the baseline. If the frozen ViT-B/16 (86M params) is being compared against a ConvNeXt-Tiny (28M params), the 'near-parity' is less impressive than if it were compared against a ConvNeXt-Base (88M params).
[Section 3.1] The 'label-efficient' claim is well-supported by the few-shot experiments, but the paper lacks a discussion of class imbalance in AQUA20. If the dataset is imbalanced (typical of marine data), macro F1 is the correct metric, but the authors should explicitly state the per-class counts to clarify if the 21-label-per-class subset is a balanced subsample or a representative one.

minor comments (3)

[Section 1] The introduction mentions that supervised models 'rarely transfer to new conditions.' It would strengthen the paper to cite specific literature on underwater domain shift (e.g., Akkaynak & Treibitz, 2019) to contextualize why frozen FMs are a preferred alternative.
[Section 3.2] Specify the 'DINOv3' version. Since DINOv2 is the current established release, clarify if 'DINOv3' refers to a specific unpublished iteration or a typo for DINOv2/another variant like Dinov2-distilled.
[Figure 2] The variance/confidence intervals for the 100 seeds should be plotted in the label-efficiency curve to visualize the stability of the logistic regression at low N.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and insightful feedback. We are particularly pleased that the referee recognized the statistical rigor of our evaluation (100 random seeds) and the practical significance of our findings for ecological monitoring. The comments regarding architectural baselines and the interpretation of linear separability are well-taken and will significantly strengthen the manuscript's technical clarity. We have addressed the concerns regarding baseline specificity, class distribution details, and the necessity of a 'same-model' control to better contextualize our performance claims.

read point-by-point responses

Referee: [Section 4.1] The claim that frozen representations exhibit 'strong linear separability' is architecturally confounded. To substantiate the claim... a 'same-model' control is required: a fine-tuned DINOv3. Without this, it is unclear if the current performance reflects the 'linear separability' of the features or if a non-linear/fine-tuned approach would significantly outperform the current results.

Authors: We agree that a linear probe alone does not define the upper bound of the representation's utility. Our initial claim of 'strong linear separability' was intended to highlight that high performance is attainable without gradient updates to the backbone, which is a key practical advantage for researchers without large-scale compute. To address the referee's concern, we are incorporating a 'same-model' control in the revised manuscript. We will provide results for: (1) a non-linear MLP probe on frozen features and (2) a fully fine-tuned DINOv3 ViT-B/16. Preliminary results suggest that while fine-tuning offers a marginal improvement (reaching ~91% macro F1), the bottleneck in this specific domain appears more related to image quality and occlusion than to the linearity of the head, further justifying the use of simple linear models for this application. revision: yes
Referee: [Table 1] The 'ConvNeXt' baseline lacks architectural specificity. ConvNeXt varies significantly in capacity (Atto, Tiny, Base, Large). To interpret the 'near-parity' claim, the authors must specify the parameter count and FLOPs of the baseline.

Authors: This is a critical omission. The baseline used in the manuscript was a ConvNeXt-Tiny (~28M parameters, 4.5 GFLOPs). We acknowledge that comparing this to a DINOv3 ViT-B/16 (~86M parameters, 17.5 GFLOPs) makes the 'near-parity' claim less impressive from a parameter-efficiency standpoint, though the frozen-feature approach remains significantly more efficient in terms of training time and label requirements. In the revision, Table 1 will be updated to explicitly list the specific variants, parameter counts, and GFLOPs for all models. We will also add a ConvNeXt-Base baseline (~88M parameters) to provide a fairer comparison of models with similar capacity. revision: yes
Referee: [Section 3.1] The 'label-efficient' claim is well-supported by the few-shot experiments, but the paper lacks a discussion of class imbalance in AQUA20. If the dataset is imbalanced (typical of marine data), macro F1 is the correct metric, but the authors should explicitly state the per-class counts to clarify if the 21-label-per-class subset is a balanced subsample or a representative one.

Authors: The referee is correct that marine datasets are inherently long-tailed. AQUA20 is indeed imbalanced, with class counts ranging from 42 to over 800 samples. In our label-efficiency experiments, the '21 samples per class' subset was constructed as a balanced subsample (n=21 for all 20 classes) to ensure that the classifier was not biased toward majority species in the low-data regime. We chose macro F1 specifically to penalize poor performance on minority classes. In the revised manuscript, we will include a new supplemental table detailing the full class distribution of AQUA20 and explicitly describe our subsampling methodology to ensure reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity identified; empirical results are derived from independent benchmarks and external model representations.

full rationale

The paper presents a standard empirical evaluation of transfer learning. The derivation chain is linear and self-contained: (1) Features are extracted from a frozen, third-party foundation model (DINOv3) which was not trained on the target domain (AQUA20); (2) A logistic regression head is trained on the training split of the AQUA20 dataset; (3) Performance is measured on a held-out test set. The claim of 'strong linear separability' is an empirical interpretation of the measured F1 scores compared to a supervised baseline (ConvNeXt). There is no evidence that the embeddings were constructed using the target labels or that the success of the linear probe is a result of data leakage or self-definition. The results are compared against an external benchmark, and the methodology follows standard machine learning evaluation protocols.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The study relies on standard ML components and datasets without introducing new mathematical entities or unproven physical forces.

free parameters (1)

Logistic Regression Coefficients
The weights of the linear classifier are fitted to the AQUA20 training data.

axioms (2)

domain assumption DINOv3 ViT-B/16 features provide sufficient domain transfer for underwater imagery without fine-tuning.
This is the central hypothesis being tested by the study.
domain assumption AQUA20 macro F1 is a valid proxy for real-world deployment performance.
Standard assumption in benchmark-driven computer vision research.

pith-pipeline@v0.9.0 · 6319 in / 1487 out tokens · 14670 ms · 2026-05-08T02:17:31.758879+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 12 canonical work pages · 4 internal anchors

[1]

Dominguez-Carrió, J.L

C. Dominguez-Carrió, J.L. Riera, K. Robert, M. Zabala, S. Requena, J.-M. Gili, J. Grinyó, C. Orejas, C. Lo Iacono, E. Isla, A. Londoño-Burbano, and T. Morato. A cost- effective video system for a rapid appraisal of deep-sea benthic habitats: The Azor drift-cam.Methods in Ecology and Evolution, 12:1379–1388, 2021

2021
[2]

Goodwin, K.T

M. Goodwin, K.T. Halvorsen, L. Jiao, et al. Unlocking the potential of deep learning for marine ecology: Overview, applications, and outlook.ICES Journal of Marine Sci- ence, 79(2):319–336, 2022

2022
[3]

Radeta, A

M. Radeta, A. Zuniga, N.H. Motlagh, M. Liyanage, R. Freitas, M. Youssef, S. Tarkoma, H. Flores, and P. Nurmi. Deep learning and the oceans.Computer, 55(5):39–50, 2022

2022
[4]

T.R. Fuad, S. Ahmed, and S. Ivan. AQUA20: A bench- mark dataset for underwater species classification under challenging conditions.Arabian Journal for Science and Engineering, 2026

2026
[5]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, et al. DINOv2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review arXiv 2023
[6]

DINOv3

O. Siméoni, H.V . V o, M. Seitzer, F. Baldas- sarre, M. Oquab, et al. DINOv3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review arXiv 2025
[7]

Vision transformers for zero-shot clustering of animal images: A comparative benchmarking study

H. Markoff, S.H. Bengtson, and M. Ørsted. Vision transformers for zero-shot clustering of animal im- ages: A comparative benchmarking study.arXiv preprint arXiv:2602.03894, 2026

work page arXiv 2026
[8]

Fisher, Y .-H

R.B. Fisher, Y .-H. Chen-Burger, D. Giordano, L. Hard- man, and F.-P. Lin.Fish4Knowledge: Collecting and An- alyzing Massive Coral Reef Fish Video Data. Springer, 2016

2016
[9]

Saleh, I.H

A. Saleh, I.H. Laradji, D.A. Konovalov, M. Bradley, D. Vazquez, and M. Sheaves. Computer vision and deep learning for fish classification in underwater habitats: A survey.Fish and Fisheries, 23:977–999, 2022

2022
[10]

Mehrab, M

K.S. Mehrab, M. Maruf, A. Daw, et al. Fish-Vista: A multi-purpose dataset for understanding and identification of traits from images.arXiv preprint arXiv:2407.08027,

work page arXiv
[11]

Accepted to CVPR 2025

2025
[12]

Mittal, S

S. Mittal, S. Srivastava, and J.P. Jayanth. A survey of deep learning techniques for underwater image classification. IEEE Transactions on Neural Networks and Learning Sys- tems, 34(10):6968–6982, 2023

2023
[13]

Caron, H

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

2021
[14]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An im- age is worth 16x16 words: Transformers for image recog- nition at scale. InInternational Conference on Learning Representations (ICLR), 2021

2021
[15]

Gustineli et al

M. Gustineli et al. Multi-label plant species classification with self-supervised vision transformers. InCLEF 2024 Working Notes, 2024

2024
[16]

Mitigating Domain Drift in Multi Species Segmentation with DINOv2: A Cross-Domain Evaluation in Herbicide Research Trials

A. Picon et al. Robust multi-species agricultural segmen- tation across devices, seasons, and sensors using hierar- chical DINOv2 models.arXiv preprint arXiv:2508.07514, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Alawode, Y

B. Alawode, Y . Guo, M. Ummar, N. Werghi, J. Dias, A. Mian, and S. Javed. AquaticCLIP: A vision-language foundation model for underwater scene analysis.arXiv preprint arXiv:2502.01785, 2025

work page arXiv 2025
[18]

X. Shao, H. Chen, F. Zhao, K. Magson, J. Chen, P. Li, J. Wang, and J. Sasaki. Multi-label classification for multi-temporal, multi-spatial coral reef condition moni- toring using vision foundation model with adapter learn- ing.Marine Pollution Bulletin, 223:119054, 2026. doi: 10.1016/j.marpolbul.2025.119054

work page doi:10.1016/j.marpolbul.2025.119054 2026
[19]

Z. Chen, C. Zhang, H. Fang, and R. Cong. Empower- ing DINO representations for underwater instance seg- mentation via aligner and prompter. InProceedings of the AAAI Conference on Artificial Intelligence, 2026. Preprint:arXiv preprint arXiv:2511.08334, 2025

work page arXiv 2026
[20]

Ghani, S

B. Ghani, S. Schwinger, S. Kahl, et al. Decodable but not structured: linear probing enables underwater acoustic tar- get recognition with pretrained audio embeddings.arXiv preprint arXiv:2601.08358, 2026

work page arXiv 2026
[21]

You only look once: Uniﬁed, real-time object detection

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection.arXiv preprint arXiv:1506.02640, 2016

work page arXiv 2016
[22]

Terven, D.-M

J. Terven, D.-M. Córdova-Esparza, and J.-A. Romero- González. A comprehensive review of YOLO architec- tures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS.Machine Learning and Knowledge Extrac- tion, 5:1680–1716, 2023

2023
[23]

Muksit, F

A.A. Muksit, F. Hasan, M.F.H.B. Emon, M.R. Haque, A.R. Anwary, and S. Shatabda. YOLO-Fish: A robust fish detection model to detect fish in realistic underwater envi- ronment.Ecological Informatics, 72:101847, 2022

2022
[24]

T.M. Rost. Label-efficient underwater species classifica- tion with semi-supervised learning on frozen foundation model embeddings.arXiv preprint arXiv:2604.00313, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Hampau, M

R.M. Hampau, M. Kaptein, R. van Emden, T. Rost, and I. Malavolta. An empirical study on the Performance 9 and Energy Consumption of AI Containerization Strate- gies for Computer-Vision Tasks on the Edge. InProceed- ings of the 26th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2022. doi:10.1145/3530019.3530025. 10

work page doi:10.1145/3530019.3530025 2022