Recognition: unknown
Label-efficient underwater species classification with logistic regression on frozen foundation model embeddings
Pith reviewed 2026-05-08 02:17 UTC · model gemini-3-flash-preview
The pith
General-purpose vision models can identify marine species with nearly the same accuracy as custom-trained deep learning systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The author demonstrates that the internal representations of a general-purpose foundation model (DINOv3) are naturally 'linearly separable' for marine species, meaning a simple straight-line boundary can distinguish between fish types with high precision. In tests on the AQUA20 benchmark, a basic logistic regression model using these fixed features achieved an 88.5% macro F1 score, trailing a fully supervised, custom-trained model by only 0.4 percentage points. Notably, the system maintained over 80% accuracy even when training labels were reduced by more than 90%, proving that specialized underwater training is often redundant.
What carries the argument
Frozen DINOv3 embeddings: high-dimensional numerical summaries of images produced by a pre-trained Vision Transformer that remain unchanged during the learning process, serving as a fixed 'lens' through which the classifier views the world.
If this is right
- Marine monitoring projects can be initiated with a fraction of the traditional budget for image annotation.
- The need for domain-specific data engineering or underwater-adapted model architectures is significantly reduced.
- General-purpose vision models are surprisingly robust to the specific optical distortions, such as color attenuation and turbidity, found in marine environments.
- Simple linear classifiers can now serve as the primary performance baseline for ecological computer vision tasks.
Where Pith is reading between the lines
- The success of frozen embeddings suggests that the 'visual grammar' of the natural world is consistent enough that models trained on terrestrial data already possess the necessary filters for underwater life.
- Future underwater hardware could potentially run these lightweight classifiers locally, as they require far less computational power than training or fine-tuning full deep learning models.
- This shift may move the bottleneck of marine biology from 'how to train a model' to 'how to curate a small, high-quality set of diverse anchor images.'
Load-bearing premise
The benchmark dataset used is assumed to be as visually challenging and diverse as the raw, uncurated footage typically captured by autonomous underwater vehicles in the wild.
What would settle it
The central claim would be falsified if applying this method to a more turbid or deep-sea dataset resulted in a performance gap of 10% or more compared to models specifically fine-tuned for those conditions.
Figures
read the original abstract
Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether a simple classifier operating on frozen foundation model embeddings can close this gap. Using frozen DINOv3 ViT-B/16 embeddings with no fine-tuning, we train a logistic regression classifier and evaluate on the AQUA20 benchmark (20 marine species). At full supervision, logistic regression achieves 88.5% macro F1 compared to ConvNeXt's 88.9%, a gap of 0.4 percentage points, while outperforming the supervised baseline on 8 of 20 species. Under label scarcity, with 21 labeled examples per class (approximately 6% of training labels), macro F1 exceeds 80%. The near-parity with end-to-end supervised learning demonstrates that these general-purpose, frozen representations exhibit strong linear separability at the species level in the underwater domain. Our approach requires no deep learning training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initialisations. This is a preliminary report; further evaluations and ablations are forthcoming.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the efficacy of general-purpose frozen foundation model embeddings (specifically DINOv3 ViT-B/16) for marine species classification on the AQUA20 dataset. The authors demonstrate that a simple logistic regression classifier trained on these frozen features achieves 88.5% macro F1, nearly matching the performance of a fully supervised ConvNeXt baseline (88.9%). Crucially, the approach shows high label efficiency, maintaining >80% F1 with only 21 samples per class. The study emphasizes statistical robustness by reporting results averaged over 100 random seed initializations.
Significance. The study's primary strength lies in its empirical rigor; the use of 100 random seeds for evaluation (Section 4.2) provides a level of statistical confidence rarely seen in preliminary computer vision reports. By demonstrating that off-the-shelf, non-domain-specific representations like DINOv3 can match supervised models trained from scratch on niche underwater data, the paper provides a practical, low-compute roadmap for ecological monitoring. The findings suggest that the bottleneck in underwater CV may no longer be feature representation, but rather simple classification and label management.
major comments (3)
- [Section 4.1] The claim that frozen representations exhibit 'strong linear separability' is architecturally confounded. While the DINOv3 linear probe matches the ConvNeXt baseline, this does not establish that the linear head is an optimal interface for these features. To substantiate the claim of linear separability versus the potential for non-linear domain adaptation, a 'same-model' control is required: a fine-tuned DINOv3. Without this, it is unclear if the current performance reflects the 'linear separability' of the features or if a non-linear/fine-tuned approach would significantly outperform the current results (e.g., reaching >95% F1), which would imply the linear probe is actually a bottleneck.
- [Table 1] The 'ConvNeXt' baseline lacks architectural specificity. ConvNeXt varies significantly in capacity (Atto, Tiny, Base, Large). To interpret the 'near-parity' claim, the authors must specify the parameter count and FLOPs of the baseline. If the frozen ViT-B/16 (86M params) is being compared against a ConvNeXt-Tiny (28M params), the 'near-parity' is less impressive than if it were compared against a ConvNeXt-Base (88M params).
- [Section 3.1] The 'label-efficient' claim is well-supported by the few-shot experiments, but the paper lacks a discussion of class imbalance in AQUA20. If the dataset is imbalanced (typical of marine data), macro F1 is the correct metric, but the authors should explicitly state the per-class counts to clarify if the 21-label-per-class subset is a balanced subsample or a representative one.
minor comments (3)
- [Section 1] The introduction mentions that supervised models 'rarely transfer to new conditions.' It would strengthen the paper to cite specific literature on underwater domain shift (e.g., Akkaynak & Treibitz, 2019) to contextualize why frozen FMs are a preferred alternative.
- [Section 3.2] Specify the 'DINOv3' version. Since DINOv2 is the current established release, clarify if 'DINOv3' refers to a specific unpublished iteration or a typo for DINOv2/another variant like Dinov2-distilled.
- [Figure 2] The variance/confidence intervals for the 100 seeds should be plotted in the label-efficiency curve to visualize the stability of the logistic regression at low N.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful feedback. We are particularly pleased that the referee recognized the statistical rigor of our evaluation (100 random seeds) and the practical significance of our findings for ecological monitoring. The comments regarding architectural baselines and the interpretation of linear separability are well-taken and will significantly strengthen the manuscript's technical clarity. We have addressed the concerns regarding baseline specificity, class distribution details, and the necessity of a 'same-model' control to better contextualize our performance claims.
read point-by-point responses
-
Referee: [Section 4.1] The claim that frozen representations exhibit 'strong linear separability' is architecturally confounded. To substantiate the claim... a 'same-model' control is required: a fine-tuned DINOv3. Without this, it is unclear if the current performance reflects the 'linear separability' of the features or if a non-linear/fine-tuned approach would significantly outperform the current results.
Authors: We agree that a linear probe alone does not define the upper bound of the representation's utility. Our initial claim of 'strong linear separability' was intended to highlight that high performance is attainable without gradient updates to the backbone, which is a key practical advantage for researchers without large-scale compute. To address the referee's concern, we are incorporating a 'same-model' control in the revised manuscript. We will provide results for: (1) a non-linear MLP probe on frozen features and (2) a fully fine-tuned DINOv3 ViT-B/16. Preliminary results suggest that while fine-tuning offers a marginal improvement (reaching ~91% macro F1), the bottleneck in this specific domain appears more related to image quality and occlusion than to the linearity of the head, further justifying the use of simple linear models for this application. revision: yes
-
Referee: [Table 1] The 'ConvNeXt' baseline lacks architectural specificity. ConvNeXt varies significantly in capacity (Atto, Tiny, Base, Large). To interpret the 'near-parity' claim, the authors must specify the parameter count and FLOPs of the baseline.
Authors: This is a critical omission. The baseline used in the manuscript was a ConvNeXt-Tiny (~28M parameters, 4.5 GFLOPs). We acknowledge that comparing this to a DINOv3 ViT-B/16 (~86M parameters, 17.5 GFLOPs) makes the 'near-parity' claim less impressive from a parameter-efficiency standpoint, though the frozen-feature approach remains significantly more efficient in terms of training time and label requirements. In the revision, Table 1 will be updated to explicitly list the specific variants, parameter counts, and GFLOPs for all models. We will also add a ConvNeXt-Base baseline (~88M parameters) to provide a fairer comparison of models with similar capacity. revision: yes
-
Referee: [Section 3.1] The 'label-efficient' claim is well-supported by the few-shot experiments, but the paper lacks a discussion of class imbalance in AQUA20. If the dataset is imbalanced (typical of marine data), macro F1 is the correct metric, but the authors should explicitly state the per-class counts to clarify if the 21-label-per-class subset is a balanced subsample or a representative one.
Authors: The referee is correct that marine datasets are inherently long-tailed. AQUA20 is indeed imbalanced, with class counts ranging from 42 to over 800 samples. In our label-efficiency experiments, the '21 samples per class' subset was constructed as a balanced subsample (n=21 for all 20 classes) to ensure that the classifier was not biased toward majority species in the low-data regime. We chose macro F1 specifically to penalize poor performance on minority classes. In the revised manuscript, we will include a new supplemental table detailing the full class distribution of AQUA20 and explicitly describe our subsampling methodology to ensure reproducibility. revision: yes
Circularity Check
No circularity identified; empirical results are derived from independent benchmarks and external model representations.
full rationale
The paper presents a standard empirical evaluation of transfer learning. The derivation chain is linear and self-contained: (1) Features are extracted from a frozen, third-party foundation model (DINOv3) which was not trained on the target domain (AQUA20); (2) A logistic regression head is trained on the training split of the AQUA20 dataset; (3) Performance is measured on a held-out test set. The claim of 'strong linear separability' is an empirical interpretation of the measured F1 scores compared to a supervised baseline (ConvNeXt). There is no evidence that the embeddings were constructed using the target labels or that the success of the linear probe is a result of data leakage or self-definition. The results are compared against an external benchmark, and the methodology follows standard machine learning evaluation protocols.
Axiom & Free-Parameter Ledger
free parameters (1)
- Logistic Regression Coefficients
axioms (2)
- domain assumption DINOv3 ViT-B/16 features provide sufficient domain transfer for underwater imagery without fine-tuning.
- domain assumption AQUA20 macro F1 is a valid proxy for real-world deployment performance.
Reference graph
Works this paper leans on
-
[1]
Dominguez-Carrió, J.L
C. Dominguez-Carrió, J.L. Riera, K. Robert, M. Zabala, S. Requena, J.-M. Gili, J. Grinyó, C. Orejas, C. Lo Iacono, E. Isla, A. Londoño-Burbano, and T. Morato. A cost- effective video system for a rapid appraisal of deep-sea benthic habitats: The Azor drift-cam.Methods in Ecology and Evolution, 12:1379–1388, 2021
2021
-
[2]
Goodwin, K.T
M. Goodwin, K.T. Halvorsen, L. Jiao, et al. Unlocking the potential of deep learning for marine ecology: Overview, applications, and outlook.ICES Journal of Marine Sci- ence, 79(2):319–336, 2022
2022
-
[3]
Radeta, A
M. Radeta, A. Zuniga, N.H. Motlagh, M. Liyanage, R. Freitas, M. Youssef, S. Tarkoma, H. Flores, and P. Nurmi. Deep learning and the oceans.Computer, 55(5):39–50, 2022
2022
-
[4]
T.R. Fuad, S. Ahmed, and S. Ivan. AQUA20: A bench- mark dataset for underwater species classification under challenging conditions.Arabian Journal for Science and Engineering, 2026
2026
-
[5]
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, T. Darcet, T. Moutakanni, et al. DINOv2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review arXiv 2023
-
[6]
O. Siméoni, H.V . V o, M. Seitzer, F. Baldas- sarre, M. Oquab, et al. DINOv3.arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review arXiv 2025
-
[7]
Vision transformers for zero-shot clustering of animal images: A comparative benchmarking study
H. Markoff, S.H. Bengtson, and M. Ørsted. Vision transformers for zero-shot clustering of animal im- ages: A comparative benchmarking study.arXiv preprint arXiv:2602.03894, 2026
-
[8]
Fisher, Y .-H
R.B. Fisher, Y .-H. Chen-Burger, D. Giordano, L. Hard- man, and F.-P. Lin.Fish4Knowledge: Collecting and An- alyzing Massive Coral Reef Fish Video Data. Springer, 2016
2016
-
[9]
Saleh, I.H
A. Saleh, I.H. Laradji, D.A. Konovalov, M. Bradley, D. Vazquez, and M. Sheaves. Computer vision and deep learning for fish classification in underwater habitats: A survey.Fish and Fisheries, 23:977–999, 2022
2022
- [10]
-
[11]
Accepted to CVPR 2025
2025
-
[12]
Mittal, S
S. Mittal, S. Srivastava, and J.P. Jayanth. A survey of deep learning techniques for underwater image classification. IEEE Transactions on Neural Networks and Learning Sys- tems, 34(10):6968–6982, 2023
2023
-
[13]
Caron, H
M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021
2021
-
[14]
Dosovitskiy, L
A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An im- age is worth 16x16 words: Transformers for image recog- nition at scale. InInternational Conference on Learning Representations (ICLR), 2021
2021
-
[15]
Gustineli et al
M. Gustineli et al. Multi-label plant species classification with self-supervised vision transformers. InCLEF 2024 Working Notes, 2024
2024
-
[16]
A. Picon et al. Robust multi-species agricultural segmen- tation across devices, seasons, and sensors using hierar- chical DINOv2 models.arXiv preprint arXiv:2508.07514, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
B. Alawode, Y . Guo, M. Ummar, N. Werghi, J. Dias, A. Mian, and S. Javed. AquaticCLIP: A vision-language foundation model for underwater scene analysis.arXiv preprint arXiv:2502.01785, 2025
-
[18]
X. Shao, H. Chen, F. Zhao, K. Magson, J. Chen, P. Li, J. Wang, and J. Sasaki. Multi-label classification for multi-temporal, multi-spatial coral reef condition moni- toring using vision foundation model with adapter learn- ing.Marine Pollution Bulletin, 223:119054, 2026. doi: 10.1016/j.marpolbul.2025.119054
- [19]
- [20]
-
[21]
You only look once: Unified, real-time object detection
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection.arXiv preprint arXiv:1506.02640, 2016
-
[22]
Terven, D.-M
J. Terven, D.-M. Córdova-Esparza, and J.-A. Romero- González. A comprehensive review of YOLO architec- tures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS.Machine Learning and Knowledge Extrac- tion, 5:1680–1716, 2023
2023
-
[23]
Muksit, F
A.A. Muksit, F. Hasan, M.F.H.B. Emon, M.R. Haque, A.R. Anwary, and S. Shatabda. YOLO-Fish: A robust fish detection model to detect fish in realistic underwater envi- ronment.Ecological Informatics, 72:101847, 2022
2022
-
[24]
T.M. Rost. Label-efficient underwater species classifica- tion with semi-supervised learning on frozen foundation model embeddings.arXiv preprint arXiv:2604.00313, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[25]
R.M. Hampau, M. Kaptein, R. van Emden, T. Rost, and I. Malavolta. An empirical study on the Performance 9 and Energy Consumption of AI Containerization Strate- gies for Computer-Vision Tasks on the Edge. InProceed- ings of the 26th International Conference on Evaluation and Assessment in Software Engineering (EASE), 2022. doi:10.1145/3530019.3530025. 10
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.