Cross-modal learning for plankton recognition

Heikki Kälviäinen; Joona Kareinen; Kaisa Kraft; Lasse Lensu; Lumi Haraguchi; Sanna Suikkanen; Tuomas Eerola; Veikka Immonen

arxiv: 2603.16427 · v2 · submitted 2026-03-17 · 💻 cs.CV

Joona Kareinen, Veikka Immonen, Tuomas Eerola, Lumi Haraguchi, Lasse Lensu, Kaisa Kraft, Sanna Suikkanen, Heikki Kälviäinen

Pith reviewed 2026-05-15 10:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords plankton recognition · cross-modal learning · self-supervised learning · multimodal representations · k-NN classification · optical profiles · species identification

The pith

Binary same-particle supervision from image-profile pairs produces representations for accurate plankton recognition with a small labeled gallery and k-NN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training separate encoders for plankton images and their optical measurement profiles using only binary labels that indicate whether each image and profile come from the same particle. This cross-modal coordination produces representations that support species-level classification via k-NN on a minimal set of labeled examples. The resulting model is inherently multimodal and requires far less manual labeling than standard supervised methods. It also outperforms a comparable self-supervised baseline that uses images alone. The approach addresses the bottleneck of labeling large volumes of plankton data collected by automated imaging instruments that also record scatter and fluorescence profiles.

Core claim

Training image and profile encoders with a contrastive-style loss on binary same-particle versus different-particle supervision yields transferable representations; these representations, paired with a small labeled species gallery and a k-NN classifier, deliver high recognition accuracy while incorporating both modalities by design and surpassing image-only self-supervised baselines.

What carries the argument

Cross-modal coordination via binary same/different particle supervision to align image and profile encoders, followed by k-NN classification on a small labeled gallery.
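As a rough illustration of the coordination step, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings, where the i-th image and the i-th profile come from the same particle and every other pairing counts as a different particle. It is a minimal pure-Python sketch under assumptions: the function names, the temperature of 0.07, and the use of cosine similarity are illustrative choices, not details confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(row, index):
    """Numerically stable log-softmax value of row[index]."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return row[index] - log_sum


def symmetric_info_nce(image_emb, profile_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; pair i is the same-particle (positive) pair.

    The temperature default is an assumed value, not taken from the paper.
    """
    images = [l2_normalize(v) for v in image_emb]
    profiles = [l2_normalize(v) for v in profile_emb]
    n = len(images)
    # logits[i][j]: scaled similarity between image i and profile j.
    logits = [
        [sum(a * b for a, b in zip(img, prof)) / temperature for prof in profiles]
        for img in images
    ]
    # Image-to-profile direction: each image must pick out its own profile.
    loss_i2p = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Profile-to-image direction: each profile must pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_p2i = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (loss_i2p + loss_p2i)
```

With correctly paired embeddings the loss approaches zero, while shuffling the profiles relative to the images drives it up; that gap is the signal the two encoders would be trained to minimize.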

If this is right

Large volumes of unlabeled multimodal plankton data can be used for representation learning without manual labels.
The final recognition model can exploit both image and profile information at inference time.
Labeling effort reduces to curating a small gallery of known species rather than exhaustive training sets.
Performance gains over image-only self-supervision demonstrate the value of the additional optical modality.
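One simple way to use both modalities at inference time is to fuse the two embeddings of a particle into a single vector before nearest-neighbor search. The sketch below concatenates unit-length image and profile embeddings; this fusion rule and the function names are assumptions made for illustration, and the paper may combine the modalities differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fuse_embeddings(image_emb, profile_emb):
    """Concatenate normalized image and profile embeddings (assumed fusion rule).

    Normalizing first keeps either modality from dominating the fused vector
    purely because of its scale.
    """
    return l2_normalize(image_emb) + l2_normalize(profile_emb)
```

The fused vector can then be compared against gallery vectors built the same way.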

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same binary coordination approach could transfer to other scientific imaging settings that pair images with sensor profiles, such as cell imaging or material analysis.
Replacing k-NN with a lightweight supervised head trained on the gallery might further reduce the required number of labeled examples.
Success on plankton data suggests particle identity serves as a natural, cheap supervisory signal for learning domain-specific features in biological imaging.

Load-bearing premise

Binary same-particle versus different-particle supervision is sufficient to produce representations that transfer well to species-level recognition when combined with a small labeled gallery and k-NN.

What would settle it

On a held-out plankton test set, if the cross-modal model with the same small gallery achieves no higher accuracy than the image-only self-supervised baseline, the benefit of profile coordination is refuted.
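A check of this kind could be run as a paired comparison over repeated gallery draws or random seeds. The sketch below computes the mean accuracy difference and a paired t statistic from two equally long lists of per-seed accuracies. The function name and the choice of a paired t statistic are assumptions for illustration; no accuracy values from the paper are used.

```python
import math
import statistics


def paired_t_statistic(cross_modal_acc, baseline_acc):
    """Return (mean difference, t statistic) for per-seed accuracy pairs.

    Both lists must hold one accuracy per shared seed, in the same order.
    """
    if len(cross_modal_acc) != len(baseline_acc):
        raise ValueError("both lists need one accuracy per shared seed")
    if len(cross_modal_acc) < 2:
        raise ValueError("at least two paired runs are required")
    diffs = [c - b for c, b in zip(cross_modal_acc, baseline_acc)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_stat = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return mean_diff, t_stat
```

A mean difference at or below zero would match the refuting outcome described above. A positive t statistic still has to be compared against a t distribution with n - 1 degrees of freedom, which this sketch leaves out.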

Figures

Figures reproduced from arXiv: 2603.16427 by Heikki Kälviäinen, Joona Kareinen, Kaisa Kraft, Lasse Lensu, Lumi Haraguchi, Sanna Suikkanen, Tuomas Eerola, Veikka Immonen.

**Figure 2.** Overview of the proposed multimodal recognition framework. (a) During

**Figure 3.** Example image and profile data

**Figure 4.** Pre-processing and data augmentations: (a) original profile and image,

**Figure 5.** Average accuracy and standard deviation as a function of gallery set size,

Original abstract

This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.
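The recognition step described in the abstract can be sketched as a nearest-neighbor vote against a small labeled gallery. The example below is a minimal pure-Python sketch under assumptions: cosine similarity as the metric, majority voting, the default k of 3, and the function names are all illustrative rather than taken from the paper or its repository.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, gallery_embeddings, gallery_labels, k=3):
    """Label a query embedding by majority vote among its k nearest gallery items."""
    scored = sorted(
        zip(gallery_embeddings, gallery_labels),
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    top_labels = [label for _, label in scored[:k]]
    # On a tied vote, Counter keeps first-seen order, so the label of the
    # closest neighbor wins.
    return Counter(top_labels).most_common(1)[0][0]
```

Because the encoders stay frozen at this stage, adding a new species only requires adding its labeled embeddings to the gallery.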

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

They adapt contrastive pretraining to align plankton images with optical profiles using only same-particle binary labels, which cuts labeling needs for species recognition but leaves open whether the profiles add real taxonomic structure beyond instance alignment.

This paper adapts the CLIP-style contrastive setup to plankton, pairing images with scatter and fluorescence profiles and training encoders with a binary same-particle versus different-particle signal. Downstream they use a small labeled gallery plus k-NN to classify species, and the abstract reports that the resulting model beats a plain image self-supervised baseline while needing few labels. Code is released, which helps anyone who wants to reproduce or extend it on their own instrument data.

The practical angle is clear: many plankton imagers already collect both modalities, so this turns existing unlabeled streams into a pretraining resource without new manual work. That part is straightforward and useful for scaling monitoring programs.

The soft spot is the supervision itself. The loss only enforces cross-modal consistency at the level of individual particles; it supplies no species or class signal. If optical profiles from different taxa overlap, the image embeddings can satisfy the objective without forming the tight species clusters that k-NN needs. The reported improvement over the image-only baseline therefore rests on an untested assumption that the profiles inject additional species-relevant structure rather than just regularizing the representation. Without numbers, ablations, or error breakdowns in the abstract it is hard to judge how large or consistent that gain actually is.

This is aimed at applied researchers who run or analyze high-throughput plankton imaging systems and already have or can collect paired profile data. A methods person in marine ecology or automated classification would get the most out of it.

I would send it to peer review. The idea is clean, the data constraints are real, and even a modest, well-documented gain in this domain is worth referee time to check the empirical details.

Referee Report

2 major / 2 minor

Summary. The paper proposes a self-supervised cross-modal contrastive learning framework for plankton recognition. Image and optical profile encoders are trained using only binary labels indicating whether an image-profile pair originates from the same physical particle. Recognition is performed by k-NN classification against a small labeled gallery of known species. The central claims are that this yields high species recognition accuracy with minimal labeled data and outperforms an image-only self-supervised baseline; the resulting model is inherently multimodal.

Significance. If the empirical claims hold with rigorous validation, the work would be significant for automated plankton monitoring by reducing reliance on expensive labeled datasets through exploitation of abundant unlabeled multimodal sensor data. It adapts CLIP-style contrastive pretraining to scientific instrumentation and enables joint use of image and profile modalities at inference time. Code release supports reproducibility.

major comments (2)

[§4] §4 (Experiments): the abstract and method description claim outperformance over an image-only self-supervised baseline and high accuracy with few labels, yet no quantitative accuracy values, standard deviations, ablation tables, or error analysis are referenced in the provided summary. Specific numbers, dataset splits, and statistical significance tests must be supplied to substantiate the central empirical claim.
[§3.2] §3.2 (Pretraining objective): the contrastive loss is defined solely on instance-level binary same/different-particle supervision. No analysis, embedding visualizations, or quantitative measure is given showing that this produces species-discriminative clusters rather than generic modality-invariant features. This assumption is load-bearing for the k-NN transfer claim and requires explicit validation (e.g., t-SNE or nearest-neighbor species purity metrics).

minor comments (2)

[Abstract] The abstract states 'high recognition accuracy' without numbers; replace with concrete figures once §4 is expanded.
[§3] Clarify the exact form of the contrastive loss (InfoNCE temperature, batch construction) and any differences from standard CLIP-style implementations.
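For orientation on what "exact form" covers, the two loss families named in this context are commonly written as below. These are the standard CLIP-style and SigLIP-style formulations, stated here as assumptions; whether the paper uses exactly these forms, and with which temperature or bias settings, is the open point raised in the comment. The symbols $u_i$, $v_j$, $\tau$, $t$, and $b$ are notation introduced for this sketch.

```latex
% u_i: L2-normalized image embedding of particle i
% v_j: L2-normalized profile embedding of particle j
% N:   batch size; pair (i, i) is the same-particle pair
\[
\mathcal{L}_{\mathrm{InfoNCE}}
= -\frac{1}{2N}\sum_{i=1}^{N}\left[
\log\frac{\exp(u_i^{\top}v_i/\tau)}{\sum_{j=1}^{N}\exp(u_i^{\top}v_j/\tau)}
+\log\frac{\exp(u_i^{\top}v_i/\tau)}{\sum_{j=1}^{N}\exp(u_j^{\top}v_i/\tau)}
\right]
\]
\[
\mathcal{L}_{\mathrm{sigmoid}}
= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}
\log\sigma\!\left(z_{ij}\,\bigl(t\,u_i^{\top}v_j+b\bigr)\right),
\qquad
z_{ij}=\begin{cases}1, & i=j\\ -1, & i\neq j\end{cases}
\]
```

In the first form every other particle in the batch acts as a negative through the softmax denominator, so batch construction directly shapes the objective. In the second form each image-profile pair is an independent binary decision, which matches the same-particle versus different-particle framing most directly.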

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested empirical details and validation.

Referee: [§4] §4 (Experiments): the abstract and method description claim outperformance over an image-only self-supervised baseline and high accuracy with few labels, yet no quantitative accuracy values, standard deviations, ablation tables, or error analysis are referenced in the provided summary. Specific numbers, dataset splits, and statistical significance tests must be supplied to substantiate the central empirical claim.

Authors: We agree that the experimental section requires expansion for rigor. In the revised manuscript we will add to §4: (i) concrete top-1 and top-5 accuracy figures with standard deviations over multiple random seeds for varying numbers of labeled gallery images, (ii) a full ablation table directly comparing the cross-modal model against the image-only self-supervised baseline, (iii) explicit dataset split sizes and sampling protocol, and (iv) statistical significance tests (paired t-test or Wilcoxon signed-rank) together with a short error analysis of misclassified species. These results already exist in our experimental logs and will be reported verbatim.

revision: yes

Referee: [§3.2] §3.2 (Pretraining objective): the contrastive loss is defined solely on instance-level binary same/different-particle supervision. No analysis, embedding visualizations, or quantitative measure is given showing that this produces species-discriminative clusters rather than generic modality-invariant features. This assumption is load-bearing for the k-NN transfer claim and requires explicit validation (e.g., t-SNE or nearest-neighbor species purity metrics).

Authors: We acknowledge the need for explicit validation of species-level discrimination. While the particle-level supervision implicitly aligns same-species embeddings, we will add in the revision: (1) t-SNE plots of the joint embedding space colored by species labels from the labeled gallery, and (2) quantitative nearest-neighbor species purity scores (fraction of same-species neighbors within the k-NN radius) computed on held-out particles. These additions will appear in §3.2 or a new subsection and directly support the k-NN transfer claim.

revision: yes
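The purity score mentioned in this exchange can be sketched as the average fraction of a particle's k nearest neighbors that share its species label. The example below is a minimal pure-Python sketch under assumptions: it uses a fixed k rather than a radius, cosine similarity as the metric, and hypothetical function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighbor_purity(embeddings, labels, k=5):
    """Mean fraction of each sample's k nearest neighbors sharing its label."""
    n = len(embeddings)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    total = 0.0
    for i in range(n):
        # Rank all other samples by similarity to sample i, most similar first.
        others = [j for j in range(n) if j != i]
        others.sort(
            key=lambda j: cosine_similarity(embeddings[i], embeddings[j]),
            reverse=True,
        )
        same = sum(1 for j in others[:k] if labels[j] == labels[i])
        total += same / k
    return total / n
```

A score near 1 would indicate that species form tight neighborhoods in the embedding space, while a score near the class prior would suggest the embeddings capture mostly particle-level rather than species-level structure.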

Circularity Check

0 steps flagged

No circularity; standard contrastive pretraining with external binary supervision

The paper trains image and profile encoders via contrastive loss on binary same-particle versus different-particle labels that are directly obtained from the physical data collection process, then performs species recognition via k-NN on a small external labeled gallery. No equation or claim reduces the reported recognition accuracy to a parameter fitted inside the paper itself, nor does any step invoke a self-citation chain, uniqueness theorem, or ansatz that is defined only by the present work. The supervision signal is independent of the learned embeddings, and performance is measured against an image-only baseline on held-out data, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard contrastive-learning assumption that same-particle pairs provide a useful positive signal and on the transferability of the learned embeddings to a k-NN classifier; no free parameters or invented entities are introduced beyond those in the base CLIP-style loss.

axioms (1)

domain assumption: Binary same/different particle labels supply sufficient supervisory signal for representation learning.
Invoked when the contrastive objective is defined on particle identity rather than species labels.

pith-pipeline@v0.9.0 · 5570 in / 1164 out tokens · 39504 ms · 2026-05-15T10:00:48.051351+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: train encoders ... using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles ... InfoNCE loss ... Sigmoid loss

IndisputableMonolith/Foundation/ArithmeticFromLogic · LogicNat recovery · unclear
Paper passage: nearest neighbor search in the embedding space ... k-NN classifier

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

