Cross-modal learning for plankton recognition
Pith reviewed 2026-05-15 10:00 UTC · model grok-4.3
The pith
Binary same-particle supervision from image-profile pairs produces representations for accurate plankton recognition with a small labeled gallery and k-NN.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training image and profile encoders with a contrastive-style loss on binary same-particle versus different-particle supervision yields transferable representations; these representations, paired with a small labeled species gallery and a k-NN classifier, deliver high recognition accuracy while incorporating both modalities by design and surpassing image-only self-supervised baselines.
What carries the argument
Cross-modal coordination via binary same/different particle supervision to align image and profile encoders, followed by k-NN classification on a small labeled gallery.
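As a rough illustration of how binary same/different-particle supervision can drive a contrastive-style loss, the sketch below implements a pairwise sigmoid loss in the spirit of the sigmoid loss used for language-image pre-training. It is not the authors' implementation: the function names, the `scale` and `bias` values, and the toy embeddings are assumptions made for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_emb, profile_emb, scale=10.0, bias=-10.0):
    """Binary log-loss summed over all image-profile pairs, divided by batch size.

    Assumption: image_emb[i] and profile_emb[i] come from the same particle,
    and every other pairing (i != j) is a different-particle pair.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    profs = [l2_normalize(v) for v in profile_emb]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(imgs[i], profs[j]))
            label = 1.0 if i == j else -1.0  # same vs. different particle
            total += -log_sigmoid(label * (scale * sim + bias))
    return total / n


# Toy check with hypothetical 2-D embeddings: aligned pairs should cost less
# than the same embeddings with the profiles swapped.
aligned = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a real training loop the encoders would produce the embeddings and the loss would be minimized by gradient descent; the sketch only shows how the binary label enters each pair's term.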
If this is right
- Large volumes of unlabeled multimodal plankton data can be used for representation learning without manual labels.
- The final recognition model can exploit both image and profile information at inference time.
- Labeling effort reduces to curating a small gallery of known species rather than exhaustive training sets.
- Performance gains over image-only self-supervision demonstrate the value of the additional optical modality.
Where Pith is reading between the lines
- The same binary coordination approach could transfer to other scientific imaging settings that pair images with sensor profiles, such as cell imaging or material analysis.
- Replacing k-NN with a lightweight supervised head trained on the gallery might further reduce the required number of labeled examples.
- Success on plankton data suggests particle identity serves as a natural, cheap supervisory signal for learning domain-specific features in biological imaging.
Load-bearing premise
Binary same-particle versus different-particle supervision is sufficient to produce representations that transfer well to species-level recognition when combined with a small labeled gallery and k-NN.
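To make the second half of this premise concrete, here is a minimal sketch of k-NN classification against a small labeled gallery using cosine similarity. The gallery contents, the species names, and the choice of `k` are placeholders, and how the paper combines image and profile embeddings into one vector per particle is not specified here, so the sketch simply assumes a single embedding per particle.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(query, gallery, k=3):
    """Majority vote over the k gallery items most similar to the query.

    gallery is a list of (embedding, species_label) pairs.
    """
    ranked = sorted(gallery, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embeddings and placeholder labels, not data from the paper.
gallery = [
    ([1.0, 0.1], "species_a"),
    ([0.9, 0.2], "species_a"),
    ([0.1, 1.0], "species_b"),
    ([0.2, 0.9], "species_b"),
]
prediction = knn_predict([0.8, 0.3], gallery, k=3)
```

Because the classifier has no trained parameters, adding a new species only requires adding its labeled examples to the gallery.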
What would settle it
On a held-out plankton test set, if the cross-modal model with the same small gallery achieves no higher accuracy than the image-only self-supervised baseline, the benefit of profile coordination is refuted.
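One way to run such a comparison over several random seeds is a paired test on per-seed accuracies. The sketch below uses an exact paired sign-flip permutation test, chosen here only because it needs no external libraries; it is not necessarily the test the authors use. The accuracy values are placeholders, not results from the paper.

```python
from itertools import product


def paired_permutation_pvalue(acc_a, acc_b):
    """Exact one-sided sign-flip test of whether mean(acc_a - acc_b) > 0.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign assignment is enumerated.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        mean = sum(s * d for s, d in zip(signs, diffs)) / n
        if mean >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


# Placeholder per-seed accuracies (NOT taken from the paper).
cross_modal = [0.90, 0.91, 0.89, 0.92, 0.90]
image_only = [0.86, 0.88, 0.87, 0.88, 0.85]
p_value = paired_permutation_pvalue(cross_modal, image_only)
```

With only five seeds the smallest achievable p-value is 1/32, which is worth keeping in mind when deciding how many seeds to run.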
Original abstract
This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a self-supervised cross-modal contrastive learning framework for plankton recognition. Image and optical profile encoders are trained using only binary labels indicating whether an image-profile pair originates from the same physical particle. Recognition is performed by k-NN classification against a small labeled gallery of known species. The central claims are that this yields high species recognition accuracy with minimal labeled data and outperforms an image-only self-supervised baseline; the resulting model is inherently multimodal.
Significance. If the empirical claims hold with rigorous validation, the work would be significant for automated plankton monitoring by reducing reliance on expensive labeled datasets through exploitation of abundant unlabeled multimodal sensor data. It adapts CLIP-style contrastive pretraining to scientific instrumentation and enables joint use of image and profile modalities at inference time. Code release supports reproducibility.
major comments (2)
- [§4] §4 (Experiments): the abstract and method description claim outperformance over an image-only self-supervised baseline and high accuracy with few labels, yet no quantitative accuracy values, standard deviations, ablation tables, or error analysis are referenced in the provided summary. Specific numbers, dataset splits, and statistical significance tests must be supplied to substantiate the central empirical claim.
- [§3.2] §3.2 (Pretraining objective): the contrastive loss is defined solely on instance-level binary same/different-particle supervision. No analysis, embedding visualizations, or quantitative measure is given showing that this produces species-discriminative clusters rather than generic modality-invariant features. This assumption is load-bearing for the k-NN transfer claim and requires explicit validation (e.g., t-SNE or nearest-neighbor species purity metrics).
minor comments (2)
- [Abstract] The abstract states 'high recognition accuracy' without numbers; replace with concrete figures once §4 is expanded.
- [§3] Clarify the exact form of the contrastive loss (InfoNCE temperature, batch construction) and any differences from standard CLIP-style implementations.
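For reference, the standard CLIP-style symmetric InfoNCE objective is written out below. This is the textbook form, not a statement of what the paper implements. The symbols are assumptions introduced here: $u_i$ and $v_i$ are the L2-normalized image and profile embeddings of particle $i$, $N$ is the batch size, and $\tau$ is the temperature.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\frac{1}{2N} \sum_{i=1}^{N} \left[
      \log \frac{\exp\left(u_i^{\top} v_i / \tau\right)}
                {\sum_{j=1}^{N} \exp\left(u_i^{\top} v_j / \tau\right)}
    + \log \frac{\exp\left(u_i^{\top} v_i / \tau\right)}
                {\sum_{j=1}^{N} \exp\left(u_j^{\top} v_i / \tau\right)}
    \right]
```

The first term contrasts each image against all profiles in the batch and the second contrasts each profile against all images. The details the referee asks about (the value or learnability of $\tau$, and how batches are constructed) determine which pairs count as negatives in the two denominators.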
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested empirical details and validation.
- Referee: [§4] §4 (Experiments): the abstract and method description claim outperformance over an image-only self-supervised baseline and high accuracy with few labels, yet no quantitative accuracy values, standard deviations, ablation tables, or error analysis are referenced in the provided summary. Specific numbers, dataset splits, and statistical significance tests must be supplied to substantiate the central empirical claim.
  Authors: We agree that the experimental section requires expansion for rigor. In the revised manuscript we will add to §4: (i) concrete top-1 and top-5 accuracy figures with standard deviations over multiple random seeds for varying numbers of labeled gallery images, (ii) a full ablation table directly comparing the cross-modal model against the image-only self-supervised baseline, (iii) explicit dataset split sizes and sampling protocol, and (iv) statistical significance tests (paired t-test or Wilcoxon signed-rank) together with a short error analysis of misclassified species. These results already exist in our experimental logs and will be reported verbatim.
  Revision: yes
- Referee: [§3.2] §3.2 (Pretraining objective): the contrastive loss is defined solely on instance-level binary same/different-particle supervision. No analysis, embedding visualizations, or quantitative measure is given showing that this produces species-discriminative clusters rather than generic modality-invariant features. This assumption is load-bearing for the k-NN transfer claim and requires explicit validation (e.g., t-SNE or nearest-neighbor species purity metrics).
  Authors: We acknowledge the need for explicit validation of species-level discrimination. While the particle-level supervision implicitly aligns same-species embeddings, we will add in the revision: (1) t-SNE plots of the joint embedding space colored by species labels from the labeled gallery, and (2) quantitative nearest-neighbor species purity scores (fraction of same-species neighbors within the k-NN radius) computed on held-out particles. These additions will appear in §3.2 or a new subsection and directly support the k-NN transfer claim.
  Revision: yes
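The nearest-neighbor species purity score mentioned in this exchange could be computed along the following lines. This is a minimal sketch under stated assumptions: cosine similarity as the neighbor metric, a fixed `k` rather than a radius, and toy embeddings with placeholder labels. None of these choices are confirmed by the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbor_purity(embeddings, labels, k=2):
    """Mean fraction of each item's k nearest neighbors that share its label."""
    n = len(embeddings)
    fractions = []
    for i in range(n):
        # Rank all other items by similarity to item i, most similar first.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine(embeddings[i], embeddings[j]), reverse=True)
        same = sum(1 for j in others[:k] if labels[j] == labels[i])
        fractions.append(same / k)
    return sum(fractions) / n


# Hypothetical embeddings forming two tight clusters (not data from the paper).
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
]
clustered = neighbor_purity(embeddings, ["a", "a", "a", "b", "b", "b"], k=2)
shuffled = neighbor_purity(embeddings, ["a", "b", "a", "b", "a", "b"], k=2)
```

A purity near 1 on held-out particles would indicate that species form compact neighborhoods in the embedding space, while a value near the class prior would suggest the features are modality-aligned but not species-discriminative.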
Circularity Check
No circularity; standard contrastive pretraining with external binary supervision
The paper trains image and profile encoders via contrastive loss on binary same-particle versus different-particle labels that are directly obtained from the physical data collection process, then performs species recognition via k-NN on a small external labeled gallery. No equation or claim reduces the reported recognition accuracy to a parameter fitted inside the paper itself, nor does any step invoke a self-citation chain, uniqueness theorem, or ansatz that is defined only by the present work. The supervision signal is independent of the learned embeddings, and performance is measured against an image-only baseline on held-out data, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Binary same/different particle labels supply sufficient supervisory signal for representation learning
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: train encoders ... using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles ... InfoNCE loss ... Sigmoid loss
- IndisputableMonolith/Foundation/ArithmeticFromLogic · LogicNat recovery · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: nearest neighbor search in the embedding space ... k-NN classifier
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.