
arxiv: 2603.16427 · v2 · submitted 2026-03-17 · 💻 cs.CV

Cross-modal learning for plankton recognition

Pith reviewed 2026-05-15 10:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords plankton recognition · cross-modal learning · self-supervised learning · multimodal representations · k-NN classification · optical profiles · species identification

The pith

Binary same-particle supervision from image-profile pairs produces representations for accurate plankton recognition with a small labeled gallery and k-NN.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training separate encoders for plankton images and their optical measurement profiles using only binary labels that indicate whether each image and profile come from the same particle. This cross-modal coordination produces representations that support species-level classification via k-NN on a minimal set of labeled examples. The resulting model is inherently multimodal and requires far less manual labeling than standard supervised methods. It also outperforms a comparable self-supervised baseline that uses images alone. The approach addresses the bottleneck of labeling large volumes of plankton data collected by automated imaging instruments that also record scatter and fluorescence profiles.

Core claim

Training image and profile encoders with a contrastive-style loss on binary same-particle versus different-particle supervision yields transferable representations; these representations, paired with a small labeled species gallery and a k-NN classifier, deliver high recognition accuracy while incorporating both modalities by design and surpassing image-only self-supervised baselines.

What carries the argument

Cross-modal coordination via binary same/different particle supervision to align image and profile encoders, followed by k-NN classification on a small labeled gallery.
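
The coordination step can be illustrated with a minimal sketch. The exact loss is not stated in this summary, so the code below assumes a symmetric InfoNCE objective in the style of CLIP, with a hypothetical temperature of 0.07 and made-up two-dimensional embeddings. It is written in plain Python for portability; a real implementation would use a tensor library and learned encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(image_emb, profile_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of both inputs comes from particle i."""
    img = [l2_normalize(v) for v in image_emb]
    prof = [l2_normalize(v) for v in profile_emb]
    n = len(img)
    # logits[i][j] = scaled cosine similarity between image i and profile j.
    logits = [[sum(a * b for a, b in zip(img[i], prof[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image -> profile direction: the same-particle profile is on the diagonal.
    loss_i2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Profile -> image direction: same computation on the transposed matrix.
    cols = [list(c) for c in zip(*logits)]
    loss_p2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2p + loss_p2i)

# Toy batch of three particles (illustrative values, not data from the paper).
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # break the particle pairing

loss_aligned = symmetric_info_nce(images, aligned)
loss_shuffled = symmetric_info_nce(images, shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The only supervision is the row index: a profile counts as a positive for an image when both sit at the same position in the batch, and every other pairing is treated as a negative. No species label appears anywhere in the loss.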

If this is right

  • Large volumes of unlabeled multimodal plankton data can be used for representation learning without manual labels.
  • The final recognition model can exploit both image and profile information at inference time.
  • Labeling effort reduces to curating a small gallery of known species rather than exhaustive training sets.
  • Performance gains over image-only self-supervision demonstrate the value of the additional optical modality.
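
The gallery-based recognition step can be sketched as follows. How the paper fuses the two modalities at inference time is not described in this summary, so the sketch assumes the simplest option, concatenating the unit-length image and profile embeddings. The species names, embedding values, and the choice k=3 are hypothetical.

```python
import math
from collections import Counter

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(image_emb, profile_emb):
    """Assumed fusion: concatenate the two unit-length embeddings."""
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(profile_emb))

def knn_predict(query, gallery, labels, k=3):
    """Majority vote over the k gallery items with highest cosine similarity."""
    sims = [sum(a * b for a, b in zip(query, g)) for g in gallery]
    top = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)[:k]
    return Counter(labels[i] for i in top).most_common(1)[0][0]

# Small labeled gallery: (image embedding, profile embedding) per particle.
gallery_pairs = [
    ([1.0, 0.1], [0.9, 0.0]),
    ([0.9, 0.2], [1.0, 0.1]),
    ([0.1, 1.0], [0.0, 0.9]),
    ([0.2, 0.9], [0.1, 1.0]),
]
gallery_labels = ["species_a", "species_a", "species_b", "species_b"]
gallery = [fuse(img, prof) for img, prof in gallery_pairs]

# Two unlabeled particles to classify.
prediction_1 = knn_predict(fuse([0.8, 0.3], [0.95, 0.05]), gallery, gallery_labels)
prediction_2 = knn_predict(fuse([0.1, 0.9], [0.2, 0.8]), gallery, gallery_labels)
print(prediction_1, prediction_2)
```

Because the classifier is a lookup against the gallery, adding a new species means adding a few labeled embeddings rather than retraining the encoders.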

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same binary coordination approach could transfer to other scientific imaging settings that pair images with sensor profiles, such as cell imaging or material analysis.
  • Replacing k-NN with a lightweight supervised head trained on the gallery might further reduce the required number of labeled examples.
  • Success on plankton data suggests particle identity serves as a natural, cheap supervisory signal for learning domain-specific features in biological imaging.

Load-bearing premise

Binary same-particle versus different-particle supervision is sufficient to produce representations that transfer well to species-level recognition when combined with a small labeled gallery and k-NN.

What would settle it

If, on a held-out plankton test set, the cross-modal model achieves no higher accuracy than the image-only self-supervised baseline when both use the same small gallery, the claimed benefit of profile coordination is refuted.
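
One way to run such a comparison is a paired test over random seeds. The sketch below uses an exact paired sign-flip permutation test, chosen here only because it needs nothing beyond the Python standard library; it is not taken from the paper. The accuracy lists are placeholder values invented for illustration and are not results from the paper.

```python
from itertools import product
from statistics import mean, stdev

def paired_sign_flip_p_value(a, b):
    """One-sided exact p-value for mean(a - b) > 0 under random sign flips."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = mean(diffs)
    count = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate all 2^n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if mean([s * d for s, d in zip(signs, diffs)]) >= observed - 1e-12:
            count += 1
    return count / total

# PLACEHOLDER accuracies per seed (invented for illustration only).
placeholder_cross_modal = [0.91, 0.93, 0.92, 0.94, 0.92]
placeholder_image_only = [0.88, 0.90, 0.91, 0.90, 0.89]

summary = {
    "cross_modal": (mean(placeholder_cross_modal), stdev(placeholder_cross_modal)),
    "image_only": (mean(placeholder_image_only), stdev(placeholder_image_only)),
}
p_value = paired_sign_flip_p_value(placeholder_cross_modal, placeholder_image_only)
print(summary, p_value)
```

With five seeds the smallest achievable one-sided p-value is 1/32, so more seeds are needed for a stricter threshold.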

Figures

Figures reproduced from arXiv: 2603.16427 by Heikki Kälviäinen, Joona Kareinen, Kaisa Kraft, Lasse Lensu, Lumi Haraguchi, Sanna Suikkanen, Tuomas Eerola, Veikka Immonen.

Figure 1: Overview of the proposed multimodal plankton recognition model for Cy…
Figure 2: Overview of the proposed multimodal recognition framework. (a) During…
Figure 3: Example image and profile data.
Figure 4: Pre-processing and data augmentations: (a) original profile and image, …
Figure 5: Average accuracy and standard deviation as a function of gallery set size, …
Original abstract

This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a self-supervised cross-modal contrastive learning framework for plankton recognition. Image and optical profile encoders are trained using only binary labels indicating whether an image-profile pair originates from the same physical particle. Recognition is performed by k-NN classification against a small labeled gallery of known species. The central claims are that this yields high species recognition accuracy with minimal labeled data and outperforms an image-only self-supervised baseline; the resulting model is inherently multimodal.

Significance. If the empirical claims hold with rigorous validation, the work would be significant for automated plankton monitoring by reducing reliance on expensive labeled datasets through exploitation of abundant unlabeled multimodal sensor data. It adapts CLIP-style contrastive pretraining to scientific instrumentation and enables joint use of image and profile modalities at inference time. Code release supports reproducibility.

major comments (2)
  1. [§4] §4 (Experiments): the abstract and method description claim outperformance over an image-only self-supervised baseline and high accuracy with few labels, yet no quantitative accuracy values, standard deviations, ablation tables, or error analysis are referenced in the provided summary. Specific numbers, dataset splits, and statistical significance tests must be supplied to substantiate the central empirical claim.
  2. [§3.2] §3.2 (Pretraining objective): the contrastive loss is defined solely on instance-level binary same/different-particle supervision. No analysis, embedding visualizations, or quantitative measure is given showing that this produces species-discriminative clusters rather than generic modality-invariant features. This assumption is load-bearing for the k-NN transfer claim and requires explicit validation (e.g., t-SNE or nearest-neighbor species purity metrics).
minor comments (2)
  1. [Abstract] The abstract states 'high recognition accuracy' without numbers; replace with concrete figures once §4 is expanded.
  2. [§3] Clarify the exact form of the contrastive loss (InfoNCE temperature, batch construction) and any differences from standard CLIP-style implementations.
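
To make the second minor comment concrete: binary same/different supervision fits a softmax-based InfoNCE loss, but it also fits a pairwise sigmoid loss in the style of SigLIP, where every image-profile pair is an independent binary decision. Which form the paper uses is not stated in this summary. The sketch below shows the sigmoid variant, assuming unit-length embeddings and hypothetical values for the scale and bias.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_sigmoid_loss(image_emb, profile_emb, scale=10.0, bias=-10.0):
    """Each (image i, profile j) pair is labeled +1 if i == j, else -1.

    Inputs are assumed to be unit-length vectors; scale and bias are
    hypothetical values, normally learned during training.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(image_emb[i], profile_emb[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sim + bias))
    return -total / n

# Toy unit-length embeddings for two particles (illustrative only).
images = [[1.0, 0.0], [0.0, 1.0]]
profiles_matched = [[1.0, 0.0], [0.0, 1.0]]
profiles_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = pairwise_sigmoid_loss(images, profiles_matched)
loss_swapped = pairwise_sigmoid_loss(images, profiles_swapped)
print(f"matched: {loss_matched:.4f}  swapped: {loss_swapped:.4f}")
```

Unlike the softmax form, this loss does not normalize across the batch, so batch construction affects it differently. That difference is one reason the exact form matters for reproducibility.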

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to provide the requested empirical details and validation.

  1. Referee: [§4] §4 (Experiments): the abstract and method description claim outperformance over an image-only self-supervised baseline and high accuracy with few labels, yet no quantitative accuracy values, standard deviations, ablation tables, or error analysis are referenced in the provided summary. Specific numbers, dataset splits, and statistical significance tests must be supplied to substantiate the central empirical claim.

    Authors: We agree that the experimental section requires expansion for rigor. In the revised manuscript we will add to §4: (i) concrete top-1 and top-5 accuracy figures with standard deviations over multiple random seeds for varying numbers of labeled gallery images, (ii) a full ablation table directly comparing the cross-modal model against the image-only self-supervised baseline, (iii) explicit dataset split sizes and sampling protocol, and (iv) statistical significance tests (paired t-test or Wilcoxon signed-rank) together with a short error analysis of misclassified species. These results already exist in our experimental logs and will be reported verbatim. revision: yes

  2. Referee: [§3.2] §3.2 (Pretraining objective): the contrastive loss is defined solely on instance-level binary same/different-particle supervision. No analysis, embedding visualizations, or quantitative measure is given showing that this produces species-discriminative clusters rather than generic modality-invariant features. This assumption is load-bearing for the k-NN transfer claim and requires explicit validation (e.g., t-SNE or nearest-neighbor species purity metrics).

    Authors: We acknowledge the need for explicit validation of species-level discrimination. While the particle-level supervision implicitly aligns same-species embeddings, we will add in the revision: (1) t-SNE plots of the joint embedding space colored by species labels from the labeled gallery, and (2) quantitative nearest-neighbor species purity scores (fraction of same-species neighbors within the k-NN radius) computed on held-out particles. These additions will appear in §3.2 or a new subsection and directly support the k-NN transfer claim. revision: yes
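
The nearest-neighbor species purity score mentioned in the second response can be sketched as follows. This is one plausible definition, assuming cosine similarity and a fixed neighbor count k rather than a radius; the embeddings and species labels are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbor_purity(embeddings, species, k=2):
    """Mean fraction of each item's k nearest neighbors sharing its species."""
    scores = []
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        nearest = sorted(others, key=lambda j: cosine(query, embeddings[j]),
                         reverse=True)[:k]
        scores.append(sum(species[j] == species[i] for j in nearest) / k)
    return sum(scores) / len(scores)

# Two tight clusters of toy embeddings (illustrative only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
              [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]

# Labels that follow the clusters versus labels that ignore them.
purity_clustered = neighbor_purity(embeddings, ["a", "a", "a", "b", "b", "b"])
purity_mixed = neighbor_purity(embeddings, ["a", "b", "a", "b", "a", "b"])
print(purity_clustered, purity_mixed)
```

A purity near 1 on held-out particles would indicate that particle-level training produced species-discriminative neighborhoods. A purity near the class prior would indicate that the embedding is only modality-invariant.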

Circularity Check

0 steps flagged

No circularity; standard contrastive pretraining with external binary supervision


The paper trains image and profile encoders via contrastive loss on binary same-particle versus different-particle labels that are directly obtained from the physical data collection process, then performs species recognition via k-NN on a small external labeled gallery. No equation or claim reduces the reported recognition accuracy to a parameter fitted inside the paper itself, nor does any step invoke a self-citation chain, uniqueness theorem, or ansatz that is defined only by the present work. The supervision signal is independent of the learned embeddings, and performance is measured against an image-only baseline on held-out data, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard contrastive-learning assumption that same-particle pairs provide a useful positive signal and on the transferability of the learned embeddings to a k-NN classifier; no free parameters or invented entities are introduced beyond those in the base CLIP-style loss.

axioms (1)
  • Domain assumption: binary same/different particle labels supply sufficient supervisory signal for representation learning.
    Invoked when the contrastive objective is defined on particle identity rather than species labels.



