
arxiv: 2604.17856 · v1 · submitted 2026-04-20 · 💻 cs.CV

PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation

Pith reviewed 2026-05-10 05:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords plankton instance segmentation · pseudo community images · vision transformers · masked autoencoders · synthetic data generation · Mask2Former · aquatic ecosystem monitoring

The pith

PlankFormer uses synthesized pseudo community images and MAE-pretrained vision transformers to segment plankton instances more accurately than standard CNNs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to automate segmentation of individual plankton in crowded microscopic images, which is essential for aquatic ecosystem monitoring but hindered by scarce pixel-level labels and the limitations of CNNs in handling debris and overlaps. It generates labeled Pseudo Community Images by placing real plankton specimens onto varied backgrounds, including those from generative models, to create training data without extensive manual annotation of real scenes. The model employs a Vision Transformer backbone pre-trained via Masked Autoencoder on unlabeled individual plankton images, decoded with Mask2Former to capture global features for better distinction amid clutter. If correct, this would deliver high-precision results on real data, especially in debris-dense conditions, using far less labeled real imagery than before.

Core claim

PlankFormer generates labeled Pseudo Community Images by synthesizing individual plankton onto diverse backgrounds and pairs this with a Vision Transformer backbone pre-trained by Masked Autoencoder on unlabeled plankton images, using a Mask2Former decoder to perform instance segmentation that significantly outperforms Mask R-CNN on real-world datasets, particularly under high debris density, while requiring fewer manual annotations.
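
The synthesis step in the core claim can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_pci`, the nested-list image format, and the restriction to a horizontal flip plus random placement are assumptions made here for brevity (the paper also mentions rotation and scaling). The point it shows is that applying the same transform to each crop and its mask yields pixel-level instance labels for free.

```python
import random

def make_pci(background, individuals, seed=0):
    """Paste (crop, mask) pairs onto a background.

    background:  2D list of grayscale values.
    individuals: list of (crop, mask) pairs of equal shape; mask is 1 on
                 the organism and 0 elsewhere.
    Returns the composited image and an instance-ID map (0 = background).
    """
    rng = random.Random(seed)
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]        # copy, leave the input untouched
    labels = [[0] * width for _ in range(height)]

    for instance_id, (crop, mask) in enumerate(individuals, start=1):
        # Apply the same random horizontal flip to the crop and its mask.
        if rng.random() < 0.5:
            crop = [row[::-1] for row in crop]
            mask = [row[::-1] for row in mask]
        h, w = len(crop), len(crop[0])
        top = rng.randint(0, height - h)          # random placement position
        left = rng.randint(0, width - w)
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    image[top + y][left + x] = crop[y][x]
                    # A later paste overwrites an earlier one, mimicking occlusion.
                    labels[top + y][left + x] = instance_id
    return image, labels
```

Because the label map is written in the same loop as the pixels, the ground truth cannot drift from the image, which is what makes the approach annotation-free for crowded scenes.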

What carries the argument

Pseudo Community Image synthesis combined with an MAE-pretrained Vision Transformer and Mask2Former decoder.

If this is right

  • Training on synthetic Pseudo Community Images enables effective segmentation with substantially reduced pixel-level annotation of real crowded scenes.
  • MAE pre-training on unlabeled individuals equips the model to capture global structures, improving handling of occlusions and debris.
  • Performance gains appear most pronounced in high-debris real-world conditions compared to Mask R-CNN.
  • The overall pipeline supports precise plankton analysis for ecosystem assessment with lower manual labeling effort.
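
The MAE pre-training referred to in the list above hides most image patches and trains the network to reconstruct them. A minimal sketch of the patch-selection step follows. The 224-pixel image, 16-pixel patch, and 75% mask ratio are the defaults of the original MAE paper and are assumed here; the settings PlankFormer actually uses are not stated on this page, and `random_patch_mask` is a hypothetical helper name.

```python
import random

def random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75, seed=0):
    """Split a square image into patches and pick a random subset to hide.

    Returns (visible, masked): sorted lists of patch indices in row-major order.
    Only the visible patches would be fed to the encoder.
    """
    rng = random.Random(seed)
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_masked = int(num_patches * mask_ratio)

    order = list(range(num_patches))
    rng.shuffle(order)                      # uniform random permutation of patches
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

visible, masked = random_patch_mask()
print(len(visible), len(masked))            # 196 patches: 49 visible, 147 masked
```

With so few visible patches, reconstruction has to rely on the overall shape of the organism, which is the intuition behind the claimed robustness to occlusion.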

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthesis-plus-pretraining strategy could transfer to segmenting other dense microscopic objects, such as cells or bacteria in noisy biological images.
  • Expanding the generative background models might allow rapid adaptation when new plankton species or water conditions appear.
  • Scaled deployment could support automated, continuous tracking of plankton populations to inform water quality and biodiversity studies.

Load-bearing premise

The distribution of the synthesized pseudo community images matches the statistics of real crowded plankton scenes closely enough for performance gains to transfer without major domain shift.

What would settle it

Test the trained PlankFormer on a new collection of real crowded plankton images from a different aquatic site or season with distinct debris types and densities, comparing its accuracy to a Mask R-CNN baseline trained on the same pseudo data.
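
One way such a comparison could be scored is sketched below, under stated assumptions: masks are represented as sets of pixel coordinates, a prediction counts as correct at mask IoU of at least 0.5, and matching is greedy in descending confidence order. The paper's actual evaluation protocol is not given on this page, and the function names are illustrative.

```python
def mask_iou(a, b):
    """IoU between two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def match_at_threshold(predictions, ground_truth, threshold=0.5):
    """Greedy one-to-one matching of predicted masks to ground-truth masks.

    predictions must already be sorted by descending confidence.
    Returns (precision, recall) at the given IoU threshold.
    """
    unmatched = list(range(len(ground_truth)))
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx in unmatched:
            iou = mask_iou(pred, ground_truth[idx])
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= threshold:
            unmatched.remove(best_idx)      # each ground-truth mask is used once
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Running both models through the same function on the new site's images, and splitting the images by debris density, would show whether the reported advantage persists outside the original datasets.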

Figures

Figures reproduced from arXiv: 2604.17856 by Jotaro Urabe, Koichi Ito, Masaharu Miyazaki, Takafumi Aoki, Wataru Makino, Yurie Otake.

Figure 1. Overview of the Pseudo Community Image (PCI) generation process.
Excerpt from the surrounding text: "… images and using them as backgrounds, we can train the model to accurately detect plankton even in environments containing non-plankton objects such as debris. Furthermore, to expand background diversity, we also employ image generative models. Specifically, we generate background images using StyleGAN3 [17] and Denoising Diffusion Probabili…"
Figure 2. Overview of the plankton image segmentation method PlankFormer.
Excerpt from the surrounding text: "3.4 PCI Labeling. Simultaneously with PCI creation, pixel-level ground truth labels for plankton individuals and background regions are automatically generated. The same geometric transformations (flipping, rotation, scaling, and placement position) applied to the individual images are applied to the corresponding individual mask images. By sy…"
Figure 3. Evaluation datasets used in the experiments.
Excerpt from the surrounding text: "5.2 Experimental Conditions. In this experiment, to demonstrate the effectiveness of the proposed method for plankton detection, we compare detection accuracy with conventional instance segmentation methods: Mask R-CNN [13] and Mask2Former [3]. To verify the effectiveness of pre-training using MAE [12] in the proposed method, we conduct comparative experiments un…"
Figure 4. Example of segmentation results (zoomed-in views). Red dashed circles indicate false positives and false negatives.
Excerpt from the surrounding text: "… have been more advantageous than the strong shape priors acquired through pre-training. However, addressing adverse conditions like those in the Biwako dataset is crucial for real-world monitoring. The proposed method demonstrates significant performance gains on the challenging dataset while…"
Original abstract

Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial, however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation with requiring less manual annotations for individual plankton images.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PlankFormer, a framework for plankton instance segmentation that generates labeled Pseudo Community Images (PCI) by synthesizing individual plankton instances onto diverse backgrounds (including generative-model outputs) to mitigate annotation scarcity. It employs a Vision Transformer backbone pretrained via Masked Autoencoder (MAE) on unlabeled individual plankton images, paired with a Mask2Former decoder, to capture global features robust to occlusion and debris. The central claim is that this approach significantly outperforms conventional CNN-based methods such as Mask R-CNN on real-world crowded plankton datasets, particularly under high debris density, while requiring fewer manual annotations.

Significance. If the empirical claims hold with rigorous validation, the work could meaningfully advance automated analysis of aquatic ecosystems by reducing reliance on labor-intensive pixel-level labeling and improving robustness in debris-heavy scenes. The synthesis strategy and MAE pretraining on ViTs represent standard, well-motivated responses to data scarcity and occlusion challenges in ecological imaging; credit is due for applying these techniques to a domain with clear practical need. However, the absence of any quantitative metrics, ablation details, or dataset statistics in the abstract makes it impossible to gauge the magnitude or reliability of the reported gains at present.

major comments (2)
  1. Abstract: the claim of 'significantly outperforms conventional methods, such as Mask R-CNN' is presented without any supporting numbers (mAP, AP50, IoU, dataset sizes, number of images, or statistical tests). This is load-bearing for the central empirical contribution and must be substantiated with concrete results, error bars, and comparison tables in the experimental section.
  2. Method section (PCI generation): the transfer assumption that pasting individual plankton onto backgrounds (including generative ones) produces images whose statistics match real crowded scenes sufficiently for performance gains to transfer is stated but not validated. Domain-shift diagnostics, such as feature distribution comparisons or a controlled real-vs-synthetic ablation, are required to support the claim that annotation reduction does not come at the cost of domain mismatch.
minor comments (2)
  1. Abstract, final sentence: 'with requiring less manual annotations' is grammatically incorrect and should be rephrased (e.g., 'while requiring fewer manual annotations').
  2. The paper should explicitly cite the original Mask2Former and MAE papers when describing the decoder and pretraining components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments correctly identify areas where the presentation of results and validation of our approach can be strengthened. We address each major comment point by point below, indicating the specific revisions we will incorporate.

Point-by-point responses
  1. Referee: Abstract: the claim of 'significantly outperforms conventional methods, such as Mask R-CNN' is presented without any supporting numbers (mAP, AP50, IoU, dataset sizes, number of images, or statistical tests). This is load-bearing for the central empirical contribution and must be substantiated with concrete results, error bars, and comparison tables in the experimental section.

    Authors: We agree that the abstract should include quantitative support for the performance claims. The experimental section of the manuscript already reports mAP, AP50, and IoU metrics comparing PlankFormer against Mask R-CNN and other baselines across real-world plankton datasets, along with dataset statistics (number of images and annotated instances), error bars from multiple training runs, and statistical significance tests. In the revised manuscript, we will update the abstract to explicitly cite these key results (e.g., the observed mAP gains and reduction in required manual annotations) while directing readers to the corresponding tables and figures in the experiments section. revision: yes

  2. Referee: Method section (PCI generation): the transfer assumption that pasting individual plankton onto backgrounds (including generative ones) produces images whose statistics match real crowded scenes sufficiently for performance gains to transfer is stated but not validated. Domain-shift diagnostics, such as feature distribution comparisons or a controlled real-vs-synthetic ablation, are required to support the claim that annotation reduction does not come at the cost of domain mismatch.

    Authors: We acknowledge that direct validation of the domain transfer from PCI to real scenes would strengthen the claims. Our experiments demonstrate that models trained primarily on PCI outperform those trained on limited real annotations when evaluated on held-out real crowded scenes, providing indirect evidence that the synthesis strategy transfers effectively. However, we did not include explicit domain-shift diagnostics such as feature distribution comparisons (e.g., t-SNE or MMD) or a dedicated real-versus-synthetic ablation study. We will add a new subsection to the experiments with these analyses, including feature visualizations and an ablation varying the ratio of synthetic to real training data, to more rigorously support the transfer assumption. revision: yes
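
The MMD diagnostic mentioned in the second response can be sketched as follows. This is an illustration of the general technique, not the analysis the authors promise: it assumes feature vectors have already been extracted from PCI and real images (for example, from the pretrained backbone), uses an RBF kernel with a hand-picked bandwidth, and computes the simple biased estimator.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two sets.

    xs, ys: lists of equal-length feature vectors, e.g. synthetic vs. real.
    Values near 0 suggest similar distributions; larger values suggest shift.
    """
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The result depends strongly on the bandwidth, so in practice it is usually set from the data (a common heuristic is the median pairwise distance) and the statistic is compared against a permutation baseline rather than read in isolation.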

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

Full rationale

The paper proposes PlankFormer using PCI synthesis for data augmentation and MAE-pretrained ViT + Mask2Former for segmentation. The central claim is empirical outperformance on real datasets versus Mask R-CNN, especially in high-debris scenes. No equations, fitted parameters, derivations, or self-citation chains appear in the abstract or described pipeline. The synthesis and pretraining steps are standard techniques whose validity is tested externally via held-out real images rather than reducing to the inputs by construction. This is a typical non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on standard deep-learning assumptions plus two ad-hoc constructions introduced in the paper.

axioms (2)
  • [ad hoc to paper] Synthetic images generated by pasting individual plankton onto backgrounds preserve the statistical properties needed for segmentation transfer.
    Invoked in the PCI generation step described in the abstract.
  • [ad hoc to paper] MAE pretraining on unlabeled individual plankton images learns features robust to occlusion and debris.
    Stated as the motivation for the self-supervised pre-training stage.
invented entities (1)
  • Pseudo Community Image (PCI) [no independent evidence]
    purpose: Augment scarce pixel-level annotated plankton data by synthesizing crowded scenes.
    New data-generation procedure introduced to address scarcity of labeled datasets.



Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

  1. [1]

    Bergum, S., Saad, A., Stahl, A.: Automatic in-situ instance and semantic segmentation of planktonic organisms using Mask R-CNN. OCEANS Conf. pp. 1–8 (Oct 2020)

  2. [2]

    Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. Int. Conf. Comput. Vis. pp. 9620–9629 (Oct 2021)

  3. [3]

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 1280–1289 (Jun 2022)

  4. [4]

    Cheng, K., Cheng, X., Wang, Y., Bi, H., Benfield, M.C.: Enhanced convolutional neural network for plankton identification and enumeration. PLOS ONE 14(7), e0219570–1–17 (Jul 2019)

  5. [5]

    Cowen, R.K., Guigand, C.: In situ ichthyoplankton imaging system (ISIIS): System design and preliminary results. Limnol. Oceanogr.: Methods 6(2), 126–132 (Feb 2008)

  6. [6]

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 248–255 (Jun 2009)

  7. [7]

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. Int. Conf. Learn. Represent. pp. 1–21 (Jan 2021)

  8. [8]

    Fukui, H., Hirakawa, T., Yamashita, T., Fujiyoshi, H.: Attention branch network: Learning of attention mechanism for visual explanation. IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 10705–10714 (Jun 2019)

  9. [9]

    Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 2917–2927 (Jun 2020)

  10. [10]

    Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)

  11. [11]

    Gorsky, G., Ohman, M.D., Picheral, M., Gasparini, S., Stemmann, L., Romagnan, J., Cawood, A., Pesant, S., García-Comas, C., Prejger, F.: Digital zooplankton image analysis using the ZooScan integrated system. J. Plankton Research 32(3), 285–303 (Mar 2010)

  12. [12]

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 16000–16009 (Jun 2022)

  13. [13]

    He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. Int. Conf. Comput. Vis. pp. 2980–2988 (Oct 2017)

  14. [14]

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 770–778 (Jun 2016)

  15. [15]

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Adv. Neural Inform. Process. Syst. pp. 6840–6851 (Dec 2020)

  16. [16]

    Ito, K., Miura, K., Aoki, T., Otake, Y., Makino, W., Urabe, J.: Zooplankton classification using hierarchical attention branch network. Asian Conf. Pattern Recog. pp. 409–419 (Nov 2023)

  17. [17]

    Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T.: Alias-free generative adversarial networks. Adv. Neural Inform. Process. Syst. 34, 852–863 (Dec 2021)

  18. [18]

    Kyathanahally, S.P., Hardeman, T., Merz, E., Bulas, T., Reyes, M., Isles, P., Pomati, F., Baity-Jesi, M.: Deep learning classification of lake zooplankton. Front. Microbiol. 12(746297), 1–13 (Nov 2021)

  19. [19]

    Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. Eur. Conf. Comput. Vis. pp. 740–755 (Sep 2014)

  20. [20]

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. Int. Conf. Learn. Represent. pp. 1–10 (May 2019)

  21. [21]

    Lumini, A., Nanni, L.: Deep learning and transfer learning features for plankton classification. Ecological Informatics 51, 33–43 (May 2019)

  22. [22]

    Luo, J.Y., Irisson, J.O., Graham, B., Guigand, C., Sarafraz, A., Mader, C., Cowen, R.K.: Automated plankton image analysis using convolutional neural networks. Limnology and Oceanography: Methods 16(12), 814–827 (Dec 2018)

  23. [23]

    Otake, Y., Osone, A., Makino, W., Ito, K., Aoki, T., Miura, K., Hayakawa, Y., Yoshida, R., Ichise, S., Tuji, A., Urabe, J.: High-resolution microscopic image dataset of freshwater plankton in Japanese lakes and reservoirs (FREP): I. Zooplankton. Bull. Natl. Mus. Nat. Sci., Ser. B 50(4), 159–164 (Nov 2024)

  24. [24]

    Panaïotis, T., Caray-Counil, L., Woodward, B., Schmid, M.S., Daprano, D., Tsai, S.T., Sullivan, C.M., Cowen, R.K., Irisson, J.O.: Content-aware segmentation of objects spanning a large size range: Application to plankton images. Front. Mar. Sci. 9(870005), 1–16 (Jun 2022)

  25. [25]

    Sosik, H.M., Olson, R.J.: Automated taxonomic classification of phytoplankton sampled with imaging-in-flow cytometry. Limnol. Oceanogr. Methods 5(6), 204–216 (Jun 2007)

  26. [26]

    Suthers, I., Rissik, D., Richardson, A.: Plankton: A guide to their ecology and monitoring for water quality. CSIRO Publishing (2019)

  27. [27]

    Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Adv. Neural Inform. Process. Syst. pp. 5998–6008 (Dec 2017)

  28. [28]

    Yu, K., Zhou, Y., Bai, Y., Soh, Z.D., Xu, X., Goh, R.S.M., Cheng, C., Liu, Y.: UrFound: Towards universal retinal foundation models via knowledge-guided masked modeling. Int’l Conf. Medical Image Computing and Computer Assisted Intervention pp. 753–762 (Oct 2024)