arxiv: 2605.22492 · v1 · pith:EDVUHVILnew · submitted 2026-05-21 · 💻 cs.CV

Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline

Pith reviewed 2026-05-22 06:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained semantic segmentation · training-free methods · low-data regimes · class-agnostic masks · prototype matching · fungi identification · macro-taxonomic prompts

The pith

A training-free two-stage method segments fine-grained fungi by generating class-agnostic masks first then assigning labels via prototype matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that fine-grained semantic segmentation of mushrooms can be achieved without any training by decoupling mask creation from species labeling. Broad taxonomic prompts generate outlines that ignore specific classes, after which a simple transformation of the embedding space supports prototype matching to assign the correct fine-grained label. This approach would matter because it avoids the high cost of per-class prompts or model adaptation in domains with long-tailed distributions and limited examples, where visual similarity between species makes discrimination hard. A sympathetic reader would see value in the reported results spanning one-shot to few-hundred-shot regimes, as they supply an initial scalable baseline that keeps segmentation costs low compared with fully class-specific methods.

Core claim

The central claim is that a training-free two-stage framework decouples segmentation from classification on the FungiTastic dataset. Macro-taxonomic prompts first produce class-agnostic mushroom masks, after which fine-grained labels are assigned through prototype matching in the embedding space following a simple transformation of the feature space. This is shown to be more scalable than class-specific prompting while maintaining low segmentation cost, with performance measured across low-data regimes from single examples to a few hundred shots per class.
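
The label-assignment stage can be illustrated with a minimal sketch. It assumes that each masked region has already been reduced to one embedding vector, that a prototype is the mean of a class's normalized support embeddings, and that matching uses cosine similarity. The abstract specifies none of these details, and the function names, toy vectors, and species labels below are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def build_prototypes(support):
    """Average the normalized support embeddings of each class.

    `support` maps a class label to a list of embedding vectors,
    so a one-shot regime means a single vector per label.
    """
    prototypes = {}
    for label, vectors in support.items():
        normalized = [l2_normalize(v) for v in vectors]
        dim = len(normalized[0])
        mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
        prototypes[label] = l2_normalize(mean)
    return prototypes

def classify(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = l2_normalize(embedding)

    def cosine(proto):
        return sum(q * p for q, p in zip(query, proto))

    return max(prototypes, key=lambda label: cosine(prototypes[label]))

# Toy 3-dimensional embeddings; the species names are illustrative labels only.
support = {
    "Amanita muscaria": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "Boletus edulis": [[0.0, 1.0, 0.2]],
}
prototypes = build_prototypes(support)
predicted = classify([0.8, 0.3, 0.0], prototypes)
print(predicted)
```

Under these assumptions, adding a class amounts to adding one entry to `support`; nothing in the segmentation stage is touched.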

What carries the argument

The two-stage decoupling of segmentation and classification, where macro-taxonomic prompts create class-agnostic masks and prototype matching after feature-space transformation assigns fine-grained labels.

If this is right

  • New fine-grained classes can be added without retraining or expanding the set of segmentation prompts.
  • Segmentation cost remains constant as the number of classes grows because masks are produced independently of specific labels.
  • The method supplies a consistent reference point for performance across data regimes from one example to several hundred.
  • The feature-space transformation step raises the accuracy of prototype-based label assignment relative to the untransformed case.
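
The scaling argument in the second bullet can be made concrete with a toy cost model. It assumes one segmentation pass per image and text prompt, which is a simplification introduced here rather than a figure from the paper; the prompt list and function names are placeholders.

```python
# Hypothetical cost model: one segmentation pass per (image, text prompt) pair.
MACRO_PROMPTS = ["mushroom"]  # placeholder; the actual prompt list is not given in the abstract

def class_specific_passes(num_images: int, num_classes: int) -> int:
    """Class-specific prompting: every image is segmented once per class name."""
    return num_images * num_classes

def macro_passes(num_images: int, num_classes: int) -> int:
    """Macro prompting: the number of classes does not enter the cost."""
    return num_images * len(MACRO_PROMPTS)

# Compare both strategies on 500 images as the label set grows.
for num_classes in (10, 100, 1000):
    print(num_classes, class_specific_passes(500, num_classes), macro_passes(500, num_classes))
```

In this model the class-specific cost grows linearly with the label set while the macro cost stays fixed, which is the behavior the bullet describes.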

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of mask generation from label assignment could reduce labeled-data needs in other long-tailed fine-grained domains such as plant or insect identification from field images.
  • If the transformation step generalizes across embedding models, it might allow quicker adaptation of existing vision representations for segmentation tasks without per-task retraining.
  • Measuring how often macro prompts fail on closely related species pairs would indicate where prompt design or mask refinement needs further attention.

Load-bearing premise

Broad macro-taxonomic prompts will produce accurate enough class-agnostic masks to isolate instances even among visually similar fine-grained species, and a basic unspecified transformation of the embedding space will make prototype matching reliable without task-specific tuning.

What would settle it

A side-by-side evaluation on image pairs of visually close mushroom species showing that the generated masks frequently merge or miss boundaries, causing the subsequent prototype matching to assign wrong labels at rates far above the reported baseline.
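
One way such an evaluation could be scored is with mask IoU aggregated per taxonomic group. The sketch below is an assumption about how the check might be set up, not the paper's protocol; masks are nested lists of 0/1 values, and the genus labels and scores are illustrative.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks (nested 0/1 lists)."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention used here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

def mean_iou_by_group(records):
    """Average IoU per group from (group_label, iou) pairs."""
    sums = {}
    for group, value in records:
        total, count = sums.get(group, (0.0, 0))
        sums[group] = (total + value, count + 1)
    return {group: total / count for group, (total, count) in sums.items()}

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 1, 0],
          [0, 0, 1]]
iou = mask_iou(pred, target)  # 2 shared pixels out of 4 covered pixels

per_genus = mean_iou_by_group([("Amanita", 0.5), ("Amanita", 1.0), ("Boletus", 0.25)])
print(iou, per_genus)
```

A low per-genus mean on visually close species would point at the mask stage rather than the prototype stage as the source of label errors.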

Figures

Figures reproduced from arXiv: 2605.22492 by Francesco Pelosin, Lapo Faggi, Sebastian Cavada.

Figure 1. Performance across low-data regimes on FungiTastic. Top: mean class accuracy (left) and mean IoU (right) as the number …
Figure 2. Macro-to-fine pipeline: DINOv3 extracts features to match class prototypes for label prediction, while SAM3 provides class …

Original abstract

Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage framework that decouples segmentation from classification. SAM3 first produces class-agnostic mushroom masks using macro-taxonomic prompts, and DINOv3 then assigns fine-grained labels through prototype matching in the embedding space. To improve this stage, we apply a simple transformation of the DINOv3 feature space that improves prototype-based classification. Compared with class-specific prompting, our approach is more scalable and keeps the segmentation cost low. We report results from one-shot to few-hundred-shot regimes, providing, to the best of our knowledge, the first baseline for fine-grained semantic segmentation in low-data settings.
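
The abstract does not say how a class-agnostic mask and the dense features are combined into a single vector for prototype matching. A common choice, assumed here purely for illustration, is masked average pooling over patch features; the grid layout, function name, and toy values are hypothetical and are not taken from the paper.

```python
def masked_average_pool(patch_features, patch_mask):
    """Average the feature vectors of patches that fall inside the mask.

    patch_features: H x W grid (nested lists) of D-dimensional vectors.
    patch_mask:     H x W grid of 0/1 flags at the same patch resolution.
    Returns None when the mask selects no patch.
    """
    selected = [
        patch_features[r][c]
        for r in range(len(patch_mask))
        for c in range(len(patch_mask[r]))
        if patch_mask[r][c]
    ]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]

# A 2 x 2 grid of 2-dimensional patch features and a mask covering three patches.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [9.0, 9.0]],
]
mask = [
    [1, 1],
    [1, 0],
]
pooled = masked_average_pool(features, mask)  # mean of the three selected patches
print(pooled)
```

The pooled vector is what a prototype-matching step would then compare against class prototypes.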

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a training-free two-stage framework for fine-grained semantic segmentation of fungi images under long-tailed distributions and variable acquisition conditions. SAM3 generates class-agnostic instance masks from macro-taxonomic (genus/family) prompts; DINOv3 then assigns fine-grained species labels via prototype matching after an unspecified transformation of the embedding space. Results are reported across one-shot to few-hundred-shot regimes and positioned as a scalable baseline relative to class-specific prompting.

Significance. If the performance claims can be substantiated with quantitative evidence, the work would supply a practical, training-free baseline for fine-grained segmentation in low-data, high-similarity domains. The explicit decoupling of mask generation from label assignment and the emphasis on macro-level prompting are conceptually attractive for scalability, but the absence of supporting metrics leaves the practical utility unverified.

major comments (3)
  1. [Abstract] The central claim that macro-taxonomic prompts to SAM3 yield masks sufficiently accurate for subsequent fine-grained prototype matching is load-bearing yet unsupported by any mask-quality metrics (e.g., boundary F-score, per-genus IoU, or over-/under-segmentation rates). Without these, it is impossible to determine whether reported low-shot gains arise from the embedding transformation or from mask errors that happen to be correctable by DINOv3.
  2. [Abstract] The 'simple transformation' of the DINOv3 feature space is described only qualitatively and without an ablation, quantitative improvement figures, or pseudocode. Because this step is presented as the key enabler of reliable prototype matching, its omission prevents assessment of whether the method is genuinely training-free or merely defers adaptation to an ad-hoc preprocessing choice.
  3. [Results] One-shot to few-hundred-shot regimes: no error bars, per-class breakdowns, or comparison tables against class-specific prompting baselines are supplied. The absence of these data makes the scalability claim and the cross-regime performance statements unverifiable.
minor comments (1)
  1. [Abstract] The manuscript would benefit from an explicit statement of the exact prompting template used with SAM3 and the precise form of the DINOv3 transformation (e.g., centering, whitening, or learned affine map).
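
To make the candidate forms concrete, the sketch below shows centering followed by per-dimension rescaling, a diagonal simplification of whitening that ignores correlations between dimensions. It is not the paper's transformation, which remains unspecified; the choice of reference embeddings for estimating the statistics, the function names, and the toy values are all assumptions.

```python
import math

def fit_center_scale(reference, eps=1e-8):
    """Estimate per-dimension mean and standard deviation from reference embeddings."""
    n, dim = len(reference), len(reference[0])
    mean = [sum(v[i] for v in reference) / n for i in range(dim)]
    std = [
        math.sqrt(sum((v[i] - mean[i]) ** 2 for v in reference) / n) + eps
        for i in range(dim)
    ]
    return mean, std

def transform(vec, mean, std):
    """Center, rescale each dimension, then L2-normalize for cosine matching."""
    scaled = [(x - m) / s for x, m, s in zip(vec, mean, std)]
    norm = math.sqrt(sum(x * x for x in scaled))
    return [x / norm for x in scaled] if norm > 0 else scaled

# The first dimension has a much larger range than the second one.
reference = [[10.0, 0.1], [12.0, 0.3], [11.0, 0.2]]
mean, std = fit_center_scale(reference)
out = transform([12.0, 0.3], mean, std)
print(out)
```

After the transform both dimensions contribute on the same scale, so a large shared offset no longer dominates cosine similarity. Full whitening would additionally decorrelate the dimensions using the covariance matrix.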

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the work.

  1. Referee: [Abstract] The central claim that macro-taxonomic prompts to SAM3 yield masks sufficiently accurate for subsequent fine-grained prototype matching is load-bearing yet unsupported by any mask-quality metrics (e.g., boundary F-score, per-genus IoU, or over-/under-segmentation rates). Without these, it is impossible to determine whether reported low-shot gains arise from the embedding transformation or from mask errors that happen to be correctable by DINOv3.

    Authors: We agree that mask-quality metrics are necessary to isolate the contributions of each stage. In the revised manuscript we will report boundary F-score, per-genus IoU, and over-/under-segmentation statistics for the SAM3 masks produced with macro-taxonomic prompts. These metrics will be computed on a held-out validation subset and presented alongside the existing segmentation results.
    Revision: yes

  2. Referee: [Abstract] The 'simple transformation' of the DINOv3 feature space is described only qualitatively and without an ablation, quantitative improvement figures, or pseudocode. Because this step is presented as the key enabler of reliable prototype matching, its omission prevents assessment of whether the method is genuinely training-free or merely defers adaptation to an ad-hoc preprocessing choice.

    Authors: We acknowledge the description is currently qualitative. The transformation is a fixed, deterministic preprocessing step (no parameters are learned from the target fungi data) that improves prototype separability while preserving the training-free character of the pipeline. In revision we will supply the exact formulation, pseudocode, an ablation table quantifying the accuracy gain, and explicit confirmation that the step requires no training or fine-tuning on the evaluation regimes.
    Revision: yes

  3. Referee: [Results] One-shot to few-hundred-shot regimes: no error bars, per-class breakdowns, or comparison tables against class-specific prompting baselines are supplied. The absence of these data makes the scalability claim and the cross-regime performance statements unverifiable.

    Authors: We agree that the reported results would be more verifiable with additional statistical detail. The revised results section will include error bars (standard deviation over multiple random seeds), per-class performance tables, and a side-by-side comparison against class-specific prompting baselines across the one-shot to few-hundred-shot regimes to support the scalability claims.
    Revision: yes
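
The statistical detail promised in the third response could be computed along the following lines. This is a sketch under assumptions: mean class accuracy is taken to be the unweighted average of per-class recall, error bars are the sample standard deviation across seeds, and the labels and scores are made-up placeholders rather than results from the paper.

```python
import statistics

def mean_class_accuracy(y_true, y_pred):
    """Average of per-class recalls, so rare classes weigh as much as common ones."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)

def summarize_over_seeds(scores):
    """Mean and sample standard deviation of one metric across random seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder labels: class "a" has three samples, class "b" has one.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
mca = mean_class_accuracy(y_true, y_pred)  # (2/3 + 1/1) / 2

# Placeholder per-seed scores for a single shot setting.
mean, std = summarize_over_seeds([0.5, 0.7])
print(mca, mean, std)
```

Averaging per class rather than per sample matters on a long-tailed label set, where plain accuracy would be dominated by the most frequent species.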

Circularity Check

0 steps flagged

No circularity: procedural combination of pre-trained models

The paper presents a training-free two-stage pipeline that applies existing models (SAM3 with macro-taxonomic prompts for class-agnostic masks, followed by DINOv3 prototype matching on a simple feature-space transformation) without any equations, derivations, parameter fitting, or self-referential reductions. The central claims rest on the empirical behavior of off-the-shelf pre-trained components rather than on any quantity defined in terms of itself or justified solely by prior self-citation. No load-bearing step reduces by construction to its inputs, satisfying the self-contained criterion against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the off-the-shelf performance of SAM3 and DINOv3 on fungi imagery plus the effectiveness of an unspecified feature transformation; no new entities are postulated and no parameters appear to be fitted to the target data.

axioms (2)
  • Domain assumption: Pre-trained SAM3 responds reliably to macro-taxonomic prompts on the FungiTastic dataset to produce accurate class-agnostic masks.
    Invoked in the description of the first stage; no evidence or validation of prompt robustness is supplied in the abstract.
  • Domain assumption: DINOv3 embeddings, after a simple transformation, support accurate prototype matching for fine-grained discrimination under long-tailed distributions.
    Central to the second stage; the transformation itself is not formalized or justified beyond the claim that it 'improves' performance.

pith-pipeline@v0.9.0 · 5676 in / 1575 out tokens · 48928 ms · 2026-05-22T06:25:54.153102+00:00 · methodology
