PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation
Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3
The pith
PASTA segments targets and anomalies using ViT feature distributions and text-prompted zero-shot masks with only image-level labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PASTA identifies Target and Anomaly objects by comparing an observed scene with a nominal reference and analyzing the resulting distributions in self-supervised Vision Transformer (ViT) feature spaces. The pipeline then uses semantic text prompts with the Segment Anything Model 3 (SAM 3) to guide zero-shot object segmentation. On a custom steel scrap recycling dataset and a plant dataset, the paper reports a 75.8 percent reduction in training time relative to domain-specific baselines, together with superior segmentation performance: up to 88.3 percent IoU for Targets and up to 63.5 percent IoU for Anomalies.
What carries the argument
Patch Aggregation for Segmentation of Targets and Anomalies (PASTA), which aggregates patches from self-supervised ViT features for distribution comparison against nominal references to flag deviations.
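One way to read this distribution comparison is as a comparison of cluster occupancy between two patch sets. The sketch below assumes every patch has already been assigned a cluster id (the paper passages quoted on this page mention Mini-Batch K-Means) and flags clusters that are prominent in a mixed target-plus-anomaly set but suppressed in a clean target set. The function name, the `min_share` and `ratio_threshold` parameters, and the toy labels are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def flag_suppressed_clusters(mixed_labels, target_labels,
                             min_share=0.05, ratio_threshold=0.2):
    """Return cluster ids that are prominent in the mixed set but rare in the target set.

    mixed_labels / target_labels: one cluster id per patch (hypothetical inputs).
    min_share: minimum fraction of mixed-set patches for a cluster to count as prominent.
    ratio_threshold: a cluster is flagged when its share of the target set is
        below ratio_threshold times its share of the mixed set.
    """
    if not mixed_labels or not target_labels:
        raise ValueError("both label lists must be non-empty")
    mixed_counts = Counter(mixed_labels)
    target_counts = Counter(target_labels)
    n_mixed, n_target = len(mixed_labels), len(target_labels)
    flagged = []
    for cluster_id, count in sorted(mixed_counts.items()):
        mixed_share = count / n_mixed
        target_share = target_counts.get(cluster_id, 0) / n_target
        if mixed_share >= min_share and target_share < ratio_threshold * mixed_share:
            flagged.append(cluster_id)
    return flagged


# Toy example (made-up labels): cluster 2 occurs only in the mixed set.
mixed = [0, 0, 1, 1, 2, 2, 2, 0, 1, 2]
target = [0, 0, 1, 1, 0, 1, 0, 1]
anomaly_clusters = flag_suppressed_clusters(mixed, target)
print(anomaly_clusters)
```

Both thresholds are free parameters in this sketch; the page does not state which values, if any, the paper uses.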
Load-bearing premise
Distribution analysis in self-supervised ViT feature spaces reliably separates targets and anomalies from a nominal reference, and text-prompted SAM produces accurate zero-shot segmentations without domain-specific fine-tuning.
What would settle it
If PASTA were run on a new dataset with strong lighting or viewpoint variation and anomaly IoU fell below 50 percent, that would indicate the feature-space separation does not hold under such shifts.
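The check described here comes down to computing IoU on predicted versus ground-truth anomaly masks and comparing it with the 50 percent mark. The following is a minimal sketch under stated assumptions: masks are nested lists of 0/1, the criterion is applied to the mean IoU over a dataset (the page does not say whether mean or per-image IoU is meant), and the helper names are hypothetical.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def separation_fails(anomaly_ious, threshold=0.5):
    """True when the mean anomaly IoU over a dataset falls below the threshold."""
    mean_iou = sum(anomaly_ious) / len(anomaly_ious)
    return mean_iou < threshold


# Toy 2x3 masks (made-up values): 2 overlapping pixels out of 4 in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
true = [[1, 0, 0],
        [0, 1, 1]]
score = iou(pred, true)
print(score, separation_fails([score]))
```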
Original abstract
Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% training time reduction of our approach to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PASTA, a weakly supervised pipeline for target and anomaly segmentation that performs distribution analysis on patches from a self-supervised ViT feature space (compared against a nominal reference) to identify objects of interest, then uses SAM with semantic text prompts for zero-shot mask generation. It evaluates the method on a custom steel-scrap recycling dataset and a plant dataset, claiming up to 88.3% IoU for targets, 63.5% IoU for anomalies, domain-agnostic operation, and a 75.8% reduction in training time relative to domain-specific baselines.
Significance. If the core distribution-analysis step reliably separates nominal, target, and anomaly patches without domain-specific fine-tuning, the approach could meaningfully reduce annotation burden for real-time industrial and agricultural perception tasks. However, the absence of separability diagnostics or ablations leaves open the possibility that reported gains are driven primarily by SAM rather than the proposed ViT aggregation logic, limiting the strength of the domain-agnostic and weakly-supervised claims.
major comments (3)
- [Method] Method section (distribution analysis step): No quantitative separability metrics (e.g., inter- vs. intra-class distances, silhouette scores, or embedding visualizations) are reported for the self-supervised ViT patch features on the steel-scrap or plant data. Without such evidence, it is impossible to verify that the claimed IoU numbers arise from the PASTA aggregation rather than from SAM's zero-shot masks alone.
- [Experiments] Experimental Results section: The abstract and results report 88.3% target IoU, 63.5% anomaly IoU, and 75.8% training-time reduction, yet supply no description of the exact baseline architectures, training protocols, statistical significance tests, or cross-validation procedure. This omission prevents assessment of whether the performance margins are robust or dataset-specific.
- [Experiments] Ablation studies: No ablation is presented that removes or replaces the distribution-analysis component while keeping SAM fixed. Such an experiment is required to substantiate the novelty of the patch-aggregation logic over a pure SAM + text-prompt baseline.
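The separability diagnostic requested in the first major comment can be illustrated with a silhouette score over labelled patch embeddings. The sketch below is a plain-Python implementation using Euclidean distance; the 2-D toy points stand in for ViT patch features and the labels are made up. For real embeddings, a library routine such as scikit-learn's `sklearn.metrics.silhouette_score` would be the more practical choice.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (O(n^2), for small inputs)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    n = len(points)
    total = 0.0
    for i in range(n):
        # Accumulate distances from point i to all other points, grouped by cluster.
        dist_sum = {c: 0.0 for c in clusters}
        count = {c: 0 for c in clusters}
        for j in range(n):
            if i == j:
                continue
            dist_sum[labels[j]] += math.dist(points[i], points[j])
            count[labels[j]] += 1
        own = labels[i]
        if count[own] == 0:
            continue  # singleton cluster: silhouette taken as 0 by convention
        a = dist_sum[own] / count[own]  # mean intra-cluster distance
        b = min(dist_sum[c] / count[c] for c in clusters if c != own)  # nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Toy embeddings: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated = silhouette_score(points, [0, 0, 1, 1])
scrambled = silhouette_score(points, [0, 1, 0, 1])
print(separated, scrambled)
```

A score near 1 indicates compact, well-separated groups, while a score at or below 0 indicates that the labelling does not match the geometry of the embedding space.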
minor comments (2)
- [Method] Notation for the nominal reference embedding and the distance threshold used in distribution analysis is introduced without an explicit equation or pseudocode block, making the pipeline difficult to re-implement precisely.
- [Figures] Figure captions for the qualitative results do not indicate whether the shown masks are produced by PASTA or by the SAM component alone.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects for improving the clarity and rigor of our work. We address each major comment point by point below and commit to the indicated revisions.
- Referee: [Method] Method section (distribution analysis step): No quantitative separability metrics (e.g., inter- vs. intra-class distances, silhouette scores, or embedding visualizations) are reported for the self-supervised ViT patch features on the steel-scrap or plant data. Without such evidence, it is impossible to verify that the claimed IoU numbers arise from the PASTA aggregation rather than from SAM's zero-shot masks alone.
  Authors: We agree that the original submission omitted these diagnostics. In the revised manuscript we will add quantitative separability analysis for the self-supervised ViT patch features on both datasets, including inter- versus intra-class Euclidean distances, silhouette scores, and t-SNE visualizations of nominal, target, and anomaly patch embeddings. These additions will directly demonstrate that the distribution-analysis step produces separable clusters prior to SAM mask generation.
  revision: yes
- Referee: [Experiments] Experimental Results section: The abstract and results report 88.3% target IoU, 63.5% anomaly IoU, and 75.8% training-time reduction, yet supply no description of the exact baseline architectures, training protocols, statistical significance tests, or cross-validation procedure. This omission prevents assessment of whether the performance margins are robust or dataset-specific.
  Authors: We acknowledge the insufficient experimental detail. The revised Experimental Results section will explicitly describe the baseline architectures (including layer counts, loss functions, and optimization settings for the domain-specific models), the precise training protocols and hyper-parameters, the cross-validation procedure (k-fold splits with seed reporting), and statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) computed over multiple runs. This will allow readers to evaluate the robustness of the reported margins.
  revision: yes
- Referee: [Experiments] Ablation studies: No ablation is presented that removes or replaces the distribution-analysis component while keeping SAM fixed. Such an experiment is required to substantiate the novelty of the patch-aggregation logic over a pure SAM + text-prompt baseline.
  Authors: We accept that an explicit ablation isolating the distribution-analysis component is necessary. In the revision we will add a controlled ablation that applies SAM with identical semantic text prompts directly to the full images, bypassing ViT patch selection and distribution comparison. We will report target and anomaly IoU for this SAM-only variant on both datasets and compare it against the full PASTA pipeline, thereby quantifying the contribution of the proposed aggregation logic.
  revision: yes
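The paired comparison promised in the responses (full pipeline versus a SAM-only variant, evaluated on the same folds or runs) can be sketched as a paired t statistic over per-fold IoU differences. The IoU values below are invented purely for illustration and do not come from the paper; the function name is hypothetical. The sketch stops at the statistic and its degrees of freedom, since a p-value requires the t distribution, for which a routine such as `scipy.stats.ttest_rel` could be used.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-fold score differences."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 entries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up per-fold IoU values for two pipeline variants (not results from the paper).
full_pipeline = [0.80, 0.82, 0.78, 0.81, 0.79]
sam_only = [0.70, 0.75, 0.72, 0.74, 0.69]
t_stat, dof = paired_t_statistic(full_pipeline, sam_only)
print(t_stat, dof)
```

Pairing by fold matters here: both variants see the same images, so the test operates on the per-fold differences rather than on two independent samples.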
Circularity Check
No circularity in derivation chain; claims are empirical
The paper describes a proposed pipeline (PASTA) that performs distribution analysis on self-supervised ViT patch embeddings to identify targets/anomalies relative to a nominal reference, followed by SAM zero-shot segmentation guided by text prompts. No equations, parameter-fitting steps, or derivation chain appear in the abstract or described method that reduce by construction to the inputs (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations of uniqueness theorems). Performance numbers (IoU, training-time reduction) are presented as results of empirical evaluation on custom datasets rather than algebraic identities, rendering the central claims self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Self-supervised ViT features produce distributions that differ meaningfully between nominal and anomalous scenes.
- domain assumption: SAM produces accurate object masks from semantic text prompts in the target domains.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Anomalies are identified based on the principle of missing features. Clusters prominent in the mixed baseline set P_{T∪A} but significantly suppressed in the clean target set P_T are flagged.
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: We employ Mini-Batch K-Means clustering on F_{T∪A}.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.