
arxiv: 2604.09701 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.LG

PASTA: Vision Transformer Patch Aggregation for Weakly Supervised Target and Anomaly Segmentation

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords weakly supervised segmentation · vision transformer · anomaly detection · target segmentation · segment anything model · industrial vision · agricultural automation · zero-shot segmentation

The pith

PASTA segments targets and anomalies using ViT feature distributions and text-prompted zero-shot masks with only image-level labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PASTA as a pipeline that detects and segments unseen objects by comparing new scenes against a nominal reference through analysis of self-supervised Vision Transformer features. It then refines boundaries with the Segment Anything Model, driven by semantic text prompts, which removes the need for pixel-level annotations. This matters for applications such as steel recycling and plant monitoring because the method cuts training time by 75.8 percent relative to domain-specific baselines while reaching up to 88.3 percent IoU on targets and 63.5 percent IoU on anomalies across the tested datasets. A sympathetic reader would see the value in a domain-agnostic system that adapts to unstructured environments without exhaustive data labeling.

Core claim

PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer feature spaces by comparing an observed scene with a nominal reference. The pipeline then uses semantic text prompts with the Segment Anything Model 3 (SAM 3) to guide zero-shot object segmentation. On a custom steel scrap recycling dataset and a plant dataset, the method reduces training time by 75.8 percent relative to domain-specific baselines while delivering superior segmentation: up to 88.3 percent IoU on Targets and up to 63.5 percent IoU on Anomalies.
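To make the training-time figure concrete: assuming the 75.8 percent reduction is measured as a fraction of the baseline's training time (this summary does not spell out the exact definition), the implied time ratio and speed-up follow directly. The symbols T_PASTA and T_base are introduced here purely for illustration.

```latex
\[
r = 1 - \frac{T_{\mathrm{PASTA}}}{T_{\mathrm{base}}} = 0.758
\quad\Longrightarrow\quad
\frac{T_{\mathrm{PASTA}}}{T_{\mathrm{base}}} = 0.242,
\qquad
\frac{T_{\mathrm{base}}}{T_{\mathrm{PASTA}}} = \frac{1}{0.242} \approx 4.13 .
\]
```

Under that reading, training takes roughly a quarter of the baseline time, or about a fourfold speed-up.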

What carries the argument

Patch Aggregation for Segmentation of Targets and Anomalies (PASTA), which aggregates patches from self-supervised ViT features for distribution comparison against nominal references to flag deviations.
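As a rough illustration of what comparing an observed scene against a nominal reference in feature space can look like, the sketch below scores each observed patch by its distance to the nearest nominal patch and flags patches above a threshold. This is a stand-in nearest-neighbour rule, not the paper's aggregation method; the toy 2-D vectors, the function names `nearest_nominal_distance` and `flag_deviating_patches`, and the threshold value are all hypothetical.

```python
import math


def nearest_nominal_distance(patch, nominal_patches):
    """Euclidean distance from one patch feature to its closest nominal feature."""
    return min(math.dist(patch, ref) for ref in nominal_patches)


def flag_deviating_patches(observed, nominal_patches, threshold):
    """Indices of observed patches farther than `threshold` from every nominal patch."""
    return [
        i
        for i, patch in enumerate(observed)
        if nearest_nominal_distance(patch, nominal_patches) > threshold
    ]


# Toy 2-D "features" (assumption); real ViT patch embeddings are high-dimensional.
nominal = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
observed = [(0.05, 0.05), (1.0, 1.0), (0.1, 0.1)]

# Hypothetical threshold; in practice it would need tuning per dataset.
flagged = flag_deviating_patches(observed, nominal, threshold=0.5)
print(flagged)  # [1]
```

Only the second observed patch lies far from every nominal patch, so it alone is flagged. Whether real ViT features separate this cleanly is exactly the premise questioned in the next section.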

Load-bearing premise

Distribution analysis in self-supervised ViT feature spaces reliably separates targets and anomalies from a nominal reference, and text-prompted SAM produces accurate zero-shot segmentations without domain-specific fine-tuning.

What would settle it

If PASTA were run on a new dataset with strong lighting or viewpoint variation and anomaly IoU fell below 50 percent, that would indicate the feature-space separation does not hold.
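For reference, the IoU (intersection over union) metric behind these thresholds can be computed for a pair of binary masks as in the minimal sketch below. The nested-list mask format, the function name `iou`, and the empty-mask convention are assumptions made for the example, not details from the paper.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

score = iou(pred, truth)
print(f"{score:.2f}")  # 0.50
```

Two of the four pixels covered by either mask are covered by both, which gives an IoU of 0.50.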

Figures

Figures reproduced from arXiv: 2604.09701 by Christian Rauch, Elmar Rueckert, Melanie Neubauer.

Figure 1: Anomaly detection via distribution analysis. Left: Patch-Level Cluster Map on a SteelDS example, highlighting the …
Figure 3: Dataset comparison for SteelDS and PhenoBench.
Figure 4: Reconstructed train and inference pipeline. Training …
Figure 5: Architecture of PASTA. The training phase (top) …
Figure 6: Backbone Robustness: Impact of hypersphere density …
Figure 7: Parameter Sensitivity Heatmaps illustrating Anomaly …
Figure 8: Qualitative segmentation results. Top: original image, …
Original abstract

Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real-time processing, pixel-level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a weakly supervised pipeline for object segmentation and classification using weak image-level supervision called 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self-supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text-prompts via the Segment Anything Model 3 to guide zero-shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate a 75.8% training time reduction of our approach to domain-specific baselines. While being domain-agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PASTA, a weakly supervised pipeline for target and anomaly segmentation that performs distribution analysis on patches from a self-supervised ViT feature space (compared against a nominal reference) to identify objects of interest, then uses SAM with semantic text prompts for zero-shot mask generation. It evaluates the method on a custom steel-scrap recycling dataset and a plant dataset, claiming up to 88.3% IoU for targets, 63.5% IoU for anomalies, domain-agnostic operation, and a 75.8% reduction in training time relative to domain-specific baselines.

Significance. If the core distribution-analysis step reliably separates nominal, target, and anomaly patches without domain-specific fine-tuning, the approach could meaningfully reduce annotation burden for real-time industrial and agricultural perception tasks. However, the absence of separability diagnostics or ablations leaves open the possibility that reported gains are driven primarily by SAM rather than the proposed ViT aggregation logic, limiting the strength of the domain-agnostic and weakly-supervised claims.

major comments (3)
  1. [Method] Method section (distribution analysis step): No quantitative separability metrics (e.g., inter- vs. intra-class distances, silhouette scores, or embedding visualizations) are reported for the self-supervised ViT patch features on the steel-scrap or plant data. Without such evidence, it is impossible to verify that the claimed IoU numbers arise from the PASTA aggregation rather than from SAM's zero-shot masks alone.
  2. [Experiments] Experimental Results section: The abstract and results report 88.3% target IoU, 63.5% anomaly IoU, and 75.8% training-time reduction, yet supply no description of the exact baseline architectures, training protocols, statistical significance tests, or cross-validation procedure. This omission prevents assessment of whether the performance margins are robust or dataset-specific.
  3. [Experiments] Ablation studies: No ablation is presented that removes or replaces the distribution-analysis component while keeping SAM fixed. Such an experiment is required to substantiate the novelty of the patch-aggregation logic over a pure SAM + text-prompt baseline.
minor comments (2)
  1. [Method] Notation for the nominal reference embedding and the distance threshold used in distribution analysis is introduced without an explicit equation or pseudocode block, making the pipeline difficult to re-implement precisely.
  2. [Figures] Figure captions for the qualitative results do not indicate whether the shown masks are produced by PASTA or by the SAM component alone.
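The silhouette score requested in major comment 1 can be sketched in a few lines. The version below is a plain-Python illustration using Euclidean distance on toy 2-D points; the function name `mean_silhouette`, the data, and the labels are hypothetical, and in practice a library routine such as scikit-learn's `sklearn.metrics.silhouette_score` would normally be used on the real patch embeddings.

```python
import math
from collections import defaultdict


def mean_silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean); needs at least two distinct labels."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum)
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D "embeddings" (assumption): two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
separated = mean_silhouette(points, ["nominal", "nominal", "anomaly", "anomaly"])
mixed = mean_silhouette(points, ["nominal", "anomaly", "nominal", "anomaly"])
print(separated > 0.8, mixed < 0.0)  # True True
```

A labelling that matches the geometry scores close to 1, while a labelling that cuts across the groups scores below 0. That contrast is what such a diagnostic would be expected to show if the ViT features really separate nominal, target, and anomaly patches.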

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for improving the clarity and rigor of our work. We address each major comment point by point below and commit to the indicated revisions.

  1. Referee: [Method] Method section (distribution analysis step): No quantitative separability metrics (e.g., inter- vs. intra-class distances, silhouette scores, or embedding visualizations) are reported for the self-supervised ViT patch features on the steel-scrap or plant data. Without such evidence, it is impossible to verify that the claimed IoU numbers arise from the PASTA aggregation rather than from SAM's zero-shot masks alone.

    Authors: We agree that the original submission omitted these diagnostics. In the revised manuscript we will add quantitative separability analysis for the self-supervised ViT patch features on both datasets, including inter- versus intra-class Euclidean distances, silhouette scores, and t-SNE visualizations of nominal, target, and anomaly patch embeddings. These additions will directly demonstrate that the distribution-analysis step produces separable clusters prior to SAM mask generation. revision: yes

  2. Referee: [Experiments] Experimental Results section: The abstract and results report 88.3% target IoU, 63.5% anomaly IoU, and 75.8% training-time reduction, yet supply no description of the exact baseline architectures, training protocols, statistical significance tests, or cross-validation procedure. This omission prevents assessment of whether the performance margins are robust or dataset-specific.

    Authors: We acknowledge the insufficient experimental detail. The revised Experimental Results section will explicitly describe the baseline architectures (including layer counts, loss functions, and optimization settings for the domain-specific models), the precise training protocols and hyper-parameters, the cross-validation procedure (k-fold splits with seed reporting), and statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) computed over multiple runs. This will allow readers to evaluate the robustness of the reported margins. revision: yes

  3. Referee: [Experiments] Ablation studies: No ablation is presented that removes or replaces the distribution-analysis component while keeping SAM fixed. Such an experiment is required to substantiate the novelty of the patch-aggregation logic over a pure SAM + text-prompt baseline.

    Authors: We accept that an explicit ablation isolating the distribution-analysis component is necessary. In the revision we will add a controlled ablation that applies SAM with identical semantic text prompts directly to the full images, bypassing ViT patch selection and distribution comparison. We will report target and anomaly IoU for this SAM-only variant on both datasets and compare it against the full PASTA pipeline, thereby quantifying the contribution of the proposed aggregation logic. revision: yes
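One way the promised comparison between the full pipeline and a SAM-only variant could be tested on paired per-image IoU scores is sketched below. It uses an exact sign-flip permutation test as a simple stand-in for the paired t-test and Wilcoxon signed-rank test named in the rebuttal; the function name `paired_sign_flip_pvalue` and all IoU values are invented for illustration and are not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation p-value for paired scores.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so all 2**n sign assignments are enumerated (feasible only for small n).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total


# Hypothetical per-image IoU values, NOT taken from the paper.
full_pipeline = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88]
sam_only = [0.60, 0.65, 0.70, 0.80, 0.55, 0.75]

p_value = paired_sign_flip_pvalue(full_pipeline, sam_only)
print(p_value)  # 0.03125
```

Because every paired difference in this toy data favours the full pipeline, only the all-positive and all-negative sign assignments are as extreme as the observed one, giving 2/64. With real data the number of images would be far larger and a sampled permutation test or a library routine would replace the exhaustive enumeration.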

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical


The paper describes a proposed pipeline (PASTA) that performs distribution analysis on self-supervised ViT patch embeddings to identify targets/anomalies relative to a nominal reference, followed by SAM zero-shot segmentation guided by text prompts. No equations, parameter-fitting steps, or derivation chain appear in the abstract or described method that reduce by construction to the inputs (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations of uniqueness theorems). Performance numbers (IoU, training-time reduction) are presented as results of empirical evaluation on custom datasets rather than algebraic identities, rendering the central claims self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard computer-vision assumptions about feature separability and zero-shot model behavior rather than new invented entities or fitted constants.

axioms (2)
  • [domain assumption] Self-supervised ViT features produce distributions that differ meaningfully between nominal and anomalous scenes.
    Invoked in the core identification step of comparing observed and reference scenes.
  • [domain assumption] SAM produces accurate object masks from semantic text prompts in the target domains.
    Required for the zero-shot segmentation stage.



