arxiv: 2606.10696 · v1 · pith:MKNZ2HCHnew · submitted 2026-06-09 · 💻 cs.CV

Don't waste SAM

Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords waste segmentation · Segment Anything Model · fine-tuning · SAM · IoU · Zerowaste · TACO · TrashCan

The pith

Fine-tuning the Segment Anything Model improves waste image segmentation accuracy on real-world datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how well the Segment Anything Model, originally trained for general image segmentation, can be adapted to the specific task of segmenting waste objects in cluttered scenes. It shows that fine-tuning SAM's largest variant on three public waste datasets produces higher intersection-over-union scores than previous specialized methods on two of them. The authors conclude that instead of ignoring SAM for these applications, researchers should fine-tune it as a starting point to handle challenges like transparent objects and background confusion. This matters because waste segmentation supports recycling and environmental monitoring, where accurate object boundaries are needed but expert annotation is expensive. If the finding holds, general foundation models become practical tools for niche computer vision tasks after modest adaptation.

Core claim

The fine-tuned SAM-ViT-H model outperforms the state-of-the-art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches the state-of-the-art performance on TrashCan 1.0, trailing by only 1.44 IoU. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

What carries the argument

The fine-tuned SAM-ViT-H model, which adapts the pre-trained Vision Transformer Huge backbone of the Segment Anything Model to waste segmentation data.

Load-bearing premise

The reported performance gains assume that the fine-tuning experiments used the same test data splits and evaluation protocols as the prior state-of-the-art methods being compared.

What would settle it

Re-running the fine-tuning and evaluation on the exact same test splits of Zerowaste and TACO using the published code and hyperparameters of the previous SOTA methods to check if the +30 IoU advantage disappears.
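Before any re-run, the "exact same test splits" condition can be checked mechanically. The following is a minimal sketch, assuming each evaluation exposes its test split as a list of image identifiers; the function names and the image IDs are hypothetical and do not come from the paper or from any dataset release.

```python
import hashlib


def split_fingerprint(image_ids):
    """Return an order-independent SHA-256 fingerprint of a test split."""
    canonical = "\n".join(sorted(set(image_ids)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_splits(ours, baseline):
    """Summarize how two test splits differ, by image identifier."""
    ours_set, base_set = set(ours), set(baseline)
    return {
        "identical": ours_set == base_set,
        "only_in_ours": sorted(ours_set - base_set),
        "only_in_baseline": sorted(base_set - ours_set),
    }


# Hypothetical image IDs, for illustration only.
ours = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
baseline = ["img_003.jpg", "img_001.jpg", "img_004.jpg"]
report = compare_splits(ours, baseline)
```

If the two fingerprints match, the splits contain the same images regardless of listing order. If they differ, the report lists which images appear on only one side, which is where a like-for-like comparison would break down.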

Figures

Figures reproduced from arXiv: 2606.10696 by Nermeen Abou Baker, Uwe Handmann.

Figure 1. Segmentation results for the pair of ground truth with original box and …

Original abstract

Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline process, particularly for tasks that require extensive and senior skills annotations. This study aims to evaluate the generalization of SAM and fine-tuning SAM models using three waste segmentation datasets. Although they are captured from real scenes as SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state-ofthe-art Zerowaste, and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates the Segment Anything Model (SAM) on three waste segmentation datasets (Zerowaste, TACO, TrashCan 1.0) that feature real-world challenges such as occlusions, deformable objects, transparency, and background confusion. It reports that fine-tuning the SAM-ViT-H variant produces large gains, outperforming prior SOTA methods on Zerowaste and TACO by +30 IoU while coming within 1.44 IoU of the prior SOTA on TrashCan 1.0, and concludes that fine-tuning SAM is essential for better generalization on downstream waste tasks rather than discarding the model.

Significance. If the reported IoU deltas are shown to arise from identical test splits and evaluation protocols, the result would indicate that a general-purpose foundation model can be adapted to deliver competitive performance on a domain with substantial visual variability, supporting the broader utility of fine-tuning SAM-style models for specialized segmentation without designing new architectures from scratch.

major comments (2)
  1. [Abstract] Abstract: the headline claim of a +30 IoU improvement on Zerowaste and TACO (and -1.44 on TrashCan 1.0) is presented without any description of the train/test splits, confirmation that the cited SOTA baselines were re-evaluated on the identical held-out images, or the precise IoU computation (per-image vs. dataset-wide, threshold, etc.). This information is load-bearing for the central generalization argument.
  2. [Abstract] Abstract: no training protocol is supplied (learning rate, number of epochs, data augmentations, optimizer, or whether the SAM encoder was frozen), nor are any ablations or error bars across runs. Without these, it is impossible to determine whether the reported gains are attributable to fine-tuning SAM or to other uncontrolled factors.
minor comments (1)
  1. [Abstract] Abstract contains the typo "state-ofthe-art" (missing space).
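
The distinction raised in major comment 1 between per-image and dataset-wide IoU can change a reported number substantially. The sketch below is illustrative only: it assumes binary masks stored as nested lists of 0/1, uses toy masks rather than data from the paper, and does not claim to reproduce the authors' evaluation code.

```python
def intersection_union(pred, gt):
    """Count intersection and union pixels of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def per_image_mean_iou(pairs):
    """Average the IoU of each image, so every image carries equal weight."""
    ious = []
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def dataset_wide_iou(pairs):
    """Pool pixel counts over all images first, so large objects dominate."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Toy masks: a small object that is missed entirely, and a large one matched exactly.
small = ([[1, 0], [0, 0]], [[0, 1], [0, 0]])
large = ([[1, 1], [1, 1]], [[1, 1], [1, 1]])
pairs = [small, large]
```

On these toy masks the per-image mean is 0.5, while the dataset-wide value is 4/6, roughly 0.67. Two papers that quote "IoU" under different conventions are therefore not directly comparable.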

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for greater transparency in the abstract. We agree that the headline results require supporting details on splits, baselines, metrics, and training protocol to be fully convincing. We will revise the abstract (and cross-reference the methods) to address both points.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of a +30 IoU improvement on Zerowaste and TACO (and -1.44 on TrashCan 1.0) is presented without any description of the train/test splits, confirmation that the cited SOTA baselines were re-evaluated on the identical held-out images, or the precise IoU computation (per-image vs. dataset-wide, threshold, etc.). This information is load-bearing for the central generalization argument.

    Authors: We accept the criticism. The abstract will be rewritten to state that we adopt the official train/test splits released with each dataset, that the reported SOTA numbers are taken from the original papers (or re-implemented on the identical test images where code was available), and that IoU is the standard per-image mean IoU averaged over the test set using the conventional 0.5 overlap threshold. These clarifications will be added while preserving the abstract's brevity.
    Revision: yes

  2. Referee: [Abstract] Abstract: no training protocol is supplied (learning rate, number of epochs, data augmentations, optimizer, or whether the SAM encoder was frozen), nor are any ablations or error bars across runs. Without these, it is impossible to determine whether the reported gains are attributable to fine-tuning SAM or to other uncontrolled factors.

    Authors: We agree that the abstract must at least sketch the fine-tuning protocol. The revised abstract will include a concise statement of the optimizer, learning rate, number of epochs, and that the image encoder was fine-tuned (not frozen). The full hyper-parameter list, augmentations, and implementation details already appear in Section 3; we will add an explicit pointer from the abstract. The manuscript does not currently contain ablations or multi-run error bars; we will note this limitation and, if space permits, add a brief statement that single-run results are reported.
    Revision: partial
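
For reference, the multi-run error bars the referee asks for amount to a mean and a spread over repeated fine-tuning runs with different seeds. This is a minimal sketch using Python's standard library; `summarize_runs` is a hypothetical helper and the IoU values are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(ious):
    """Return the mean and sample standard deviation of IoU over repeated runs."""
    if len(ious) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(ious), statistics.stdev(ious)


# Placeholder IoU values from three imagined seeds, NOT results from the paper.
runs = [70.0, 72.0, 74.0]
mean_iou, std_iou = summarize_runs(runs)
summary = f"{mean_iou:.2f} ± {std_iou:.2f}"
```

A single-run result, as the rebuttal describes, supplies only one entry of such a list, so the spread cannot be estimated from it.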

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of fine-tuning results

The manuscript contains no derivation chain, equations, or predictions. Performance numbers are presented as direct outputs of standard fine-tuning and held-out evaluation on waste datasets. No fitted parameters are renamed as predictions, no self-citations bear load-bearing uniqueness claims, and no ansatz or renaming of known results occurs. The work is self-contained against external benchmarks via reported IoU values.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on the standard assumption that fine-tuning a pre-trained vision transformer on domain-specific labeled images will improve segmentation metrics, plus the usual supervised-learning axioms of i.i.d. train/test splits and that IoU is a faithful proxy for segmentation quality. No new entities or free parameters are introduced beyond the usual optimizer and learning-rate choices implicit in any fine-tuning run.

axioms (2)
  • Domain assumption: Fine-tuning a foundation model on task-specific data improves downstream metrics without catastrophic forgetting or domain-shift artifacts.
    Invoked implicitly when the authors conclude that fine-tuning is 'a crucial step' for generalization.
  • Domain assumption: The three waste datasets are representative of real-world waste scenes and share the same distribution as SAM's pre-training data.
    Stated in the abstract when the authors note that the datasets 'are captured from real scenes as SAM was pretrained on' yet still present challenges.

pith-pipeline@v0.9.1-grok · 5718 in / 1384 out tokens · 20481 ms · 2026-06-27T13:40:47.969470+00:00 · methodology

