Don't waste SAM

Nermeen Abou Baker; Uwe Handmann

arxiv: 2606.10696 · v1 · pith:MKNZ2HCHnew · submitted 2026-06-09 · 💻 cs.CV

Don't waste SAM

Nermeen Abou Baker , Uwe Handmann This is my paper

Pith reviewed 2026-06-27 13:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords waste segmentationSegment Anything Modelfine-tuningSAMIoUZerowasteTACOTrashCan

0 comments

The pith

Fine-tuning the Segment Anything Model improves waste image segmentation accuracy on real-world datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how well the Segment Anything Model, originally trained for general image segmentation, can be adapted to the specific task of segmenting waste objects in cluttered scenes. It shows that fine-tuning SAM's largest variant on three public waste datasets produces higher intersection-over-union scores than previous specialized methods on two of them. The authors conclude that instead of ignoring SAM for these applications, researchers should fine-tune it as a starting point to handle challenges like transparent objects and background confusion. This matters because waste segmentation supports recycling and environmental monitoring, where accurate object boundaries are needed but expert annotation is expensive. If the finding holds, general foundation models become practical tools for niche computer vision tasks after modest adaptation.

Core claim

The fine-tuned SAM-ViT-H model outperforms the state-of-the-art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

What carries the argument

The fine-tuned SAM-ViT-H model, which adapts the pre-trained Vision Transformer Huge backbone of the Segment Anything Model to waste segmentation data.

Load-bearing premise

The reported performance gains assume that the fine-tuning experiments used the same test data splits and evaluation protocols as the prior state-of-the-art methods being compared.

What would settle it

Re-running the fine-tuning and evaluation on the exact same test splits of Zerowaste and TACO using the published code and hyperparameters of the previous SOTA methods to check if the +30 IoU advantage disappears.

Figures

Figures reproduced from arXiv: 2606.10696 by Nermeen Abou Baker, Uwe Handmann.

read the original abstract

Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline process, particularly for tasks that require extensive and senior skills annotations. This study aims to evaluate the generalization of SAM and fine-tuning SAM models using three waste segmentation datasets. Although they are captured from real scenes as SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state-ofthe-art Zerowaste, and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fine-tuning SAM on waste datasets produces the claimed IoU numbers, but the abstract gives no evidence the comparisons used identical test splits or evaluation protocols.

read the letter

The paper fine-tunes SAM-ViT-H on three existing waste segmentation collections and reports that it beats the prior numbers on Zerowaste and TACO by roughly 30 IoU while landing 1.44 points behind the best result on TrashCan. That is the concrete takeaway.

The work itself is a direct transfer-learning experiment. The authors note that the datasets contain occlusions, transparent items, and background confusion, which are the sort of conditions SAM was pretrained to handle. They then show the adapted model improves on the cited baselines. The numbers are stated plainly, which is better than papers that only claim qualitative gains.

The soft spot is exactly the one flagged in the stress-test note. The abstract supplies no training recipe, no confirmation that the same test images and metric code were used for the baselines, and no ablations. Without those, the 30-point delta could come from different data partitions or from the fine-tuning setup itself rather than from SAM. The paper does not appear to contain new algorithms or derivations; it is an empirical check on an existing model.

This is a narrow application paper aimed at people building vision systems for recycling or waste sorting. A reader already working on those datasets might find the numbers worth checking against their own runs. Outside that subfield there is little to carry over.

I would not bring it to a reading group. I would not cite it. The central performance claim rests on unverified comparison details, so the paper does not reach the threshold where a serious editor should send it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper evaluates the Segment Anything Model (SAM) on three waste segmentation datasets (Zerowaste, TACO, TrashCan 1.0) that feature real-world challenges such as occlusions, deformable objects, transparency, and background confusion. It reports that fine-tuning the SAM-ViT-H variant produces large gains, outperforming prior SOTA methods on Zerowaste and TACO by +30 IoU while approaching TrashCan 1.0 performance within -1.44 IoU, and concludes that fine-tuning SAM is essential for better generalization on downstream waste tasks rather than discarding the model.

Significance. If the reported IoU deltas are shown to arise from identical test splits and evaluation protocols, the result would indicate that a general-purpose foundation model can be adapted to deliver competitive performance on a domain with substantial visual variability, supporting the broader utility of fine-tuning SAM-style models for specialized segmentation without designing new architectures from scratch.

major comments (2)

[Abstract] Abstract: the headline claim of a +30 IoU improvement on Zerowaste and TACO (and -1.44 on TrashCan 1.0) is presented without any description of the train/test splits, confirmation that the cited SOTA baselines were re-evaluated on the identical held-out images, or the precise IoU computation (per-image vs. dataset-wide, threshold, etc.). This information is load-bearing for the central generalization argument.
[Abstract] Abstract: no training protocol is supplied (learning rate, number of epochs, data augmentations, optimizer, or whether the SAM encoder was frozen), nor are any ablations or error bars across runs. Without these, it is impossible to determine whether the reported gains are attributable to fine-tuning SAM or to other uncontrolled factors.

minor comments (1)

[Abstract] Abstract contains the typo "state-ofthe-art" (missing space).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for greater transparency in the abstract. We agree that the headline results require supporting details on splits, baselines, metrics, and training protocol to be fully convincing. We will revise the abstract (and cross-reference the methods) to address both points.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of a +30 IoU improvement on Zerowaste and TACO (and -1.44 on TrashCan 1.0) is presented without any description of the train/test splits, confirmation that the cited SOTA baselines were re-evaluated on the identical held-out images, or the precise IoU computation (per-image vs. dataset-wide, threshold, etc.). This information is load-bearing for the central generalization argument.

Authors: We accept the criticism. The abstract will be rewritten to state that we adopt the official train/test splits released with each dataset, that the reported SOTA numbers are taken from the original papers (or re-implemented on the identical test images where code was available), and that IoU is the standard per-image mean IoU averaged over the test set using the conventional 0.5 overlap threshold. These clarifications will be added while preserving the abstract's brevity. revision: yes
Referee: [Abstract] Abstract: no training protocol is supplied (learning rate, number of epochs, data augmentations, optimizer, or whether the SAM encoder was frozen), nor are any ablations or error bars across runs. Without these, it is impossible to determine whether the reported gains are attributable to fine-tuning SAM or to other uncontrolled factors.

Authors: We agree that the abstract must at least sketch the fine-tuning protocol. The revised abstract will include a concise statement of the optimizer, learning rate, number of epochs, and that the image encoder was fine-tuned (not frozen). The full hyper-parameter list, augmentations, and implementation details already appear in Section 3; we will add an explicit pointer from the abstract. The manuscript does not currently contain ablations or multi-run error bars; we will note this limitation and, if space permits, add a brief statement that single-run results are reported. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of fine-tuning results

full rationale

The manuscript contains no derivation chain, equations, or predictions. Performance numbers are presented as direct outputs of standard fine-tuning and held-out evaluation on waste datasets. No fitted parameters are renamed as predictions, no self-citations bear load-bearing uniqueness claims, and no ansatz or renaming of known results occurs. The work is self-contained against external benchmarks via reported IoU values.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on the standard assumption that fine-tuning a pre-trained vision transformer on domain-specific labeled images will improve segmentation metrics, plus the usual supervised-learning axioms of i.i.d. train/test splits and that IoU is a faithful proxy for segmentation quality. No new entities or free parameters are introduced beyond the usual optimizer and learning-rate choices implicit in any fine-tuning run.

axioms (2)

domain assumption Fine-tuning a foundation model on task-specific data improves downstream metrics without catastrophic forgetting or domain-shift artifacts
Invoked implicitly when the authors conclude that fine-tuning is 'a crucial step' for generalization.
domain assumption The three waste datasets are representative of real-world waste scenes and share the same distribution as SAM's pre-training data
Stated in the abstract when the authors note that the datasets 'are captured from real scenes as SAM was pretrained on' yet still present challenges.

pith-pipeline@v0.9.1-grok · 5718 in / 1384 out tokens · 20481 ms · 2026-06-27T13:40:47.969470+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references

[1]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

2020
[2]

Chatcpt, November 2022

OpenAI. Chatcpt, November 2022

2022
[3]

Zero-shot text-to-image generation, 2021

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021

2021
[4]

Learning transferable visual models from natural language supervi- sion, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervi- sion, 2021

2021
[5]

Le, Yunhsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021

2021
[6]

Metasa1b, April 2023

MetaSA1B. Metasa1b, April 2023

2023
[7]

Berg, Wan-Yen Lo, Piotr Doll˜A¡r, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll˜A¡r, and Ross Girshick. Segment anything, 2023

2023
[8]

Inpaint anything: Segment anything meets image inpainting, 2023

Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting, 2023

2023
[9]

Anything-3d: Towards single-view anything reconstruction in the wild, 2023

Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything-3d: Towards single-view anything reconstruction in the wild, 2023

2023
[10]

Any-to-any style transfer: Making picasso and da vinci collaborate, 2023

Songhua Liu, Jingwen Ye, and Xinchao Wang. Any-to-any style transfer: Making picasso and da vinci collaborate, 2023

2023
[11]

Track anything: Segment anything meets videos, 2023

Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos, 2023

2023
[12]

Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models, 2023

Jielu Zhang, Zhongliang Zhou, Gengchen Mai, Lan Mu, Mengxuan Hu, and Sheng Li. Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models, 2023

2023
[13]

Learning to ”segment anything” in thermal infrared images through knowledge distillation with a large scale dataset satir, 2023

Junzhang Chen and Xiangzhi Bai. Learning to ”segment anything” in thermal infrared images through knowledge distillation with a large scale dataset satir, 2023

2023
[14]

Can sam segment anything? when sam meets camou- flaged object detection, 2023

Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camou- flaged object detection, 2023

2023
[15]

Can sam segment polyps?, 2023

Tao Zhou, Yizhe Zhang, Yi Zhou, Ye Wu, and Chen Gong. Can sam segment polyps?, 2023

2023
[16]

Ellen Grant, and Yangming Ou

Sheng He, Rina Bao, Jingpeng Li, P. Ellen Grant, and Yangming Ou. Accuracy of segment-anything model (sam) in medical image segmentation tasks, 2023

2023
[17]

Sam struggles in concealed scenes – empirical study on ”segment anything”, 2023

Ge-Peng Ji, Deng-Ping Fan, Peng Xu, Ming-Ming Cheng, Bowen Zhou, and Luc Van Gool. Sam struggles in concealed scenes – empirical study on ”segment anything”, 2023

2023
[18]

Sam fails to segment anything? – sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023

Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam fails to segment anything? – sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023

2023
[19]

Zerowaste dataset: Towards deformable object segmentation in cluttered scenes, 2022

Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. Zerowaste dataset: Towards deformable object segmentation in cluttered scenes, 2022

2022
[20]

Trashcan: A semantically-segmented dataset towards visual detection of marine debris, 2020

Jungseok Hong, Michael Fulton, and Junaed Sattar. Trashcan: A semantically-segmented dataset towards visual detection of marine debris, 2020

2020
[21]

Taco: Trash annotations in context for litter detec- tion, 2020

Pedro F Proenca and Pedro Simoes. Taco: Trash annotations in context for litter detec- tion, 2020

2020
[22]

lightning-sam, April 2023

Luca Medeiros. lightning-sam, April 2023

2023
[23]

Petrov, and Anton Konushin

Konstantin Sofiiuk, Ilia A. Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive segmentation, 2021

2021

[1] [1]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

2020

[2] [2]

Chatcpt, November 2022

OpenAI. Chatcpt, November 2022

2022

[3] [3]

Zero-shot text-to-image generation, 2021

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021

2021

[4] [4]

Learning transferable visual models from natural language supervi- sion, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervi- sion, 2021

2021

[5] [5]

Le, Yunhsuan Sung, Zhen Li, and Tom Duerig

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision, 2021

2021

[6] [6]

Metasa1b, April 2023

MetaSA1B. Metasa1b, April 2023

2023

[7] [7]

Berg, Wan-Yen Lo, Piotr Doll˜A¡r, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll˜A¡r, and Ross Girshick. Segment anything, 2023

2023

[8] [8]

Inpaint anything: Segment anything meets image inpainting, 2023

Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting, 2023

2023

[9] [9]

Anything-3d: Towards single-view anything reconstruction in the wild, 2023

Qiuhong Shen, Xingyi Yang, and Xinchao Wang. Anything-3d: Towards single-view anything reconstruction in the wild, 2023

2023

[10] [10]

Any-to-any style transfer: Making picasso and da vinci collaborate, 2023

Songhua Liu, Jingwen Ye, and Xinchao Wang. Any-to-any style transfer: Making picasso and da vinci collaborate, 2023

2023

[11] [11]

Track anything: Segment anything meets videos, 2023

Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos, 2023

2023

[12] [12]

Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models, 2023

Jielu Zhang, Zhongliang Zhou, Gengchen Mai, Lan Mu, Mengxuan Hu, and Sheng Li. Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models, 2023

2023

[13] [13]

Learning to ”segment anything” in thermal infrared images through knowledge distillation with a large scale dataset satir, 2023

Junzhang Chen and Xiangzhi Bai. Learning to ”segment anything” in thermal infrared images through knowledge distillation with a large scale dataset satir, 2023

2023

[14] [14]

Can sam segment anything? when sam meets camou- flaged object detection, 2023

Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camou- flaged object detection, 2023

2023

[15] [15]

Can sam segment polyps?, 2023

Tao Zhou, Yizhe Zhang, Yi Zhou, Ye Wu, and Chen Gong. Can sam segment polyps?, 2023

2023

[16] [16]

Ellen Grant, and Yangming Ou

Sheng He, Rina Bao, Jingpeng Li, P. Ellen Grant, and Yangming Ou. Accuracy of segment-anything model (sam) in medical image segmentation tasks, 2023

2023

[17] [17]

Sam struggles in concealed scenes – empirical study on ”segment anything”, 2023

Ge-Peng Ji, Deng-Ping Fan, Peng Xu, Ming-Ming Cheng, Bowen Zhou, and Luc Van Gool. Sam struggles in concealed scenes – empirical study on ”segment anything”, 2023

2023

[18] [18]

Sam fails to segment anything? – sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023

Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam fails to segment anything? – sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, medical image segmentation, and more, 2023

2023

[19] [19]

Zerowaste dataset: Towards deformable object segmentation in cluttered scenes, 2022

Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. Zerowaste dataset: Towards deformable object segmentation in cluttered scenes, 2022

2022

[20] [20]

Trashcan: A semantically-segmented dataset towards visual detection of marine debris, 2020

Jungseok Hong, Michael Fulton, and Junaed Sattar. Trashcan: A semantically-segmented dataset towards visual detection of marine debris, 2020

2020

[21] [21]

Taco: Trash annotations in context for litter detec- tion, 2020

Pedro F Proenca and Pedro Simoes. Taco: Trash annotations in context for litter detec- tion, 2020

2020

[22] [22]

lightning-sam, April 2023

Luca Medeiros. lightning-sam, April 2023

2023

[23] [23]

Petrov, and Anton Konushin

Konstantin Sofiiuk, Ilia A. Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive segmentation, 2021

2021