Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes
Pith reviewed 2026-05-23 21:24 UTC · model grok-4.3
The pith
GSAM makes SAM fine-tuning work with any input image size by adding random cropping to training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSAM is the first method to apply random cropping during training with SAM, allowing variable input sizes and thereby significantly reducing the computational cost of training compared to standard SAM fine-tuning and other methods.
What carries the argument
Random cropping of input images during the training phase, which enables the model to process patches of varying sizes instead of requiring a fixed 1024x1024 resolution.
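To make the mechanism concrete, the following is a minimal sketch of paired random cropping, assuming an image and its segmentation mask stored as row-major grids. The function names `sample_crop_box` and `crop`, the toy 6x8 data, and the 4x4 crop size are illustrative assumptions, not details taken from the paper. The point it shows is that one sampled window is applied to both image and mask so the labels stay aligned.

```python
import random

def sample_crop_box(height, width, crop_h, crop_w, rng=random):
    """Pick the top-left corner of a crop_h x crop_w window inside the image."""
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return top, left, crop_h, crop_w

def crop(grid, box):
    """Cut the same window out of a row-major 2D grid (image or mask)."""
    top, left, h, w = box
    return [row[left:left + w] for row in grid[top:top + h]]

# Toy 6x8 "image" whose pixels store their own (row, col), plus a toy mask.
image = [[(r, c) for c in range(8)] for r in range(6)]
mask = [[(r + c) % 2 for c in range(8)] for r in range(6)]

rng = random.Random(0)                      # seeded only for reproducibility
box = sample_crop_box(6, 8, crop_h=4, crop_w=4, rng=rng)
image_crop = crop(image, box)               # same box for image ...
mask_crop = crop(mask, box)                 # ... and mask keeps them aligned
```

In a real training loop a fresh box would be drawn for every sample in every epoch, typically with array slicing from a tensor library rather than nested lists.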
If this is right
- Training time and memory use decrease because smaller random crops replace full fixed-size images.
- Images of arbitrary aspect ratios can be used directly without resizing that distorts content.
- Accuracy remains comparable to or higher than that of baseline methods across multiple dataset types and resolutions.
- Fine-tuning becomes practical for larger collections of images that vary in pixel count.
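The first consequence above can be checked with back-of-the-envelope arithmetic. This sketch assumes a patch size of 16, which follows from the 64x64 token grid for 1024x1024 inputs mentioned in the referee report; the 256x256 crop size is a hypothetical choice, not a value from the paper. The quadratic figure applies only to layers that attend across all tokens, so it is an upper bound on the saving rather than a measured speedup.

```python
PATCH = 16  # assumed patch size: 1024 / 16 = 64 tokens per side

def token_count(height, width, patch=PATCH):
    """Number of patch tokens the image encoder sees for a given input size."""
    if height % patch or width % patch:
        raise ValueError("sides must be multiples of the patch size")
    return (height // patch) * (width // patch)

full = token_count(1024, 1024)    # 64 * 64 = 4096 tokens
small = token_count(256, 256)     # 16 * 16 = 256 tokens (hypothetical crop)

token_ratio = full / small        # 16.0: per-token work shrinks 16x
pair_ratio = token_ratio ** 2     # 256.0: token-pair count for global attention
```

Actual training time and memory depend on the attention pattern, batch size, and hardware, so these ratios are a rough guide to scaling only.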
Where Pith is reading between the lines
- This approach could be tested on other foundation models that currently enforce fixed input dimensions.
- Deployment scenarios might benefit if the model can adapt to the native resolution of new images without retraining from scratch.
- Further experiments could check performance when crop sizes are chosen based on the target application rather than randomly.
Load-bearing premise
Random cropping integrates directly into SAM fine-tuning without harming the model's pre-trained segmentation performance or requiring changes to the model architecture.
What would settle it
If experiments on additional datasets show that GSAM requires more computation than standard fine-tuning or produces lower segmentation accuracy, the efficiency and accuracy claims would not hold.
Original abstract
There has been a lot of recent research on improving the efficiency of fine-tuning foundation models. In this paper, we propose a novel efficient fine-tuning method that allows the input image size of Segment Anything Model (SAM) to be variable. SAM is a powerful foundational model for image segmentation trained on huge datasets, but it requires fine-tuning to recognize arbitrary classes. The input image size of SAM is fixed at 1024 x 1024, resulting in substantial computational demands during training. Furthermore, the fixed input image size may result in the loss of image information, e.g. due to fixed aspect ratios. To address this problem, we propose Generalized SAM (GSAM). Different from the previous methods, GSAM is the first to apply random cropping during training with SAM, thereby significantly reducing the computational cost of training. Experiments on datasets of various types and various pixel counts have shown that GSAM can train more efficiently than SAM and other fine-tuning methods for SAM, achieving comparable or higher accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Generalized SAM (GSAM), a fine-tuning method for the Segment Anything Model that uses random cropping during training to support variable input image sizes. This is claimed to reduce computational cost compared to standard SAM fine-tuning (which fixes inputs at 1024x1024) while achieving comparable or higher accuracy, without additional architectural changes. Experiments are reported on multiple datasets with varying pixel counts and types.
Significance. If the central technical claim holds after clarification, the work would address a practical limitation in applying SAM to images of arbitrary resolutions and aspect ratios, offering efficiency gains for fine-tuning on resource-constrained settings. The use of random cropping as a simple mechanism for variable-size handling could be broadly applicable if it preserves pre-trained weights without hidden modifications.
major comments (2)
- [Abstract and Methods] Abstract and Methods section: The claim that GSAM enables variable input sizes 'without additional architectural changes' conflicts with the fixed positional embeddings in SAM's ViT-B/16 image encoder (learned for exactly 64x64 tokens from 1024x1024 inputs). Random crops of differing spatial dimensions require either resizing back to 1024x1024 (rendering the variable-size claim false) or positional embedding interpolation/padding (an architectural change). This load-bearing inconsistency must be resolved with explicit implementation details.
- [Experiments] Experiments section: The abstract asserts efficiency and accuracy gains over SAM and other fine-tuning methods across datasets of various types and pixel counts, but the provided text lacks quantitative results, specific baselines, training-time measurements, inference details for variable sizes, or error analysis. Without these, the empirical support for the central efficiency claim cannot be assessed.
minor comments (2)
- [Abstract] Abstract: The opening sentence uses informal language ('There has been a lot of recent research'); revise to a more precise statement such as 'Recent work has explored efficient fine-tuning of foundation models.'
- [Abstract] Abstract: The phrase 'various pixel counts' is vague; specify the range of resolutions or dataset statistics used in experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to improve clarity and provide additional details.
Point-by-point responses
- Referee: [Abstract and Methods] Abstract and Methods section: The claim that GSAM enables variable input sizes 'without additional architectural changes' conflicts with the fixed positional embeddings in SAM's ViT-B/16 image encoder (learned for exactly 64x64 tokens from 1024x1024 inputs). Random crops of differing spatial dimensions require either resizing back to 1024x1024 (rendering the variable-size claim false) or positional embedding interpolation/padding (an architectural change). This load-bearing inconsistency must be resolved with explicit implementation details.
Authors: We agree that the manuscript requires explicit implementation details on this point. GSAM applies random cropping to produce smaller training inputs of varying sizes from original images of arbitrary resolutions, which reduces token count and thus training compute. For the fixed positional embeddings, we use standard bilinear interpolation to match the resulting token grid (a common ViT adaptation with no new parameters or modules introduced). We view this as preserving the core architecture while enabling variable sizes. We will revise the Abstract and Methods sections to describe this process in detail, including training and inference handling.
Revision: yes
- Referee: [Experiments] Experiments section: The abstract asserts efficiency and accuracy gains over SAM and other fine-tuning methods across datasets of various types and pixel counts, but the provided text lacks quantitative results, specific baselines, training-time measurements, inference details for variable sizes, or error analysis. Without these, the empirical support for the central efficiency claim cannot be assessed.
Authors: We apologize if the quantitative results were not sufficiently prominent. The full manuscript reports experiments across multiple datasets with accuracy metrics, comparisons to SAM and other fine-tuning baselines, and efficiency measurements. We will revise the Experiments section to add explicit tables with training-time results, inference details for variable input sizes, and error analysis to better substantiate the claims.
Revision: yes
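The bilinear interpolation of positional embeddings named in the first response can be sketched as follows. This is a toy illustration in pure Python and not the authors' implementation: the function name `resize_pos_embed`, the align-corners sampling convention, and the 4x4 grid of 2-dimensional vectors are all assumptions made for the example. It shows how a table learned for one token grid can be resampled to another grid without adding parameters.

```python
def resize_pos_embed(grid, new_h, new_w):
    """Bilinearly resample grid[h][w][dim] to a new_h x new_w token grid.

    Assumes the align-corners convention: corner tokens map to corner tokens.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])

    def source_coord(i, new_n, old_n):
        # Map output index i to a fractional coordinate in the old grid.
        if new_n == 1:
            return 0.0
        return i * (old_n - 1) / (new_n - 1)

    out = []
    for i in range(new_h):
        y = source_coord(i, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0                      # weight of the lower neighbour row
        row = []
        for j in range(new_w):
            x = source_coord(j, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0                  # weight of the right neighbour column
            vec = []
            for d in range(dim):
                top = grid[y0][x0][d] * (1 - wx) + grid[y0][x1][d] * wx
                bottom = grid[y1][x0][d] * (1 - wx) + grid[y1][x1][d] * wx
                vec.append(top * (1 - wy) + bottom * wy)
            row.append(vec)
        out.append(row)
    return out

# Toy stand-in for a learned table: each vector stores its own (row, col).
pos = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
resized = resize_pos_embed(pos, 2, 7)    # non-square target grid
```

Because the toy embeddings are linear in position, each resampled vector equals its fractional source coordinate, which makes the result easy to inspect. A practical version would operate on tensors, and whether interpolated embeddings preserve pre-trained accuracy is exactly the question the referee raises.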
Circularity Check
No circularity: experimental proposal with no derivation chain
The paper proposes GSAM as an empirical fine-tuning method that applies random cropping to SAM for variable input sizes, supported solely by experimental comparisons on multiple datasets. No equations, fitted parameters, or mathematical derivations are described that could reduce to self-referential inputs by construction. Claims of efficiency and accuracy rest on direct empirical results rather than any self-definition, renamed predictions, or load-bearing self-citations. The central contribution is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: SAM requires fixed 1024x1024 input images.