arxiv: 2408.12406 · v2 · submitted 2024-08-22 · 💻 cs.CV

Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes

Pith reviewed 2026-05-23 21:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords SAM · fine-tuning · image segmentation · random cropping · variable input size · efficient training · foundation model

The pith

GSAM makes SAM fine-tuning work with any input image size by adding random cropping to training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the fixed input size that the Segment Anything Model (SAM) imposes during fine-tuning. SAM normally requires 1024 × 1024-pixel inputs, which raises computation and can distort or discard content in images whose shape does not match. GSAM introduces random cropping so that training runs on variable-sized patches taken from the original images. This lowers the resources needed for training, and the reported experiments show accuracy that is comparable or higher across different kinds of datasets.

Core claim

GSAM is the first method to apply random cropping during training with SAM, allowing variable input sizes and thereby significantly reducing the computational cost of training compared to standard SAM fine-tuning and other methods.

What carries the argument

Random cropping of input images during the training phase, which enables the model to process patches of varying sizes instead of requiring a fixed 1024x1024 resolution.
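As a minimal sketch of what random cropping means here (not the authors' code), the snippet below cuts the same randomly placed window from an image and its label mask so the two stay aligned. The function name `random_crop_pair`, the nested-list image representation, and the toy sizes are assumptions made for illustration only.

```python
import random


def random_crop_pair(image, mask, crop_h, crop_w, rng):
    """Cut the same randomly placed crop_h x crop_w window from image and mask.

    `image` and `mask` are row-major nested lists of equal height and width.
    """
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop window is larger than the image")
    # Pick the top-left corner uniformly among all positions that fit.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)

    def window(grid):
        return [row[left:left + crop_w] for row in grid[top:top + crop_h]]

    return window(image), window(mask)


# Toy 6x8 "image" whose pixel value encodes its own (row, col) position.
image = [[10 * r + c for c in range(8)] for r in range(6)]
mask = [[(r + c) % 2 for c in range(8)] for r in range(6)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
patch, patch_mask = random_crop_pair(image, mask, crop_h=3, crop_w=4, rng=rng)
print(len(patch), len(patch[0]))  # 3 4
```

Cropping image and mask with one shared window is the design point: a segmentation label is only valid for the pixels it was drawn on, so the two crops must never be sampled independently.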

If this is right

  • Training time and memory use decrease because smaller random crops replace full fixed-size images.
  • Images of arbitrary aspect ratios can be used directly without resizing that distorts content.
  • Accuracy remains comparable or higher than baseline methods on multiple dataset types and resolutions.
  • Fine-tuning becomes practical for larger collections of images that vary in pixel count.
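The first consequence can be made concrete with a little arithmetic. The sketch below counts patch tokens, assuming the 16-pixel patch size that the referee report cites for SAM's ViT-B/16 encoder. The crop sizes 128, 256, and 512 are illustrative choices, and token count is only a rough proxy for the MACs the paper actually reports.

```python
def token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


full = token_count(1024, 1024)  # 64 * 64 = 4096 tokens at SAM's fixed size
for side in (128, 256, 512):  # illustrative crop sizes, not the paper's settings
    crop = token_count(side, side)
    print(f"{side}x{side}: {crop} tokens, {full // crop}x fewer than 1024x1024")
```

Halving each side of the input quarters the number of tokens the image encoder has to process, which is where the training-time and memory savings would come from.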

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be tested on other foundation models that currently enforce fixed input dimensions.
  • Deployment scenarios might benefit if the model can adapt to the native resolution of new images without retraining from scratch.
  • Further experiments could check performance when crop sizes are chosen based on the target application rather than randomly.

Load-bearing premise

Random cropping integrates directly into SAM fine-tuning without harming the model's pre-trained segmentation performance or requiring changes to the model architecture.

What would settle it

If experiments on additional datasets show that GSAM requires more computation than standard fine-tuning or produces lower segmentation accuracy, the efficiency and accuracy claims would not hold.

Figures

Figures reproduced from arXiv: 2408.12406 by Hinako Mitsuoka, Kazuhiro Hotta, Sota Kato.

Figure 1
When SAM is fine-tuned for semantic segmentation with conventional methods, only fixed-size images can be input. As a result, input images are deformed to fit a specific size, causing information loss. In contrast, GSAM supports various input image sizes while maintaining the superior segmentation performance of SAM. This allows images to be used in their original form and enables random cropping during fi…
Figure 2
The trade-off between MACs and segmentation accuracy (mIoU) for conventional fine-tuning methods for SAM on the ISBI2012 dataset [15]. The red circles indicate our proposed GSAM and the triangles indicate the conventional methods. Random cropping is only performed for GSAM, cropping to the number of pixels indicated by the "size". Random cropping cannot be used except for GSAM due to its structure. scale…
Figure 3
Overview of Generalized SAM. FROZEN indicates a network in which the weight parameters are fixed, and Learnable indicates a network in which the weight parameters are updated. spatial orientation, which enables it to retain positional information even when the input size of the feature map is variable. However, the original pretrained SAM does not support random inputs, and it is possible that global lear…
Figure 4
Overview of Spatial-Multiscale AdaptFormer. Five convolutional layers with different receptive fields are used to acquire the spatial features necessary for semantic segmentation. spatial information into account. Since spatial features are important information in semantic segmentation, the proposed SM-AdaptFormer prepares multiple convolutional layers with kernels in various ranges and acquires multisca…
Figure 5
Qualitative results. The first row is the results on the ISBI2012, the second is on the M-Building, the third is on the Cityscapes, and the fourth is on the Trans10k dataset. The Cityscapes dataset can be fed into GSAM with its original aspect ratio, but for simplicity of comparison, the same aspect ratios as other methods are shown. (a) Input image, (b) Ground truth, (c) SAM [18], (d) AdaptFormer [9], (e)…
Figure 6
The MACs and segmentation accuracy of each method on the ISBI2012 dataset are illustrated. The bar graphs are the MACs for each method and the line graphs are mIoU. LoRA, ConvLoRA, AdaptFormer, and GSAM are compared when the size of the random cropping is changed. Since the number of input image sizes is fixed in conventional SAM fine-tuning methods, the computational cost becomes huge, and it can be seen …
read the original abstract

There has been a lot of recent research on improving the efficiency of fine-tuning foundation models. In this paper, we propose a novel efficient fine-tuning method that allows the input image size of Segment Anything Model (SAM) to be variable. SAM is a powerful foundational model for image segmentation trained on huge datasets, but it requires fine-tuning to recognize arbitrary classes. The input image size of SAM is fixed at 1024 x 1024, resulting in substantial computational demands during training. Furthermore, the fixed input image size may result in the loss of image information, e.g. due to fixed aspect ratios. To address this problem, we propose Generalized SAM (GSAM). Different from the previous methods, GSAM is the first to apply random cropping during training with SAM, thereby significantly reducing the computational cost of training. Experiments on datasets of various types and various pixel counts have shown that GSAM can train more efficiently than SAM and other fine-tuning methods for SAM, achieving comparable or higher accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Generalized SAM (GSAM), a fine-tuning method for the Segment Anything Model that uses random cropping during training to support variable input image sizes. This is claimed to reduce computational cost compared to standard SAM fine-tuning (which fixes inputs at 1024x1024) while achieving comparable or higher accuracy, without additional architectural changes. Experiments are reported on multiple datasets with varying pixel counts and types.

Significance. If the central technical claim holds after clarification, the work would address a practical limitation in applying SAM to images of arbitrary resolutions and aspect ratios, offering efficiency gains for fine-tuning in resource-constrained settings. The use of random cropping as a simple mechanism for variable-size handling could be broadly applicable if it preserves pre-trained weights without hidden modifications.

major comments (2)
  1. [Abstract and Methods] Abstract and Methods section: The claim that GSAM enables variable input sizes 'without additional architectural changes' conflicts with the fixed positional embeddings in SAM's ViT-B/16 image encoder (learned for exactly 64x64 tokens from 1024x1024 inputs). Random crops of differing spatial dimensions require either resizing back to 1024x1024 (rendering the variable-size claim false) or positional embedding interpolation/padding (an architectural change). This load-bearing inconsistency must be resolved with explicit implementation details.
  2. [Experiments] Experiments section: The abstract asserts efficiency and accuracy gains over SAM and other fine-tuning methods across datasets of various types and pixel counts, but the provided text lacks quantitative results, specific baselines, training-time measurements, inference details for variable sizes, or error analysis. Without these, the empirical support for the central efficiency claim cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract: The opening sentence uses informal language ('There has been a lot of recent research'); revise to a more precise statement such as 'Recent work has explored efficient fine-tuning of foundation models.'
  2. [Abstract] Abstract: The phrase 'various pixel counts' is vague; specify the range of resolutions or dataset statistics used in experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to improve clarity and provide additional details.

read point-by-point responses
  1. Referee: [Abstract and Methods] Abstract and Methods section: The claim that GSAM enables variable input sizes 'without additional architectural changes' conflicts with the fixed positional embeddings in SAM's ViT-B/16 image encoder (learned for exactly 64x64 tokens from 1024x1024 inputs). Random crops of differing spatial dimensions require either resizing back to 1024x1024 (rendering the variable-size claim false) or positional embedding interpolation/padding (an architectural change). This load-bearing inconsistency must be resolved with explicit implementation details.

    Authors: We agree that the manuscript requires explicit implementation details on this point. GSAM applies random cropping to produce smaller training inputs of varying sizes from original images of arbitrary resolutions, which reduces token count and thus training compute. For the fixed positional embeddings, we use standard bilinear interpolation to match the resulting token grid (a common ViT adaptation with no new parameters or modules introduced). We view this as preserving the core architecture while enabling variable sizes. We will revise the Abstract and Methods sections to describe this process in detail, including training and inference handling. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts efficiency and accuracy gains over SAM and other fine-tuning methods across datasets of various types and pixel counts, but the provided text lacks quantitative results, specific baselines, training-time measurements, inference details for variable sizes, or error analysis. Without these, the empirical support for the central efficiency claim cannot be assessed.

    Authors: We apologize if the quantitative results were not sufficiently prominent. The full manuscript reports experiments across multiple datasets with accuracy metrics, comparisons to SAM and other fine-tuning baselines, and efficiency measurements. We will revise the Experiments section to add explicit tables with training-time results, inference details for variable input sizes, and error analysis to better substantiate the claims. revision: yes
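The first simulated response refers to bilinear interpolation of the positional embeddings without giving details. As a hedged sketch of that kind of resampling (not a confirmed description of GSAM's implementation), the snippet below resizes a 64x64 embedding grid to a smaller token grid with NumPy. The function name `resize_pos_embed`, the 8-channel width, the target grid size, and the corner-aligned sampling convention are all assumptions chosen for illustration.

```python
import numpy as np


def resize_pos_embed(pos_embed: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Bilinearly resample an (H, W, C) positional-embedding grid to (new_h, new_w, C)."""
    old_h, old_w, _ = pos_embed.shape
    # Sample locations in source-grid coordinates; corners map onto corners.
    ys = np.linspace(0.0, old_h - 1, new_h)
    xs = np.linspace(0.0, old_w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, old_h - 1)
    x1 = np.minimum(x0 + 1, old_w - 1)
    wy = (ys - y0)[:, None, None]  # vertical blend weights
    wx = (xs - x0)[None, :, None]  # horizontal blend weights
    # Blend horizontally along the upper and lower source rows, then vertically.
    upper = pos_embed[y0][:, x0] * (1 - wx) + pos_embed[y0][:, x1] * wx
    lower = pos_embed[y1][:, x0] * (1 - wx) + pos_embed[y1][:, x1] * wx
    return upper * (1 - wy) + lower * wy


# 64x64 grid as cited for 1024x1024 inputs; 8 channels is a toy width.
grid = np.random.default_rng(0).normal(size=(64, 64, 8))
small = resize_pos_embed(grid, 16, 24)  # e.g. a 256x384-pixel crop at patch size 16
print(small.shape)  # (16, 24, 8)
```

Resampling of this kind adds no trainable parameters, which is the basis for the rebuttal's position that the core architecture is preserved. Whether it counts as an architectural change is the point the referee disputes.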

Circularity Check

0 steps flagged

No circularity: experimental proposal with no derivation chain

full rationale

The paper proposes GSAM as an empirical fine-tuning method that applies random cropping to SAM for variable input sizes, supported solely by experimental comparisons on multiple datasets. No equations, fitted parameters, or mathematical derivations are described that could reduce to self-referential inputs by construction. Claims of efficiency and accuracy rest on direct empirical results rather than any self-definition, renamed predictions, or load-bearing self-citations. The central contribution is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or new entities are specified. The approach relies on the known fixed-input property of SAM and the standard practice of random cropping as a data augmentation technique.

axioms (1)
  • domain assumption SAM requires fixed 1024x1024 input images
    Explicitly stated in the abstract as the core limitation being addressed.

pith-pipeline@v0.9.0 · 5707 in / 1146 out tokens · 32756 ms · 2026-05-23T21:24:45.542133+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 10 internal anchors

  1. [1]

    Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation

    Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K.: Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv:1802.06955 (2018)

  2. [2]

    Semantic object classes in video: A high-definition ground truth database

    Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-definition ground truth database. Pattern recognition letters 30(2), 88–97 (2009)

  3. [3]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al.: Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023)

  4. [4]

    Swin-Unet: Unet-like pure transformer for medical image segmentation

    Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-unet: Unet-like pure transformer for medical image segmentation. In: European conference on computer vision. pp. 205–218. Springer (2022)

  5. [5]

    TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

    Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)

  6. [6]

    DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs

    Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2017)

  7. [7]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)

  8. [8]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 801–818 (2018)

  9. [9]

    AdaptFormer: Adapting vision transformers for scalable visual recognition

    Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., Luo, P.: Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems 35, 16664–16678 (2022)

  10. [10]

    Conditional positional encodings for vision transformers

    Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

  11. [11]

    The Cityscapes dataset for semantic urban scene understanding

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

  12. [12]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020), https://arxiv.org/abs/2010.11929

  14. [14]

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  15. [15]

    Segmentation of neuronal structures in em stacks challenge. https://imagej.net/events/isbi-2012-segmentation-challenge (2012)

  16. [16]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

  17. [17]

    Kvasir-SEG: A segmented polyp dataset

    Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., De Lange, T., Johansen, D., Johansen, H.D.: Kvasir-seg: A segmented polyp dataset. In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26. pp. 451–462. Springer (2020)

  18. [18]

    Segment anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)

  19. [19]

    ImageNet classification with deep convolutional neural networks

    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)

  20. [20]

    SAMUS: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation

    Lin, X., Xiang, Y., Zhang, L., Yang, X., Yan, Z., Yu, L.: Samus: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation. arXiv preprint arXiv:2309.06824 (2023)

  21. [21]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)

  22. [22]

    Mnih, V.: Machine Learning for Aerial Image Labeling. Ph.D. thesis, University of Toronto (2013)

  23. [23]

    Learning transferable visual models from natural language supervision

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  24. [24]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  25. [25]

    U-Net: Convolutional networks for biomedical image segmentation

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. pp. 234–241. Springer (2015)

  26. [26]

    Multi-atlas labeling beyond the cranial vault - workshop and challenge. https://www.synapse.org/#!Synapse:syn3193805/wiki/217789 (2015)

  27. [27]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  28. [28]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

  29. [29]

    Segmenting transparent objects in the wild

    Xie, E., Wang, W., Wang, W., Ding, M., Shen, C., Luo, P.: Segmenting transparent objects in the wild. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16. pp. 696–711. Springer (2020)

  30. [30]

    Multi-Scale Context Aggregation by Dilated Convolutions

    Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

  31. [31]

    Customized segment anything model for medical image segmentation

    Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)

  32. [32]

    Pyramid scene parsing network

    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2881–2890 (2017)

  33. [33]

    Convolution meets lora: Parameter efficient finetuning for segment anything model

    Zhong, Z., Tang, Z., He, T., Fang, H., Yuan, C.: Convolution meets lora: Parameter efficient finetuning for segment anything model. arXiv preprint arXiv:2401.17868 (2024)

  34. [34]

    Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Grana...