Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes

Hinako Mitsuoka; Kazuhiro Hotta; Sota Kato

arxiv: 2408.12406 · v2 · submitted 2024-08-22 · 💻 cs.CV

Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes

Sota Kato , Hinako Mitsuoka , Kazuhiro Hotta This is my paper

Pith reviewed 2026-05-23 21:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords SAMfine-tuningimage segmentationrandom croppingvariable input sizeefficient trainingfoundation model

0 comments

The pith

GSAM makes SAM fine-tuning work with any input image size by adding random cropping to training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the fixed input size limitation of the Segment Anything Model during fine-tuning. SAM normally requires 1024 by 1024 pixel inputs, which increases computation and can discard parts of images with mismatched shapes. GSAM introduces random cropping so that training uses variable sized patches from the original images. This change lowers the resources needed for training while experiments show accuracy stays the same or improves across different kinds of datasets.

Core claim

GSAM is the first method to apply random cropping during training with SAM, allowing variable input sizes and thereby significantly reducing the computational cost of training compared to standard SAM fine-tuning and other methods.

What carries the argument

Random cropping of input images during the training phase, which enables the model to process patches of varying sizes instead of requiring a fixed 1024x1024 resolution.

If this is right

Training time and memory use decrease because smaller random crops replace full fixed-size images.
Images of arbitrary aspect ratios can be used directly without resizing that distorts content.
Accuracy remains comparable or higher than baseline methods on multiple dataset types and resolutions.
Fine-tuning becomes practical for larger collections of images that vary in pixel count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be tested on other foundation models that currently enforce fixed input dimensions.
Deployment scenarios might benefit if the model can adapt to the native resolution of new images without retraining from scratch.
Further experiments could check performance when crop sizes are chosen based on the target application rather than randomly.

Load-bearing premise

Random cropping integrates directly into SAM fine-tuning without harming the model's pre-trained segmentation performance or requiring changes to the model architecture.

What would settle it

If experiments on additional datasets show that GSAM requires more computation than standard fine-tuning or produces lower segmentation accuracy, the efficiency and accuracy claims would not hold.

Figures

Figures reproduced from arXiv: 2408.12406 by Hinako Mitsuoka, Kazuhiro Hotta, Sota Kato.

**Figure 1.** Figure 1: When SAM is fine-tuned for semantic segmentation with conventional methods, only fixed-size images can be input. As a result, input images are deformed to fit a specific size, causing information loss. In contrast, GSAM supports various input image sizes while maintaining the superior segmentation performance of SAM. This allows images to be used in their original form and enables random cropping during fi… view at source ↗

**Figure 2.** Figure 2: The trade-off between MACs and segmentation accuracy (mIoU) for conventional fine-tuning methods for SAM on the ISBI2012 dataset [15]. The Red circles indicate our proposed GSAM and the triangles indicate the conventional methods. Random cropping is only performed for GSAM, cropping to the number of pixels indicated by the "size". Random cropping cannot be used except for GSAM due to its structure. scale… view at source ↗

**Figure 3.** Figure 3: Overview of Generalized SAM. F ROZEN indicates a network in which the weight parameters are fixed, and Learnable indicates a network in which the weight parameters are updated. spatial orientation, which enables it to retain positional information even when the input size of the feature map is variable. However, the original pretrained SAM does not support random inputs, and it is possible that global lear… view at source ↗

**Figure 4.** Figure 4: Overview of Spatial-Multiscale AdaptFormer. Five convolutional layers with different receptive fields are used to acquire the spatial features necessary for semantic segmentation. spatial information into account. Since spatial features are important information in semantic segmentation, the proposed SM-AdaptFormer prepares multiple convolutional layers with kernels in various ranges and acquires multisca… view at source ↗

**Figure 5.** Figure 5: Qualitative results. The first row is the results on the ISBI2012, the second is on the M-Building, the third is on the Cityscapes, and the fourth is on the Trans10k dataset. The Cityscapes dataset can be fed into GSAM with its original aspect ratio, but for simplicity of comparison, the same aspect ratios as other methods are shown. (a) Input image, (b) Ground truth, (c) SAM [18], (d) AdaptFormer [9], (e)… view at source ↗

**Figure 6.** Figure 6: The MACs and segmentation accuracy of each method on the ISBI2012 dataset are illustrated. The bar graphs are the MACs for each method and the line graphs are mIoU. LoRA, ConvLoRA, AdaptFormer, and GSAM are compared when the size of the random cropping is changed. Since the number of input image sizes is fixed in conventional SAM fine-tuning methods, the computational cost becomes huge, and it can be seen … view at source ↗

read the original abstract

There has been a lot of recent research on improving the efficiency of fine-tuning foundation models. In this paper, we propose a novel efficient fine-tuning method that allows the input image size of Segment Anything Model (SAM) to be variable. SAM is a powerful foundational model for image segmentation trained on huge datasets, but it requires fine-tuning to recognize arbitrary classes. The input image size of SAM is fixed at 1024 x 1024, resulting in substantial computational demands during training. Furthermore, the fixed input image size may result in the loss of image information, e.g. due to fixed aspect ratios. To address this problem, we propose Generalized SAM (GSAM). Different from the previous methods, GSAM is the first to apply random cropping during training with SAM, thereby significantly reducing the computational cost of training. Experiments on datasets of various types and various pixel counts have shown that GSAM can train more efficiently than SAM and other fine-tuning methods for SAM, achieving comparable or higher accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GSAM's random-crop trick for variable-size SAM inputs collides with the ViT positional embeddings and the abstract's 'no architecture changes' claim, so the central efficiency story needs the full methods to hold up.

read the letter

The paper's main move is to train SAM with random crops so the model can accept inputs of varying sizes and aspect ratios without the usual resize-to-1024 step. That is a practical pain point for people fine-tuning on non-square data, and the idea is straightforward enough that it could save some compute if it works cleanly. The abstract positions this as the first such use of cropping with SAM and reports efficiency plus accuracy gains across several datasets, which is the sort of targeted engineering note that can still be useful to practitioners. Credit for naming the fixed-size limitation explicitly and trying a simple augmentation fix instead of new modules or adapters. The soft spot is load-bearing. SAM's image encoder is a ViT-B/16 whose positional embeddings are learned for exactly 64x64 tokens. A random crop whose spatial size differs from 1024x1024 produces a different token grid; either the crop gets resized back to 1024 (which erases the variable-size benefit) or the embeddings must be interpolated, padded, or replaced (which is an architectural change). The abstract asserts both variable sizes and no extra architecture, so at least one of those statements is false unless the full text shows a third option that is not obvious. No quantitative tables, baselines, or error bars appear in the provided abstract, so it is impossible to judge whether the reported gains survive proper controls or whether test-time inputs are actually variable. This is the kind of short note that might interest a reading group focused on SAM fine-tuning tricks, but only if the methods section resolves the embedding question. I would not cite it yet and would lean toward desk reject unless the full paper demonstrates that variable token grids are handled without touching the positional embeddings or resizing the input.

Referee Report

2 major / 2 minor

Summary. The paper proposes Generalized SAM (GSAM), a fine-tuning method for the Segment Anything Model that uses random cropping during training to support variable input image sizes. This is claimed to reduce computational cost compared to standard SAM fine-tuning (which fixes inputs at 1024x1024) while achieving comparable or higher accuracy, without additional architectural changes. Experiments are reported on multiple datasets with varying pixel counts and types.

Significance. If the central technical claim holds after clarification, the work would address a practical limitation in applying SAM to images of arbitrary resolutions and aspect ratios, offering efficiency gains for fine-tuning on resource-constrained settings. The use of random cropping as a simple mechanism for variable-size handling could be broadly applicable if it preserves pre-trained weights without hidden modifications.

major comments (2)

[Abstract and Methods] Abstract and Methods section: The claim that GSAM enables variable input sizes 'without additional architectural changes' conflicts with the fixed positional embeddings in SAM's ViT-B/16 image encoder (learned for exactly 64x64 tokens from 1024x1024 inputs). Random crops of differing spatial dimensions require either resizing back to 1024x1024 (rendering the variable-size claim false) or positional embedding interpolation/padding (an architectural change). This load-bearing inconsistency must be resolved with explicit implementation details.
[Experiments] Experiments section: The abstract asserts efficiency and accuracy gains over SAM and other fine-tuning methods across datasets of various types and pixel counts, but the provided text lacks quantitative results, specific baselines, training-time measurements, inference details for variable sizes, or error analysis. Without these, the empirical support for the central efficiency claim cannot be assessed.

minor comments (2)

[Abstract] Abstract: The opening sentence uses informal language ('There has been a lot of recent research'); revise to a more precise statement such as 'Recent work has explored efficient fine-tuning of foundation models.'
[Abstract] Abstract: The phrase 'various pixel counts' is vague; specify the range of resolutions or dataset statistics used in experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the manuscript to improve clarity and provide additional details.

read point-by-point responses

Referee: [Abstract and Methods] Abstract and Methods section: The claim that GSAM enables variable input sizes 'without additional architectural changes' conflicts with the fixed positional embeddings in SAM's ViT-B/16 image encoder (learned for exactly 64x64 tokens from 1024x1024 inputs). Random crops of differing spatial dimensions require either resizing back to 1024x1024 (rendering the variable-size claim false) or positional embedding interpolation/padding (an architectural change). This load-bearing inconsistency must be resolved with explicit implementation details.

Authors: We agree that the manuscript requires explicit implementation details on this point. GSAM applies random cropping to produce smaller training inputs of varying sizes from original images of arbitrary resolutions, which reduces token count and thus training compute. For the fixed positional embeddings, we use standard bilinear interpolation to match the resulting token grid (a common ViT adaptation with no new parameters or modules introduced). We view this as preserving the core architecture while enabling variable sizes. We will revise the Abstract and Methods sections to describe this process in detail, including training and inference handling. revision: yes
Referee: [Experiments] Experiments section: The abstract asserts efficiency and accuracy gains over SAM and other fine-tuning methods across datasets of various types and pixel counts, but the provided text lacks quantitative results, specific baselines, training-time measurements, inference details for variable sizes, or error analysis. Without these, the empirical support for the central efficiency claim cannot be assessed.

Authors: We apologize if the quantitative results were not sufficiently prominent. The full manuscript reports experiments across multiple datasets with accuracy metrics, comparisons to SAM and other fine-tuning baselines, and efficiency measurements. We will revise the Experiments section to add explicit tables with training-time results, inference details for variable input sizes, and error analysis to better substantiate the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental proposal with no derivation chain

full rationale

The paper proposes GSAM as an empirical fine-tuning method that applies random cropping to SAM for variable input sizes, supported solely by experimental comparisons on multiple datasets. No equations, fitted parameters, or mathematical derivations are described that could reduce to self-referential inputs by construction. Claims of efficiency and accuracy rest on direct empirical results rather than any self-definition, renamed predictions, or load-bearing self-citations. The central contribution is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no equations, parameters, or new entities are specified. The approach relies on the known fixed-input property of SAM and the standard practice of random cropping as a data augmentation technique.

axioms (1)

domain assumption SAM requires fixed 1024x1024 input images
Explicitly stated in the abstract as the core limitation being addressed.

pith-pipeline@v0.9.0 · 5707 in / 1146 out tokens · 32756 ms · 2026-05-23T21:24:45.542133+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 10 internal anchors

[1]

Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation

Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K.: Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmen- tation. arXiv preprint arXiv:1802.06955 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Pattern recognition letters30(2), 88–97 (2009)

Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high- definition ground truth database. Pattern recognition letters30(2), 88–97 (2009)

work page 2009
[3]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Bubeck,S.,Chandrasekaran,V.,Eldan,R.,Gehrke,J.,Horvitz,E.,Kamar,E.,Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al.: Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

In: European conference on computer vision

Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin- unet: Unet-like pure transformer for medical image segmentation. In: European conference on computer vision. pp. 205–218. Springer (2022)

work page 2022
[5]

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.:Transunet:Transformersmakestrongencodersformedicalimagesegmentation. arXiv preprint arXiv:2102.04306 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2017)

Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se- mantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2017)

work page 2017
[7]

Rethinking Atrous Convolution for Semantic Image Segmentation

Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

In: Proceedings of the European conference on computer vision (ECCV)

Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 801–818 (2018)

work page 2018
[9]

Advances in Neural Information Processing Systems35, 16664–16678 (2022)

Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., Luo, P.: Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems35, 16664–16678 (2022)

work page 2022
[10]

arXiv preprint arXiv:2102.10882 (2021)

Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

work page arXiv 2021
[11]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

work page 2016
[12]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin,J.,Chang,M.W.,Lee,K.,Toutanova,K.:Bert:Pre-trainingofdeepbidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) Generalized SAM 15

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020), https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2010
[14]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

work page 2016
[15]

Segmentation of neuronal structures in em stacks challenge.https://imagej.net/ events/isbi-2012-segmentation-challenge (2012)

work page 2012
[16]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26

Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., De Lange, T., Johansen, D., Johansen, H.D.: Kvasir-seg: A segmented polyp dataset. In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26. pp. 451–462. Springer (2020)

work page 2020
[18]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)

work page 2023
[19]

Advances in neural information processing systems25 (2012)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con- volutional neural networks. Advances in neural information processing systems25 (2012)

work page 2012
[20]

arXiv preprint arXiv:2309.06824 (2023)

Lin, X., Xiang, Y., Zhang, L., Yang, X., Yan, Z., Yu, L.: Samus: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmen- tation. arXiv preprint arXiv:2309.06824 (2023)

work page arXiv 2023
[21]

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[22]

Mnih, V.: Machine Learning for Aerial Image Labeling. Ph.D. thesis, University of Toronto (2013)

work page 2013
[23]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

work page 2021
[24]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

work page 2022
[25]

In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, Oc- tober 5-9, 2015, proceedings, part III 18

Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed- ical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, Oc- tober 5-9, 2015, proceedings, part III 18. pp. 234–241. Springer (2015)

work page 2015
[26]

Multi-atlas labeling beyond the cranial vault - workshop and challenge.https: //www.synapse.org/#!Synapse:syn3193805/wiki/217789 (2015)

work page 2015
[27]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Advances in neural information pro- cessing systems30 (2017) 16 S

Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30 (2017) 16 S. Kato et al

work page 2017
[29]

In: Computer Vision–ECCV 2020: 16th European Confer- ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16

Xie, E., Wang, W., Wang, W., Ding, M., Shen, C., Luo, P.: Segmenting transpar- ent objects in the wild. In: Computer Vision–ECCV 2020: 16th European Confer- ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16. pp. 696–711. Springer (2020)

work page 2020
[30]

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[31]

arXiv preprint arXiv:2304.13785 (2023)

Zhang, K., Liu, D.: Customized segment anything model for medical image seg- mentation. arXiv preprint arXiv:2304.13785 (2023)

work page arXiv 2023
[32]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2881–2890 (2017)

work page 2017
[33]

Convolution meets lora: Parameter efficient finetuning for segment anything model

Zhong, Z., Tang, Z., He, T., Fang, H., Yuan, C.: Convolution meets lora: Parameter efficient finetuning for segment anything model. arXiv preprint arXiv:2401.17868 (2024)

work page arXiv 2024
[34]

Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th In- ternational Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Grana...

work page 2018

[1] [1]

Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation

Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K.: Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmen- tation. arXiv preprint arXiv:1802.06955 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[2] [2]

Pattern recognition letters30(2), 88–97 (2009)

Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high- definition ground truth database. Pattern recognition letters30(2), 88–97 (2009)

work page 2009

[3] [3]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Bubeck,S.,Chandrasekaran,V.,Eldan,R.,Gehrke,J.,Horvitz,E.,Kamar,E.,Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al.: Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

In: European conference on computer vision

Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin- unet: Unet-like pure transformer for medical image segmentation. In: European conference on computer vision. pp. 205–218. Springer (2022)

work page 2022

[5] [5]

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.:Transunet:Transformersmakestrongencodersformedicalimagesegmentation. arXiv preprint arXiv:2102.04306 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2017)

Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se- mantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2017)

work page 2017

[7] [7]

Rethinking Atrous Convolution for Semantic Image Segmentation

Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

In: Proceedings of the European conference on computer vision (ECCV)

Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 801–818 (2018)

work page 2018

[9] [9]

Advances in Neural Information Processing Systems35, 16664–16678 (2022)

Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., Luo, P.: Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems35, 16664–16678 (2022)

work page 2022

[10] [10]

arXiv preprint arXiv:2102.10882 (2021)

Chu, X., Tian, Z., Zhang, B., Wang, X., Shen, C.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)

work page arXiv 2021

[11] [11]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

work page 2016

[12] [12]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin,J.,Chang,M.W.,Lee,K.,Toutanova,K.:Bert:Pre-trainingofdeepbidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) Generalized SAM 15

work page internal anchor Pith review Pith/arXiv arXiv 2018

[13] [13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020), https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2010

[14] [14]

He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

work page 2016

[15] [15]

Segmentation of neuronal structures in em stacks challenge.https://imagej.net/ events/isbi-2012-segmentation-challenge (2012)

work page 2012

[16] [16]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[17] [17]

In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26

Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., De Lange, T., Johansen, D., Johansen, H.D.: Kvasir-seg: A segmented polyp dataset. In: MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26. pp. 451–462. Springer (2020)

work page 2020

[18] [18]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4015–4026 (2023)

work page 2023

[19] [19]

Advances in neural information processing systems25 (2012)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con- volutional neural networks. Advances in neural information processing systems25 (2012)

work page 2012

[20] [20]

arXiv preprint arXiv:2309.06824 (2023)

Lin, X., Xiang, Y., Zhang, L., Yang, X., Yan, Z., Yu, L.: Samus: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmen- tation. arXiv preprint arXiv:2309.06824 (2023)

work page arXiv 2023

[21] [21]

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[22] [22]

Mnih, V.: Machine Learning for Aerial Image Labeling. Ph.D. thesis, University of Toronto (2013)

work page 2013

[23] [23]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

work page 2021

[24] [24]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

work page 2022

[25] [25]

In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, Oc- tober 5-9, 2015, proceedings, part III 18

Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed- ical image segmentation. In: Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, Oc- tober 5-9, 2015, proceedings, part III 18. pp. 234–241. Springer (2015)

work page 2015

[26] [26]

Multi-atlas labeling beyond the cranial vault - workshop and challenge.https: //www.synapse.org/#!Synapse:syn3193805/wiki/217789 (2015)

work page 2015

[27] [27]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Advances in neural information pro- cessing systems30 (2017) 16 S

Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30 (2017) 16 S. Kato et al

work page 2017

[29] [29]

In: Computer Vision–ECCV 2020: 16th European Confer- ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16

Xie, E., Wang, W., Wang, W., Ding, M., Shen, C., Luo, P.: Segmenting transpar- ent objects in the wild. In: Computer Vision–ECCV 2020: 16th European Confer- ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16. pp. 696–711. Springer (2020)

work page 2020

[30] [30]

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[31] [31]

arXiv preprint arXiv:2304.13785 (2023)

Zhang, K., Liu, D.: Customized segment anything model for medical image seg- mentation. arXiv preprint arXiv:2304.13785 (2023)

work page arXiv 2023

[32] [32]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2881–2890 (2017)

work page 2017

[33] [33]

Convolution meets lora: Parameter efficient finetuning for segment anything model

Zhong, Z., Tang, Z., He, T., Fang, H., Yuan, C.: Convolution meets lora: Parameter efficient finetuning for segment anything model. arXiv preprint arXiv:2401.17868 (2024)

work page arXiv 2024

[34] [34]

Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th In- ternational Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Grana...

work page 2018