arxiv: 2605.18013 · v1 · pith:ERSLGMLHnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

Pith reviewed 2026-05-20 12:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video segmentation · memory compression · token selection · lightweight model · object tracking · SAM 2

The pith

TinySAM 2 reduces SAM 2 memory tokens to 7 percent while preserving 90 percent of video segmentation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TinySAM 2 as a compact video segmentation model built on SAM 2. It adds a memory quality management step to keep only high-value past frames and applies joint spatial-temporal compression to shrink the stored tokens. Average pooling first removes spatial redundancy within frames, after which similarity measurements pick the most useful tokens across the memory bank. A RepViT backbone further cuts parameter count. If these steps hold, the model supports accurate tracking and segmentation on devices that cannot run the full SAM 2.

Core claim

TinySAM 2 achieves 90 percent of SAM 2.1 performance on DAVIS and SA-V benchmarks using only 7 percent of the memory tokens and 3 percent of the training data. The gains come from a memory quality management mechanism that retains informative historical frames and a joint-spatial-temporal token compression method that applies average pooling in the spatial domain followed by similarity-based selection in the temporal domain across the memory bank. The image encoder is replaced with RepViT to lower overall model size.
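As a rough sanity check on the 7 percent figure, the arithmetic below shows how a spatial pooling factor and a temporal keep ratio would multiply into an overall token budget. The function names, the kernel size, and the keep ratio are hypothetical; the text above states neither value, and the sketch assumes the number of memory frames stays the same.

```python
def retained_fraction(pool_kernel: int, temporal_keep_ratio: float) -> float:
    """Fraction of the original memory tokens left after both stages.

    Assumes non-overlapping pool_kernel x pool_kernel average pooling,
    followed by keeping temporal_keep_ratio of the pooled tokens.
    """
    spatial_fraction = 1.0 / (pool_kernel * pool_kernel)
    return spatial_fraction * temporal_keep_ratio


def required_temporal_ratio(target_fraction: float, pool_kernel: int) -> float:
    """Temporal keep ratio needed to hit target_fraction after pooling."""
    return target_fraction * pool_kernel * pool_kernel


if __name__ == "__main__":
    # Hypothetical setting: 2 x 2 pooling, 7 percent overall budget.
    print(required_temporal_ratio(0.07, 2))
```

Under these assumed values, 2 x 2 pooling alone leaves a quarter of the tokens, so the temporal selection step would still need to discard most of what remains to reach 7 percent overall.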

What carries the argument

Joint-spatial-temporal token compression, which first applies average pooling to reduce spatial tokens inside each frame and then selects informative tokens across memory-bank frames by token-level similarity.
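The two-stage idea can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the selection rule (keep the tokens least similar to the mean token of the bank), the function names, and the default kernel size and keep ratio are all hypothetical stand-ins for details the summary does not give.

```python
import numpy as np


def avg_pool_tokens(frame: np.ndarray, k: int) -> np.ndarray:
    """Average-pool an (H, W, C) token grid with a non-overlapping k x k window.

    H and W must be divisible by k.
    """
    h, w, c = frame.shape
    return frame.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))


def select_tokens(bank: np.ndarray, keep: int):
    """Keep the `keep` tokens with the lowest cosine similarity to the bank mean.

    Assumption: a token that resembles the average token is redundant.
    Returns the kept tokens and their indices in the original order.
    """
    unit = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    center = unit.mean(axis=0)
    center = center / (np.linalg.norm(center) + 1e-8)
    redundancy = unit @ center
    idx = np.sort(np.argsort(redundancy)[:keep])
    return bank[idx], idx


def compress_memory(frames, k: int = 2, keep_ratio: float = 0.25):
    """Spatial pooling per frame, then token selection across all frames."""
    pooled = [avg_pool_tokens(f, k).reshape(-1, f.shape[-1]) for f in frames]
    bank = np.concatenate(pooled, axis=0)
    keep = max(1, int(round(keep_ratio * bank.shape[0])))
    return select_tokens(bank, keep)
```

Pooling runs first so that the similarity computation operates on a smaller bank, which is the ordering the text describes. Any other scoring rule could be swapped into `select_tokens` without changing the surrounding structure.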

If this is right

  • The model runs with far lower memory and compute, opening video segmentation to hardware with limited capacity.
  • Experiments confirm the approach works on standard benchmarks while cutting storage and training costs.
  • The same compression pattern can be tested on other memory-dependent video models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adaptive selection thresholds might further improve results on videos of varying complexity.
  • The technique could combine with other lightweight encoders to push efficiency gains higher.
  • Real-time applications on mobile devices become more feasible once memory footprint falls this far.

Load-bearing premise

Average pooling plus similarity-based selection across the memory bank keeps every detail required for correct long-term tracking even when objects move quickly or become hidden.

What would settle it

A clear accuracy drop on video clips that contain fast motion or long occlusions would show the compression has lost critical information.
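A test of this kind could be scored with the region similarity J (the Jaccard index between predicted and ground-truth masks) computed separately on a hard subset. The sketch below is a minimal version of that measurement; the helper names and the idea of a hand-picked fast-motion or occlusion subset are assumptions for illustration, and no results are implied.

```python
import numpy as np


def region_similarity(pred, gt) -> float:
    """Jaccard index J between two binary masks.

    Returns 1.0 when both masks are empty, following the usual DAVIS convention.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def retention(compressed_scores, full_scores) -> float:
    """Mean score of the compressed model divided by that of the full model."""
    return float(np.mean(compressed_scores) / np.mean(full_scores))
```

Comparing `retention` on the hard subset against `retention` on the whole benchmark would show whether the compression loses disproportionately more on fast motion or long occlusions.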

Figures

Figures reproduced from arXiv: 2605.18013 by Han Shu, Xinghao Chen, Yijing Yang, Zhaoyuan Ding.

Figure 1
Figure 1: Standard Semi-Supervised Video Object Segmentation Benchmark Comparison. The average J&F performance (y-axis) of TinySAM 2, SAM 2.1, SAM-I2V and other models on datasets such as SA-V, plotted against token length (x-axis). The size of the circles represents the amount of data used for model training. These data are from …
Figure 2
Figure 2: Overall architecture of TinySAM 2. Our work focuses on the colored modules in the diagram. We adopt a lighter image encoder based on SAM 2 to reduce the number of parameters. A spatial pooling compression is performed on individual memory frames before passing to the memory bank, and a temporal token compression module is proposed to compress tokens across multiple frames by the token selection. In additio…
Figure 3
Figure 3: Qualitative comparison of TinySAM 2, SAM 2.1 and SAM-I2V in multi-instance segmentation. We visually demonstrate the results by uniformly sampling frames from a complete video that tracks five sheep of different sizes with alternating positions within a flock. Our TinySAM 2 generates masks of comparable quality to SAM 2.1, whereas SAM-I2V encounters tracking errors.
Figure 4
Figure 4: Qualitative comparison of TinySAM 2 and SAM 2.1 across different instances and scenarios. We visualize frames from different videos involving single-person/multi-person scenes, single-animal/multi-animal scenes, and indoor/outdoor scenes, where TinySAM 2 performs comparably to SAM 2.1.
Figure 5
Figure 5: Qualitative comparison of TinySAM 2, SAM 2.1 and SAM-I2V in videos across different scenarios. We visually demonstrate the results by uniformly sampling frames from complete videos of different scenes. The top example is a camel, the bottom one is an aquarium. In the top example, SAM-I2V incorrectly tracks objects behind the target, whereas TinySAM 2 clearly outperforms SAM-I2V in this aspect …
Figure 6
Figure 6: Qualitative comparison of TinySAM 2, SAM 2.1 and SAM-I2V on the SA-V dataset. The challenge of the SA-V dataset lies in the longer video lengths and the fact that the tracked objects are either partial or relatively small in size. The example on top shows a birthday scene, while the one below depicts an airport. SAM-I2V encounters issues with tracking failures or incomplete tracking when the target is occl…
Original abstract

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the complex computational characteristics of SAM 2's multi-stage image encoder and memory module have raised the barrier to the model's deployment in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain high-informative historical frames as the memory. In addition, a joint-spatial-temporal token compression is proposed that reduces the memory storage and computational cost. Specifically, average pooling is employed to first compress redundancy tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on token-level similarity measurement. Besides, we take RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1, with only 7% memory tokens and 3% training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TinySAM 2, a compressed version of SAM 2 for video object segmentation and tracking. It adds a memory quality management mechanism to retain high-informative historical frames, introduces joint spatial-temporal token compression (average pooling in the spatial domain followed by similarity-based selection across the memory bank in the temporal domain), and substitutes RepViT for the image encoder. Experiments on DAVIS and SA-V are reported to show that TinySAM 2 retains 90% of SAM 2.1 performance while using only 7% memory tokens and 3% training data.

Significance. If the performance retention claims can be substantiated through detailed ablations and statistical reporting, the work would offer a practical route to deploying video foundation models on memory-constrained devices. The deterministic compression approach avoids the need for full retraining and could generalize to other memory-bank-based video models.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: The headline claim of 90% performance retention with 7% memory tokens and 3% training data is presented without error bars, standard deviations across runs, or explicit description of the measurement protocol (including whether post-hoc selection of high-informative frames was applied uniformly). This makes it impossible to judge the robustness of the central efficiency-performance tradeoff.
  2. [Methods (joint-spatial-temporal compression)] Methods paragraph on joint-spatial-temporal compression: Average pooling is used to compress spatial tokens before temporal similarity selection, yet no component-wise ablation isolates the pooling step's effect on boundary-sensitive metrics such as J&F on DAVIS. Because pooling inherently smooths high-frequency edge information, this choice risks systematic degradation in long-term re-identification after occlusion; the absence of such an ablation leaves the information-preservation assumption untested.
minor comments (2)
  1. [Abstract] The abstract should explicitly state the SAM 2.1 baseline version and the precise definition of 'memory tokens' used to compute the 7% figure.
  2. Add a short related-work paragraph contrasting the proposed deterministic selection with learned memory compression methods in prior video segmentation literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our efficiency claims. We address each major comment below and have revised the manuscript to incorporate additional statistical reporting and ablations.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The headline claim of 90% performance retention with 7% memory tokens and 3% training data is presented without error bars, standard deviations across runs, or explicit description of the measurement protocol (including whether post-hoc selection of high-informative frames was applied uniformly). This makes it impossible to judge the robustness of the central efficiency-performance tradeoff.

    Authors: We agree that error bars and a precise protocol description are necessary to substantiate the central claims. In the revised manuscript we now report standard deviations over three independent runs with different random seeds for the key J&F and mIoU metrics on both DAVIS and SA-V. We have also expanded the Experiments section to explicitly state that frame selection follows the deterministic memory quality management mechanism uniformly across all sequences, with no post-hoc selection of high-informative frames. These additions allow readers to assess the robustness of the reported 90% retention figure.
    revision: yes

  2. Referee: [Methods (joint-spatial-temporal compression)] Methods paragraph on joint-spatial-temporal compression: Average pooling is used to compress spatial tokens before temporal similarity selection, yet no component-wise ablation isolates the pooling step's effect on boundary-sensitive metrics such as J&F on DAVIS. Because pooling inherently smooths high-frequency edge information, this choice risks systematic degradation in long-term re-identification after occlusion; the absence of such an ablation leaves the information-preservation assumption untested.

    Authors: We acknowledge the value of isolating the spatial pooling component. Although the original submission contained an ablation of the combined spatial-temporal compression, it did not separate the pooling step. In the revised manuscript we have added a dedicated component-wise ablation that measures J&F on DAVIS with and without spatial average pooling (while keeping temporal similarity selection fixed). The results show only a small degradation, indicating that the subsequent temporal selection largely preserves the information needed for re-identification after occlusion. A short discussion of this finding has been inserted in the Methods section.
    revision: yes
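The reporting described in the first response (a mean and standard deviation over independent seeds) can be sketched with the Python standard library. The helper names are hypothetical and the sketch uses the sample standard deviation, which is an assumption; the rebuttal does not say which estimator is used.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return mean(scores), stdev(scores)


def format_result(scores, digits: int = 1) -> str:
    """Format per-seed scores as 'mean ± std' for a results table."""
    m, s = summarize_runs(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f}"
```

With only three runs the deviation is a coarse estimate, so it is best read as an indication of seed sensitivity rather than a confidence interval.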

Circularity Check

0 steps flagged

No circularity: TinySAM 2 reports empirical results from deterministic compression rules

Full rationale

The paper's core claims rest on an explicit algorithmic pipeline—memory quality management followed by joint spatial-temporal compression via average pooling then similarity-based token selection—whose outputs are measured directly against external benchmarks (DAVIS, SA-V). No equations or derivations are presented that reduce the reported 90% performance retention, 7% memory tokens, or 3% training data to fitted parameters or self-referential definitions by construction. The compression steps are described as fixed, non-learned operations independent of the evaluation metrics, and no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The result is therefore a standard empirical engineering report whose performance numbers are not forced by the method's own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central performance numbers rest on the unstated assumption that the chosen compression ratios and similarity thresholds generalize beyond the two evaluation datasets; no explicit free parameters are named, but the token selection process implicitly introduces at least one tunable similarity threshold and a fixed pooling kernel size.

free parameters (2)
  • token similarity threshold
    Used to decide which temporal tokens to retain; value not stated but required for the pruning step to achieve the reported 7% memory reduction.
  • spatial pooling kernel size
    Determines how much spatial redundancy is removed; average pooling is mentioned but exact window size is unspecified.
axioms (1)
  • domain assumption: High-informative historical frames can be reliably identified by a quality management mechanism without access to future frames.
    Invoked in the first proposed component; if false, the memory bank would retain low-value frames and degrade tracking.
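To make the axiom concrete, here is a minimal sketch of a causal memory bank that decides whether to keep each frame using only a per-frame quality score available at the time the frame is processed. The class, its fields, the threshold, and the eviction rule are hypothetical; the source of the quality score (for example a predicted mask confidence) is likewise an assumption, since the summary does not describe the mechanism in detail.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    capacity: int                 # maximum number of stored frames
    min_quality: float            # frames scoring below this are never stored
    entries: list = field(default_factory=list)  # (frame_index, quality) pairs

    def offer(self, frame_index: int, quality: float) -> bool:
        """Consider a newly processed frame; return True if it ends up stored.

        Only the current frame's score and already stored entries are used,
        so no future frames are needed.
        """
        if quality < self.min_quality:
            return False
        self.entries.append((frame_index, quality))
        if len(self.entries) > self.capacity:
            # Evict the lowest-quality entry; ties go to the oldest frame.
            worst = min(self.entries, key=lambda e: (e[1], e[0]))
            self.entries.remove(worst)
            return worst[0] != frame_index
        return True
```

If the quality score were unreliable, this gate would admit poor frames or reject useful ones, which is the failure mode the ledger entry points to.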

pith-pipeline@v0.9.0 · 5800 in / 1426 out tokens · 28998 ms · 2026-05-20T12:11:37.871413+00:00 · methodology

Supplementary material (excerpt)

We further demonstrate the visual comparisons of TinySAM 2, SAM 2.1, and SAM-I2V across different videos. Figure 5 illustrates the ability to accurately segment movements among animals. Figure 6 demonstrates the capability to precisely track small-sized or local parts of objects over more than 400 frames of video. In Figure 5, starting at 40% o...