arxiv: 2510.18822 · v4 · pith:CUVGAYNKnew · submitted 2025-10-21 · 💻 cs.CV

SAM 2++: Tracking Anything at Any Granularity

Pith reviewed 2026-05-21 19:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified video tracking · multi-granularity tracking · task-adaptive memory · object tracking · video dataset · prompt encoding · SAM 2

The pith

SAM 2++ creates a single model for video tracking at mask, box, or point granularity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to replace separate trackers built for one target representation with one system that works across masks for precise outlines, boxes for rough positions, and points for centers. Task-specific designs currently waste multi-task data and duplicate model parts because they miss the shared tracking operations across these levels of detail. SAM 2++ adds task-specific prompts that turn any input into shared embeddings, a decoder that outputs results in a common format, and a memory system that matches across tasks while keeping each granularity's meaning separate. A new dataset supplies the varied annotations needed to train and test this approach. If the design holds, training becomes more efficient and one model can replace several specialized ones on many benchmarks.
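
The relationship between the three granularities can be made concrete with a small sketch. It assumes, following the reading above, that a box is the tightest rectangle around a mask and that a point is the box center; the type aliases and function names are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Mask = List[List[int]]            # binary mask as rows of 0/1 (illustrative type)
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixels
Point = Tuple[float, float]       # (x, y)


def mask_to_box(mask: Mask) -> Box:
    """Return the tightest axis-aligned box around the foreground pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def box_to_point(box: Box) -> Point:
    """Return the box center (assumption: a point target is the box center)."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2.0, (y0 + y1) / 2.0
```

Each step discards detail, which is one way to see why a tracker built for one representation does not directly serve the others.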

Core claim

SAM 2++ unifies video tracking tasks at different granularities through task-specific prompts, a Unified Decoder, and a task-adaptive memory mechanism that unifies memory while preserving distinct state semantics, together with the new Tracking-Any-Granularity dataset, achieving state of the art across tasks.
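
As a rough illustration of the prompt-encoding idea, the sketch below maps inputs of different widths into embeddings of one shared width. Everything in it is an assumption made for illustration: the embedding size, the input sizes, the random linear maps, and the names `ENCODERS` and `encode_prompt`. It does not reproduce the paper's encoders, and the Unified Decoder is not modeled.

```python
import random
from typing import Dict, List, Sequence

EMBED_DIM = 8  # hypothetical shared embedding width

# Hypothetical input widths: a point is (x, y), a box is four coordinates,
# and a mask is a flattened 4x4 grid.
PROMPT_INPUT_DIMS: Dict[str, int] = {"point": 2, "box": 4, "mask": 16}


def make_linear(in_dim: int, out_dim: int, seed: int) -> List[List[float]]:
    """Build a fixed random weight matrix; it stands in for learned parameters."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weight: List[List[float]], x: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# One task-specific encoder per granularity; only the output width is shared.
ENCODERS = {
    task: make_linear(in_dim, EMBED_DIM, seed=i)
    for i, (task, in_dim) in enumerate(PROMPT_INPUT_DIMS.items())
}


def encode_prompt(task: str, values: Sequence[float]) -> List[float]:
    """Map a task-specific prompt into the shared embedding space."""
    if task not in ENCODERS:
        raise KeyError(f"unknown task: {task}")
    if len(values) != PROMPT_INPUT_DIMS[task]:
        raise ValueError(f"{task} prompt expects {PROMPT_INPUT_DIMS[task]} values")
    return apply_linear(ENCODERS[task], values)
```

Whatever the prompt type, downstream components in this sketch only ever see a vector of length `EMBED_DIM`, which is the property the core claim depends on.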

What carries the argument

A task-adaptive memory mechanism that unifies memory representations across granularities while preserving distinct state semantics, so that the granularities do not interfere with one another.

If this is right

  • Multi-task training data from different granularities can be combined without separate models.
  • Model design and parameter counts become less redundant across tracking tasks.
  • Performance reaches state of the art on benchmarks for masks, boxes, and points.
  • A single robust framework replaces task-specific trackers for video tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared components could extend to other video tasks that mix coarse and fine outputs.
  • The new dataset offers a way to study how granularity choice affects tracking accuracy in practice.
  • Real-time systems could maintain one model instead of switching between multiple granularity-specific versions.

Load-bearing premise

The task-adaptive memory mechanism can unify memory representations across granularities while preserving distinct state semantics and avoiding interference from full parameter sharing.
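
The page does not say how the adaptation is implemented. One possible reading, offered here only as an assumption, is a shared projection plus a small low-rank residual per task, in the style of LoRA. The dimensions, weights, and the names `SHARED`, `ADAPTERS`, and `memory_project` are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

DIM, RANK = 6, 2  # hypothetical feature width and adapter rank

Matrix = List[List[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random, scale: float) -> Matrix:
    """Fixed random weights standing in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


_rng = random.Random(0)
SHARED: Matrix = rand_matrix(DIM, DIM, _rng, 1.0)  # weights shared by all tasks

# Per-task (down, up) pair; assumed design, not confirmed by the source.
ADAPTERS: Dict[str, Tuple[Matrix, Matrix]] = {
    task: (rand_matrix(RANK, DIM, _rng, 0.1), rand_matrix(DIM, RANK, _rng, 0.1))
    for task in ("mask", "box", "point")
}


def memory_project(task: str, feature: Sequence[float], use_adapter: bool = True) -> List[float]:
    """Project a memory feature with shared weights plus an optional task residual."""
    out = matvec(SHARED, feature)
    if use_adapter:
        down, up = ADAPTERS[task]
        delta = matvec(up, matvec(down, feature))
        out = [o + d for o, d in zip(out, delta)]
    return out
```

Calling `memory_project` with `use_adapter=False` corresponds to full parameter sharing, the setting the premise says would cause interference.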

What would settle it

A controlled test that removes the task-adaptive memory and shows equal or higher accuracy on all three granularities would indicate the adaptation is not required.
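
The decision rule in that test can be written down directly. The sketch below assumes one scalar score per granularity where higher is better; the function name and the `tolerance` parameter are hypothetical, and no scores from the paper are used.

```python
from typing import Dict

GRANULARITIES = ("mask", "box", "point")


def adaptation_not_required(
    with_memory: Dict[str, float],
    without_memory: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Return True if the ablated model matches or beats the full model on
    every granularity (higher score assumed better)."""
    missing = [
        g for g in GRANULARITIES if g not in with_memory or g not in without_memory
    ]
    if missing:
        raise KeyError(f"missing scores for: {missing}")
    return all(
        without_memory[g] + tolerance >= with_memory[g] for g in GRANULARITIES
    )
```

A single granularity where the ablated model falls short is enough for the function to return False, which matches the "all three granularities" wording of the test.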

Figures

Figures reproduced from arXiv: 2510.18822 by Cheng Liang, Chenkai Zeng, Gangshan Wu, Jiaming Zhang, Kai Ma, Limin Wang, Xinwen Zhang, Xin Zhou, Yichun Yang, Yutao Cui.

Figure 1: The overall of SAM 2++, including (a) tracking any granularity task, (b) our unified tracking foundation model, and (c) our …
Figure 2: The SAM 2++ architecture. When a new frame is received, the result is conditioned on the new prompt …
Figure 3: Annotation pipeline of our Tracking-Any-Granularity dataset.
Figure 4: Examples of Tracking-Any-Granularity Dataset.
Figure 5: Statistics on sources and attributes distribution of Tracking-Any-Granularity Dataset. The link in (c) reflects the frequent co…
Figure 6: Visualization of memory design at different components and granularities. In the visualization of each component, the left side …
Figure 7: Example videos from the Tracking-Any-Granularity dataset with annotation at various granularities. Each annotation has a …
Figure 8: Examples from SAM 2++ results on video benchmarks at various granularities.
Figure 9: Comparison between our model and various SOTA methods on video tracking benchmarks at three granularities. Better viewing …
Original abstract

Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task, which specificity limits their generalization, preventing them from effectively utilizing multi-task training data and leading to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, preventing full parameter sharing from causing interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces SAM 2++, a unified video tracking framework that handles targets at varying granularities (masks, boxes, points) via task-specific prompt mapping into general embeddings, a Unified Decoder producing common outputs, and a task-adaptive memory mechanism that unifies memory representations while preserving distinct state semantics and avoiding cross-granularity interference. It further contributes the Tracking-Any-Granularity dataset constructed via a phased annotation engine and reports state-of-the-art results across diverse tracking tasks.

Significance. If the central unification claims hold with supporting evidence, the work would offer a meaningful step toward reducing task-specific redundancy in trackers and enabling effective multi-task training. The new dataset would serve as a useful benchmark resource for analyzing granularity-agnostic tracking.

major comments (1)
  1. [Description of task-adaptive memory mechanism] The task-adaptive memory mechanism is load-bearing for the unification claim, yet the manuscript provides no ablation studies or quantitative comparisons demonstrating that full parameter sharing causes measurable interference or degradation across mask/box/point granularities. No details are given on the adaptation implementation (conditioning vectors, adapters, or state disentanglement) that would allow verification that distinct semantics are preserved during memory matching. This directly affects the robustness of the core tracking operation.
minor comments (1)
  1. [Abstract] The abstract states that comprehensive experiments confirm SOTA results but does not reference specific baselines, data splits, or error bars; if these details appear only in later sections, cross-referencing them in the abstract would strengthen the high-level claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the revisions we will make to strengthen the evidence for the task-adaptive memory mechanism.

Point-by-point responses
  1. Referee: [Description of task-adaptive memory mechanism] The task-adaptive memory mechanism is load-bearing for the unification claim, yet the manuscript provides no ablation studies or quantitative comparisons demonstrating that full parameter sharing causes measurable interference or degradation across mask/box/point granularities. No details are given on the adaptation implementation (conditioning vectors, adapters, or state disentanglement) that would allow verification that distinct semantics are preserved during memory matching. This directly affects the robustness of the core tracking operation.

    Authors: We agree that the current manuscript would be strengthened by explicit ablation studies and expanded implementation details for the task-adaptive memory. The design is intended to unify memory representations across granularities while using task-specific adaptation to preserve distinct state semantics and prevent interference, but we acknowledge that quantitative comparisons against a fully shared baseline are not reported. In the revised version, we will add ablation experiments that measure performance degradation from cross-granularity interference under full parameter sharing, along with concrete details on the adaptation implementation including conditioning vectors, adapter modules, and the state disentanglement process used during memory matching. These additions will directly support the robustness claims for the core tracking operation. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on new mechanisms, dataset, and experiments without reduction to fitted inputs or self-citations

full rationale

The abstract and provided text describe an integrated design of prompt encoding, output decoding, and task-adaptive memory for unifying mask/box/point tracking granularities, plus a new Tracking-Any-Granularity dataset constructed via a data engine. No equations, parameter-fitting steps, or derivations are presented that would make any 'prediction' equivalent to its inputs by construction. No self-citation load-bearing arguments, uniqueness theorems imported from the same authors, or ansatzes smuggled via prior work appear in the given material. The SOTA claims are tied to comprehensive experiments on the new dataset rather than tautological redefinitions or forced statistical outcomes from subsets of prior data. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a common tracking principle across granularities and on the effectiveness of the newly introduced task-adaptive memory; no major new physical entities are postulated.

free parameters (1)
  • task-specific prompt mapping parameters
    Learned or hand-chosen parameters that convert mask, box, and point inputs into shared prompt embeddings.
axioms (1)
  • domain assumption: There exists a common tracking principle behind different granularities that permits a shared architecture without task-specific redesign of the overall pipeline.
    Invoked in the abstract when stating that the design unifies prompt encoding, output decoding, and memory representation.

pith-pipeline@v0.9.0 · 5846 in / 1386 out tokens · 105007 ms · 2026-05-21T19:46:15.926427+00:00 · methodology



Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 4 internal anchors

  1. [1]

    Burst: A benchmark for unifying object recognition, segmentation and tracking in video

    Ali Athar, Jonathon Luiten, Paul V oigtlaender, Tarasha Khu- rana, Achal Dave, Bastian Leibe, and Deva Ramanan. Burst: A benchmark for unifying object recognition, segmentation and tracking in video. InProceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 1674–1683, 2023. 6, 7

  2. [2]

    Track-on: Transformer-based online point tracking with memory

    G ¨orkay Aydemir, Xiongyi Cai, Weidi Xie, and Fatma G¨uney. Track-on: Transformer-based online point tracking with memory. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24- 28, 2025. OpenReview.net, 2025. 7

  3. [3]

    Ar- trackv2: Prompting autoregressive tracker where to look and how to describe

    Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. Ar- trackv2: Prompting autoregressive tracker where to look and how to describe. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19048– 19057, 2024. 7

  4. [4]

    A benchmark and simulator for uav track- ing

    UT Benchmark. A benchmark and simulator for uav track- ing. InEuropean conference on computer vision, 2016. 6

  5. [5]

    Creatures great and SMAL: Recovering the shape and motion of animals from video

    Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, and Roberto Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. InACCV, 2018. 1, 7

  6. [6]

    Hiptrack: Visual tracking with historical prompts

    Wenrui Cai, Qingjie Liu, and Yunhong Wang. Hiptrack: Visual tracking with historical prompts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19258–19267, 2024. 7

  7. [7]

    Ro- bust object modeling for visual tracking

    Yidong Cai, Jie Liu, Jie Tang, and Gangshan Wu. Ro- bust object modeling for visual tracking. InProceedings of the IEEE/CVF international conference on computer vision, pages 9589–9600, 2023. 7

  8. [8]

    Pix2seq: A language modeling framework for object detection.arXiv preprint arXiv:2109.10852, 2021

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Ge- offrey Hinton. Pix2seq: A language modeling framework for object detection.arXiv preprint arXiv:2109.10852, 2021. 1, 2

  9. [9]

    Sam-adapter: Adapting segment anything in underperformed scenes

    Tianrun Chen, Lanyun Zhu, Chaotao Deng, Runlong Cao, Yan Wang, Shangzhan Zhang, Zejian Li, Lingyun Sun, Ying Zang, and Papa Mao. Sam-adapter: Adapting segment anything in underperformed scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3367–3375, 2023. 2

  10. [10]

    Seqtrack: Sequence to sequence learning for visual ob- ject tracking

    Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual ob- ject tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14572– 14581, 2023. 7

  11. [11]

    Ho Kei Cheng and Alexander G. Schwing. Xmem: Long- term video object segmentation with an atkinson-shiffrin memory model. InComputer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVIII, 2022. 7

  12. [12]

    Rethink- ing space-time networks with improved memory coverage for efficient video object segmentation

    Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethink- ing space-time networks with improved memory coverage for efficient video object segmentation. InAdvances in Neu- ral Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021. 7

  13. [13]

    Price, Alexan- der G

    Ho Kei Cheng, Seoung Wug Oh, Brian L. Price, Alexan- der G. Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. InIEEE/CVF Interna- tional Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, 2023. 7

  14. [14]

    Price, Joon-Young Lee, and Alexander G

    Ho Kei Cheng, Seoung Wug Oh, Brian L. Price, Joon-Young Lee, and Alexander G. Schwing. Putting the object back into video object segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, 2024. 7

  15. [15]

    Local all-pair corre- spondence for point tracking

    Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seun- gryong Kim, and Joon-Young Lee. Local all-pair corre- spondence for point tracking. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part X, pages 306–325. Springer, 2024. 7

  16. [16]

    Mixformer: End-to-end tracking with iterative mixed atten- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Yutao Cui, Cheng Jiang, Gangshan Wu, and Limin Wang. Mixformer: End-to-end tracking with iterative mixed atten- tion.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 7

  17. [17]

    Samwise: Infusing wisdom in sam2 for text-driven video segmentation, 2024

    Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Samwise: Infusing wisdom in sam2 for text-driven video segmentation, 2024. 2

  18. [18]

    Epic-kitchens visor benchmark: Video segmenta- tions and object relations

    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. Epic-kitchens visor benchmark: Video segmenta- tions and object relations. InAdvances in Neural Information Processing Systems, pages 13745–13758. Curran Associates, Inc., 2022. 7

  19. [19]

    Torr, and Song Bai

    Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H.S. Torr, and Song Bai. Mose: A new dataset for video object segmentation in complex scenes. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV), pages 20224–20234, 2023. 4, 6, 7, 1

  20. [20]

    Sam2long: Enhancing sam 2 for long video seg- mentation with a training-free memory tree.arXiv preprint arXiv:2410.16268, 2024

    Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, and Jiaqi Wang. Sam2long: Enhancing sam 2 for long video seg- mentation with a training-free memory tree.arXiv preprint arXiv:2410.16268, 2024. 2

  21. [21]

    TAP-vid: A benchmark for track- ing any point in a video.Advances in Neural Information Processing Systems, 35:13610–13626, 2022

    Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Re- casens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. TAP-vid: A benchmark for track- ing any point in a video.Advances in Neural Information Processing Systems, 35:13610–13626, 2022. 1, 4, 6, 7

  22. [22]

    Tapir: Tracking any point with per-frame initialization and temporal refinement

    Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10061– 10072, 2023. 7

  23. [23]

    Lasot: A high-quality benchmark for large-scale single ob- ject tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single ob- ject tracking. InIEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019. 1, 4, 6

  24. [24]

    Generalized relation modeling for transformer tracking

    Shenyuan Gao, Chunluan Zhou, and Jun Zhang. Generalized relation modeling for transformer tracking. InProceedings of 9 the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18686–18695, 2023. 7

  25. [25]

    Learn- ing to linearize under uncertainty.Advances in neural infor- mation processing systems, 28, 2015

    Ross Goroshin, Michael F Mathieu, and Yann LeCun. Learn- ing to linearize under uncertainty.Advances in neural infor- mation processing systems, 28, 2015. 3

  26. [26]

    Tag: Tracking at any granularity, 2024

    Adam Harley, Yang You, Yang Zheng, Xinglong Sun, Nikhil Raghuraman, Sheldon Liang, Wen-Hsuan Chu, Suya You, Achal Dave, Pavel Tokmakov, Rares Ambrus, Katerina Fragkiadaki, and Leonidas Guibas. Tag: Tracking at any granularity, 2024. 7, 8

  27. [27]

    Particle video revisited: Tracking through occlusions using point trajectories

    Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. InEuropean Conference on Computer Vi- sion, pages 59–75. Springer, 2022. 7

  28. [28]

    Lvos: A benchmark for long-term video object segmentation

    Lingyi Hong, Wenchao Chen, Zhongying Liu, Wei Zhang, Pinxue Guo, Zhaoyu Chen, and Wenqiang Zhang. Lvos: A benchmark for long-term video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13480–13492, 2023. 1, 6

  29. [29]

    Lvos: A benchmark for large- scale long-term video object segmentation.arXiv preprint arXiv:2404.19326, 2024

    Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu Chen, Shuyong Gao, et al. Lvos: A benchmark for large- scale long-term video object segmentation.arXiv preprint arXiv:2404.19326, 2024. 6, 7

  30. [30]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Represen- tations, ICLR 2022, Virtual Event, April 25-29, 2022. Open- Review.net, 2022. 4

  31. [31]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 43(5):1562–1577, 2021

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 43(5):1562–1577, 2021. 1, 4, 6, 7

  32. [32]

    Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation

    Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, and Si Liu. Unleashing the temporal-spatial reasoning capacity of gpt for training-free audio and language referenced video object segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3715–3723, 2025. 2

  33. [33]

    Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos.CoRR, abs/2410.11831, 2024. 7

  34. [34]

    Co- tracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker: It is better to track together. InComputer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXII, pages 18–35. Springer, 2024. 7

  35. [35]

    Segment anything in high quality

    Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. InNeurIPS, 2023. 2

  36. [36]

    Need for speed: A benchmark for higher frame rate object tracking

    Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. InProceedings of the IEEE International Conference on Computer Vision (ICCV),

  37. [37]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything.arXiv:2304.02643, 2023. 2, 3, 4

  38. [38]

    TAPTR: tracking any point with transformers as detection

    Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, and Lei Zhang. TAPTR: tracking any point with transformers as detection. InComputer Vi- sion - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XVI, pages 57–75. Springer, 2024. 7

  39. [39]

    Onevos: unifying video object segmentation with all-in-one transformer framework

    Wanyun Li, Pinxue Guo, Xinyu Zhou, Lingyi Hong, Yangji He, Xiangyu Zheng, Wei Zhang, and Wenqiang Zhang. Onevos: unifying video object segmentation with all-in-one transformer framework. InEuropean Conference on Com- puter Vision, pages 20–40. Springer, 2024. 7

  40. [40]

    Tracking meets lora: Faster training, larger model, stronger performance

    Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, and Haibin Ling. Tracking meets lora: Faster training, larger model, stronger performance. InEuropean Confer- ence on Computer Vision, pages 300–318. Springer, 2024. 7

  41. [41]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014. 4, 1

  42. [42]

    SAMRefiner: Taming segment anything model for universal mask refine- ment

    Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, and Kaipeng Zhang. SAMRefiner: Taming segment anything model for universal mask refine- ment. InThe Thirteenth International Conference on Learn- ing Representations, 2025. 2

  43. [43]

    Hota: A higher order metric for evaluating multi-object tracking.International Journal of Computer Vision, pages 1–31, 2020

    Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taix´e, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking.International Journal of Computer Vision, pages 1–31, 2020. 7

  44. [44]

    Trackingnet: A large-scale dataset and benchmark for object tracking in the wild

    Matthias Muller, Adel Bibi, Silvio Giancola, Salman Al- subaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. InThe European Conference on Computer Vision (ECCV), 2018. 1, 4, 6, 7

  45. [45]

    Simtrack: A simulation- based framework for scalable real-time object pose detection and tracking

    Karl Pauwels and Danica Kragic. Simtrack: A simulation- based framework for scalable real-time object pose detection and tracking. In2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1300–1307. IEEE, 2015. 7

  46. [46]

    Vast- track: Vast category visual object tracking

    Liang Peng, Junyuan Gao, Xinran Liu, Weihong Li, Shaohua Dong, Zhipeng Zhang, Heng Fan, and Libo Zhang. Vast- track: Vast category visual object tracking. InAdvances in Neural Information Processing Systems 38: Annual Con- ference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. 6, 7

  47. [47]

    A benchmark dataset and evaluation methodology for video 10 object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video 10 object segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 7

  48. [48]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- bel´aez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017. 1, 4, 6

  49. [49]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica P ˘atr˘aucean, Lucas Smaira, Ankush Gupta, Adri`a Re- casens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osin- dero, Dima Da...

  50. [50]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

  51. [51]

    Breaking the “object” in video object segmentation

    Pavel Tokmakov, Jie Li, and Adrien Gaidon. Breaking the “object” in video object segmentation. InCVPR, 2023. 6, 7

  52. [52]

    A distractor-aware memory for visual object tracking with sam2.arXiv preprint arXiv:2411.17576, 2024

    Jovana Videnovic, Alan Lukezic, and Matej Kristan. A distractor-aware memory for visual object tracking with sam2.arXiv preprint arXiv:2411.17576, 2024. 2

  53. [53]

    Omnitracker: Unifying visual object tracking by tracking-with-detection

    Junke Wang, Zuxuan Wu, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, and Yu-Gang Jiang. Omnitracker: Unifying visual object tracking by tracking-with-detection. IEEE Trans. Pattern Anal. Mach. Intell., 47(4):3159–3174,

  54. [54]

    Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark

    Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algo- rithms and benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13763–13773, 2021. 6, 7

  55. [55]

    Do different tracking tasks require different appearance models?

    Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin Wang, Philip Torr, and Luca Bertinetto. Do different tracking tasks require different appearance models? Advances in Neural Information Processing Systems, 34:726–738, 2021. 1, 2

  56. [56]

    Autoregressive visual tracking

    Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yihong Gong. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9697–9706, 2023. 7

  57. [57]

    DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks

    Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B Chan. DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14561–14571, 2023. 7

  58. [58]

    Object tracking benchmark

    Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015. 6

  59. [59]

    CAT-SAM: Conditional tuning for few-shot adaptation of segment anything model

    Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Ruijie Ren, Xiaoqin Zhang, Ling Shao, and Shijian Lu. CAT-SAM: Conditional tuning for few-shot adaptation of segment anything model. In European Conference on Computer Vision, pages 189–206. Springer, 2024. 2

  60. [60]

    YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark

    Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018. 1, 4, 6, 7

  61. [61]

    Integrating boxes and masks: A multi-object framework for unified visual tracking and segmentation

    Yuanyou Xu, Zongxin Yang, and Yi Yang. Integrating boxes and masks: A multi-object framework for unified visual tracking and segmentation. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 9704–9717. IEEE, 2023. 7

  62. [62]

    Learning spatio-temporal transformer for visual tracking

    Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10448–10457, 2021. 4

  63. [63]

    Towards grand unification of object tracking

    Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Towards grand unification of object tracking. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXI, pages 733–751. Springer, 2022. 1, 2, 7

  64. [64]

    Universal instance perception as object discovery and retrieval

    Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, and Huchuan Lu. Universal instance perception as object discovery and retrieval. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 15325–15336. IEEE, 2023. 1, 2, 7

  65. [65]

    Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory

    Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, and Jenq-Neng Hwang. Samurai: Adapting segment anything model for zero-shot visual tracking with motion-aware memory, 2024. 2

  66. [66]

    Decoupling features in hierarchical propagation for video object segmentation

    Zongxin Yang and Yi Yang. Decoupling features in hierarchical propagation for video object segmentation. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. 7

  67. [67]

    Associating objects with transformers for video object segmentation

    Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021. 7

  68. [68]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision, pages 341–357. Springer, 2022. 7

  69. [69]

    Unifiedtt: Visual tracking with unified transformer

    Peng Yu, Zhuolei Duan, Sujie Guan, Min Li, and Shaobo Deng. Unifiedtt: Visual tracking with unified transformer. Journal of Visual Communication and Image Representation, 99:104067, 2024. 1

  70. [70]

    Jointformer: A unified framework with joint modeling for video object segmentation

    Jiaming Zhang, Yutao Cui, Gangshan Wu, and Limin Wang. Jointformer: A unified framework with joint modeling for video object segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 47(7):6039–6054, 2025. 7

  71. [71]

    Pointodyssey: A large-scale synthetic dataset for long-term point tracking

    Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19855–19865, 2023. 1, 4, 6, 7

  72. [72]

    Distance-iou loss: Faster and better learning for bounding box regression

    Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-IoU loss: Faster and better learning for bounding box regression. In The AAAI Conference on Artificial Intelligence (AAAI), pages 12993–13000,

  73. [73]

    Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks

    Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai. Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16804–16815, 2022. 1, 2

  74. [74]

    Model Details. 8.1. Model Architecture. Task-Specific Prompt. To unify the different inputs for each task without modifying the structure of the original Prompt Encoder, we provide a task-specific prompt for each task, which gives an accurate and efficient representation of the target state of that task. The design of the task-specific prompt for ...

  75. [75]

    (2) Diversity: It encompasses a wide variety of scenes, sources, and tracked object categories, providing a rich and representative sample of real-world scenarios

    Data Details. The key features of this dataset are as follows: (1) High Resolution: The dataset consists of high-resolution videos, ensuring that fine details are preserved and enabling more accurate analysis. (2) Diversity: It encompasses a wide variety of scenes, sources, and tracked object categories, providing a rich and representative sample of re...

  76. [76]

    Video Selection. We downloaded a large number of videos from YouTube and instructed the annotators to select videos and objects that meet the above requirements

  77. [77]

    Coarse Annotation. Annotators mark key points and tight bounding boxes on target objects

  78. [78]

    Then, annotators refine these masks with the following requirements: • Only annotate the visible parts of the present object

    Fine Annotation. To reduce annotator workload and improve efficiency, we use SAM [37] to generate rough masks based on the coarse annotations (points and boxes). Then, annotators refine these masks with the following requirements: • Only annotate the visible parts of the present object. • In cases of motion blur, infer the approximate position based on t...

  79. [79]

    Final Completion. Experts perform a final review to thoroughly assess the accuracy and consistency of all three types of annotations, ensuring that the labeling meets the required standards and that any discrepancies are identified and corrected. 9.3. Data engine. To increase the size of the dataset while reducing the workload, we adopted a selective anno...

  80. [80]

    Performance Comparison. Evaluation metrics. In the video object segmentation task, we use the standard metrics [47] in most benchmarks: Jaccard index J, contour accuracy F, and their average J&F

    Additional Experiments. 10.1. Performance Comparison. Evaluation metrics. In the video object segmentation task, we use the standard metrics [47] in most benchmarks: Jaccard index J, contour accuracy F, and their average J&F. In the YouTubeVOS benchmark, J and F are computed for "seen" and "unseen" categories separately. G is the averaged J&F for both seen and unseen classes. In...
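
    The segmentation metrics named above can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names (`jaccard`, `j_and_f`, `youtube_vos_g`) are hypothetical, the contour accuracy F is taken as an already-computed input because its boundary-matching procedure is not described in the visible text, and scoring two empty masks as 1.0 is an assumed convention.

```python
import numpy as np


def jaccard(pred, gt):
    """Jaccard index J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j, f):
    """J&F: the plain average of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


def youtube_vos_g(j_seen, f_seen, j_unseen, f_unseen):
    """G: J&F averaged over the seen and unseen category groups."""
    return (j_and_f(j_seen, f_seen) + j_and_f(j_unseen, f_unseen)) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(jaccard(pred, gt))
    print(youtube_vos_g(0.8, 0.6, 0.7, 0.5))
```

    Averaging the two per-group J&F values is the same as taking the mean of the four numbers, so G weights seen and unseen categories equally regardless of how many objects fall in each group.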

Showing first 80 references.