
arxiv: 2605.17270 · v1 · pith:2DTVGZ4Snew · submitted 2026-05-17 · 💻 cs.CV

Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

Pith reviewed 2026-05-20 14:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene text tracking · video object tracking · detection-free framework · dual-branch design · text in videos · geometric distortion · visual ambiguity

The pith

SymTrack provides a detection-free dual-branch framework that tracks scene text in videos despite geometric distortions, visual ambiguity, and structural sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to establish scene text tracking as a specific task separate from general object tracking, because current trackers struggle with text due to perspective changes, instance similarities, and detail sensitivity. It introduces SymTrack, a framework that uses cross-expert calibration, token rectification, and adaptive inference to handle these issues in a detection-free manner. A sympathetic reader would care because accurate text tracking in videos opens up possibilities for dynamic text editing, removal, and segmentation in media. The authors also create benchmarks from existing datasets to evaluate such trackers. Their experiments show substantial improvements over prior methods.

Core claim

The central claim is that a unified detection-free framework with synergistic dual-branch design can solve scene text tracking by integrating Cross-Expert Calibration to reduce semantic bias, Predictive Token Rectification to correct structural imbalances, and an Adaptive Inference Engine to stabilize predictions, achieving new state-of-the-art performance on constructed benchmarks.

What carries the argument

Synergistic dual-branch design with Cross-Expert Calibration, Predictive Token Rectification, and Adaptive Inference Engine.
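
Of the three components, the Adaptive Inference Engine is the simplest to illustrate in isolation. The Figure 3 caption describes it only as using confidence for adaptive search plus a module for temporal regularization, and the paper's actual rules are not given here. The sketch below is therefore a hypothetical stand-in, not the authors' method: it assumes a linear widening of the search region as confidence drops and a confidence-weighted blend of box centers. The function names, the `base` and `max_factor` defaults, and the blending rule are all assumptions made for illustration.

```python
from typing import Tuple

Center = Tuple[float, float]  # (cx, cy) of the tracked box, an assumed representation


def _clip01(value: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return min(1.0, max(0.0, value))


def adaptive_search_factor(confidence: float, base: float = 4.0,
                           max_factor: float = 6.0) -> float:
    """Widen the search region linearly as confidence drops from 1 to 0.

    Hypothetical rule: `base` and `max_factor` are illustrative defaults,
    not values taken from the paper.
    """
    c = _clip01(confidence)
    return base + (max_factor - base) * (1.0 - c)


def smooth_center(prev: Center, new: Center, confidence: float) -> Center:
    """Blend the new center with the previous one.

    Hypothetical temporal regularizer: the new prediction is trusted in
    proportion to its confidence, so low-confidence frames move the box less.
    """
    alpha = _clip01(confidence)
    return (alpha * new[0] + (1.0 - alpha) * prev[0],
            alpha * new[1] + (1.0 - alpha) * prev[1])
```

Under these assumptions, a confident frame keeps the search window tight and accepts the new position outright, while an uncertain frame enlarges the window and holds the box near its previous location.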

If this is right

  • Scene text tracking becomes practical for applications like video text editing and manipulation.
  • Three new benchmarks are provided for the task using high-quality annotations from video text spotting datasets.
  • Performance improves by up to 11.97% AUC over previous best trackers on BOVText_SOT.
  • The approach addresses the three identified challenges effectively across multiple benchmarks.
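
The AUC figure quoted above is most plausibly the area under the success plot used in single-object tracking, though the exact evaluation protocol is not stated on this page. As a rough sketch under that assumption, the metric can be computed as the mean success rate over evenly spaced IoU thresholds. The `(x, y, width, height)` box format, the 21-threshold grid, and the strict `>` comparison are assumptions borrowed from common tracking toolkits; scene text is often annotated with quadrilaterals, which this sketch does not handle.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), an assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred: Sequence[Box], gt: Sequence[Box],
                num_thresholds: int = 21) -> float:
    """Mean success rate over evenly spaced IoU thresholds in [0, 1].

    A frame counts as a success at threshold t when its IoU is strictly
    greater than t (an assumed convention).
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)
```

Because of the strict comparison, even a perfect tracker scores 20/21 rather than 1.0 on this grid, which is one reason AUC values from different toolkits are not always directly comparable.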

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Handling text-specific challenges this way might inspire structure-aware designs for tracking other small or detailed objects.
  • Integrating this with text recognition could lead to end-to-end video text understanding systems.
  • Testing on more diverse video conditions could reveal if the motion constraints assumption holds broadly.

Load-bearing premise

The three challenges of geometric distortions, visual ambiguity, and structural sensitivity are the main reasons for poor performance and can be mitigated by the dual-branch mechanisms without introducing biases.

What would settle it

The effectiveness claim would be falsified if SymTrack failed to exceed the AUC of prior trackers on any of the three benchmarks, or under new test conditions with extreme motion.

Figures

Figures reproduced from arXiv: 2605.17270 by Chenmin Yu, Daiqing Wu, Gengluo Li, Liu Yu, Yu Zhou, Zeyu Chen.

Figure 1: Motivation and paradigm comparison. (a) Typical failure cases of scene text tracking. From top to bottom: perspective shifts leading to drastic shape distortion; high visual ambiguity from nearby similar texts; fine-grained sensitivity in densely packed small texts. (b) VTS treats tracking as a by-product, where detection failures fragment text trajectories. (c) We propose a detection-free, unified framew…
Figure 2: Overview of the SymTrack framework, which employs a synergistic dual-branch design. The top branch performs Predictive Token Rectification (PTR). It distills a semantic query $q_{\text{sem}}$ from the template tokens $Z$ to generate a modulation mask $M$. This mask is applied to the main visual feature map $F_x$ to produce a rectified map $\hat{F}_x$, addressing structural imbalances. The parallel branch performs Cross-Expert Calibr…
Figure 3: The structure of AIE, which uses confidence for adaptive search and a module for temporal regularization.

… connection and layer normalization:

$$ X_{\text{fused}} = \mathrm{LayerNorm}(X'_{\text{txt}} + E_{\text{txt}}). \tag{7} $$

Finally, the fused token sequence $X_{\text{fused}}$ is reshaped and passed through a lightweight convolutional head $H_\psi$ to generate a spatial calibration mask $M_{\text{calib}}$ with values in $[0, 1]$:

$$ M_{\text{calib}} = \sigma(H_\psi(\mathrm{Reshape}(X_{\text{fused}}))). \tag{8} $$

This calibration …
Figure 4: Visualization comparison of search-frame attention maps and score maps with and without CEC. Zoom in for a better view.
Figure 5: Visualization comparison of the search region with and without AIE. Zoom in for a better view. (Legend text extracted with the image: Ground Truth, Ours, OSTrack, ODTrack, ARTrack.)
Figure 7: Visual comparison in scenes with relatively fixed-position text (e.g., subtitles, game UI, and overlaid captions) under drastic background changes or heavy motion blur. Generic trackers quickly drift to background objects or similar-looking regions due to the absence of text-specific semantic modeling. In contrast, SymTrack consistently focuses on the target text throughout the sequence, owing to the CEC m…
Figure 8: Visual comparison in highly challenging scenarios involving (a) dense and tiny scene text, (b) severe perspective-induced distortion and partial occlusion, and (c) text attached to fast-moving objects (e.g., on clothing or vehicles). Competing methods either lose the target rapidly or drift to nearby distractors because of structural imbalance and insufficient semantic discrimination. SymTrack maintains accur…
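
Equations (7) and (8), which appear in the text accompanying Figure 3, can be traced step by step in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: the roles of $X'_{\text{txt}}$ and $E_{\text{txt}}$ are not defined in this excerpt, so they are treated simply as two token matrices of equal shape; LayerNorm is applied without learned scale or shift; and the lightweight convolutional head $H_\psi$ is replaced by a hypothetical 1×1 convolution (a per-location dot product with `head_w` plus `head_b`).

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels


def layer_norm(token: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibration_mask(x_txt: Matrix, e_txt: Matrix, height: int, width: int,
                     head_w: List[float], head_b: float = 0.0) -> Matrix:
    """Return a height x width mask with values between 0 and 1."""
    if len(x_txt) != height * width or len(e_txt) != len(x_txt):
        raise ValueError("token count must equal height * width for both inputs")
    # Eq. (7): residual sum followed by layer normalization, per token.
    fused = [layer_norm([a + b for a, b in zip(xr, er)])
             for xr, er in zip(x_txt, e_txt)]
    # Reshape the token sequence into a row-major spatial grid.
    grid = [fused[r * width:(r + 1) * width] for r in range(height)]
    # Eq. (8): head (assumed 1x1 convolution) followed by a sigmoid.
    return [[sigmoid(sum(w * v for w, v in zip(head_w, tok)) + head_b)
             for tok in row] for row in grid]
```

A real head would likely use larger kernels and learned weights, so this sketch only shows how the shapes flow from a token sequence to a spatial mask.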
Abstract

Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets the new state-of-the-art on all three benchmarks, outperforming previous best trackers by up to 11.97\% AUC on $ \text{BOVText}_{\text{SOT}} $. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript formalizes Scene Text Tracking as a distinct task and introduces SymTrack, a detection-free framework with a synergistic dual-branch architecture. It incorporates Cross-Expert Calibration to reduce semantic bias, Predictive Token Rectification to address structural imbalances, and an Adaptive Inference Engine for motion-constrained stability. The authors repurpose three video text spotting datasets into benchmarks with high-quality annotations and report new state-of-the-art results, including an improvement of up to 11.97% AUC on BOVText_SOT over prior trackers.

Significance. If the performance claims are substantiated with rigorous validation, the work would be significant as the first systematic treatment of scene text tracking, a task relevant to video text manipulation applications. The explicit construction of dedicated benchmarks from existing sources is a constructive step toward standardized evaluation and reproducibility in this subdomain.

major comments (2)
  1. [§4] §4 (Experiments): The central SOTA claim (e.g., 11.97% AUC gain on BOVText_SOT) is presented without error bars, standard deviations across multiple runs, or statistical significance tests. This is load-bearing because small implementation variations or benchmark-specific biases could alter the reported ranking relative to prior trackers.
  2. [§3.3] §3.3 (Benchmark Construction): The description of how the three benchmarks were derived from prior video text spotting datasets provides no details on annotation protocol, verification process, or handling of ambiguous instances. This directly affects the reliability of the quantitative results that support the main contribution.
minor comments (2)
  1. [Abstract] Abstract: The notation BOVText_SOT is used without a brief parenthetical definition or reference to its relation to the original BOVText dataset.
  2. [Figure 2] Figure 2: The diagram of the dual-branch architecture would benefit from explicit labeling of the Cross-Expert Calibration and Predictive Token Rectification modules to improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the significance of formalizing Scene Text Tracking as a distinct task. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [§4] §4 (Experiments): The central SOTA claim (e.g., 11.97% AUC gain on BOVText_SOT) is presented without error bars, standard deviations across multiple runs, or statistical significance tests. This is load-bearing because small implementation variations or benchmark-specific biases could alter the reported ranking relative to prior trackers.

    Authors: We agree that statistical validation strengthens the reliability of the SOTA claims. In the revised manuscript, we will rerun the experiments over multiple independent trials with different random seeds, report mean performance with standard deviations, and include statistical significance tests (such as paired t-tests) against the strongest baselines to confirm the reported gains.
    revision: yes

  2. Referee: [§3.3] §3.3 (Benchmark Construction): The description of how the three benchmarks were derived from prior video text spotting datasets provides no details on annotation protocol, verification process, or handling of ambiguous instances. This directly affects the reliability of the quantitative results that support the main contribution.

    Authors: We appreciate this point. Section 3.3 currently offers only a high-level overview of benchmark derivation. We will expand it in the revision to describe the annotation protocol in detail, the multi-annotator verification process used to ensure quality, and the specific guidelines applied to ambiguous cases (e.g., heavy occlusion or extreme perspective distortion).
    revision: yes
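
The statistical checks promised in the first response can be sketched with the Python standard library alone. This is an illustrative outline, not the authors' evaluation code: it assumes each tracker yields one score per run and that runs are paired (for example by random seed or by video sequence). The function names are hypothetical, and no p-value is computed, since that would need a t-distribution table or an external statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return mean(scores), stdev(scores)


def paired_t_statistic(ours: Sequence[float],
                       baseline: Sequence[float]) -> Tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom.

    Runs are matched by position, so ours[i] and baseline[i] must come
    from the same seed or sequence (an assumption about the pairing).
    """
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two matched runs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The returned statistic would then be compared against a critical value for the given degrees of freedom. With only a handful of seeds the test has little power, so reporting the per-run spread alongside it remains important.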

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper is an empirical computer vision contribution that identifies three challenges in scene text tracking, proposes the SymTrack framework with dual-branch mechanisms (Cross-Expert Calibration, Predictive Token Rectification, Adaptive Inference Engine), constructs a benchmark by reusing annotations from three existing video text spotting datasets, and reports SOTA results via standard AUC comparisons. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs or self-definitions. The central SOTA claim rests on external benchmark evaluation rather than internal self-citation chains or ansatz smuggling. This qualifies as self-contained against external benchmarks, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the identified challenges are primary and that the proposed mechanisms address them; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption: Scene text tracking is primarily limited by geometric distortions from perspective shifts, high visual ambiguity across instances, and high sensitivity to fine-grained structural details.
    Explicitly listed as the three primary challenges the framework is designed to solve.

pith-pipeline@v0.9.0 · 5781 in / 1292 out tokens · 56983 ms · 2026-05-20T14:38:27.162407+00:00 · methodology

