RELO: Reinforcement Learning to Localize for Visual Object Tracking

Chuanyu Sun; Dong Wang; Houwen Peng; Huchuan Lu; Jiao Xu; Kede Ma; Xin Chen

arxiv: 2605.07379 · v2 · pith:LHFLFPOVnew · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 23:11 UTC · model grok-4.3

keywords: visual object tracking · reinforcement learning · localization policy · Markov decision process · temporal token propagation · IoU · AUC

The pith

RELO replaces handcrafted spatial priors with a reinforcement learning policy that optimizes directly for IoU and AUC in visual object tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that handcrafted spatial priors such as heatmaps give visual object trackers only surrogate supervision, poorly aligned with the metrics used to evaluate them, namely IoU and AUC. It formulates target localization as a Markov decision process and trains a policy over spatial positions with reinforcement learning, using rewards that combine frame-level IoU with sequence-level AUC. This setup is meant to produce localization decisions that match tracking optimization and evaluation more closely than fixed priors do. The approach also includes layer-aligned temporal token propagation to keep semantics consistent across frames at low extra cost. If the argument holds, reward-driven localization is a practical alternative that improves tracking accuracy on standard benchmarks.

Core claim

RELO formulates target localization as a Markov decision process and learns a localization policy over spatial positions via reinforcement learning with rewards that combine frame-level IoU and sequence-level AUC. It replaces handcrafted spatial priors and adds layer-aligned temporal token propagation to improve semantic consistency across frames with negligible overhead, attaining 57.5 percent AUC on LaSOT_ext without template updates.
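To make the two reward ingredients concrete, here is a minimal Python sketch of frame-level IoU and a success-curve AUC. It assumes boxes in (x1, y1, x2, y2) form, 21 evenly spaced overlap thresholds, and a strict "greater than" success test; these conventions are common in tracking toolkits but are assumptions here, not details taken from the paper. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def success_auc(ious, num_thresholds=21):
    """Area under the success curve, approximated as the mean success rate
    over evenly spaced overlap thresholds in [0, 1] (assumed convention)."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


# Two frames of a toy sequence: one perfect box, one partial overlap.
pred = [(0, 0, 2, 2), (0, 0, 2, 2)]
truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
frame_ious = [iou(p, g) for p, g in zip(pred, truth)]
sequence_auc = success_auc(frame_ious)
```

The per-frame values give dense feedback, while the AUC summarizes the whole trajectory; how the paper weights the two is not stated in this summary.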

What carries the argument

A reinforcement-learning localization policy that chooses actions over spatial positions, guided by rewards integrating frame-level IoU and sequence-level AUC, together with layer-aligned temporal token propagation for cross-frame semantic consistency.
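As an illustration of what "a policy over spatial positions" can mean, the sketch below treats each cell of a score grid as one action, samples a cell from a softmax over the scores, and computes a REINFORCE-style gradient on the scores. This is a plain policy-gradient step chosen for brevity, not necessarily the update rule the paper uses; the grid size, the advantage value, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_position(logits, grid_w, rng):
    """Sample a grid cell from the policy; returns (flat index, (row, col))."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, divmod(idx, grid_w)


def reinforce_grad(logits, action, advantage):
    """Gradient of the loss -advantage * log pi(action) w.r.t. the logits."""
    probs = softmax(logits)
    return [advantage * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]


# A 2x2 grid with uniform scores; reward the sampled cell and take one step.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
action, (row, col) = sample_position(logits, grid_w=2, rng=rng)
grad = reinforce_grad(logits, action, advantage=1.0)
updated = [l - 0.5 * g for l, g in zip(logits, grad)]
```

With a positive advantage, the descent step raises the probability of the sampled cell and lowers the others, which is the mechanism that lets a metric-based reward shape the localization map.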

If this is right

Localization decisions become directly optimized for the metrics that define tracking success rather than surrogate heatmap supervision.
Trackers can achieve competitive or superior accuracy on multiple datasets, including 57.5 percent AUC on LaSOT_ext, without template updates.
Layer-aligned temporal token propagation adds frame-to-frame consistency at negligible extra compute.
Reward-driven localization offers a general alternative to prior-driven methods across visual tracking pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same MDP-plus-RL framing could replace manual priors in related sequential tasks such as video object segmentation.
Training the policy on longer or more diverse sequences would test whether stability holds beyond the reported benchmarks.
End-to-end trackers might eventually eliminate all hand-designed spatial priors by extending this reward-driven approach.

Load-bearing premise

A reinforcement-learning policy trained on combined IoU and AUC rewards will align better with tracking optimization and evaluation than handcrafted priors, and the temporal token propagation will maintain semantic consistency without instability or overfitting.

What would settle it

Standard tracking benchmark results in which RELO fails to match or exceed the AUC or IoU of conventional trackers that still rely on handcrafted spatial priors.

Figures

Figures reproduced from arXiv: 2605.07379 by Chuanyu Sun, Dong Wang, Houwen Peng, Huchuan Lu, Jiao Xu, Kede Ma, Xin Chen.

**Figure 1.** Comparison of target localization learning paradigms in visual object tracking. (a) Conventional trackers learn target localization from handcrafted spatial priors. (b) RELO replaces handcrafted priors with an RL-based localization policy, optimized using task-oriented rewards. Red arrows indicate the signals used to optimize the localization module.

**Figure 2.** System diagram of RELO. (a) Policy optimization over a video clip. Only the first and last frames are shown for simplicity. For each frame, the Transformer-based encoder extracts visual features, and the tracking heads predict candidate bounding boxes. Task-oriented tracking metrics, including frame-level IoU and sequence-level AUC, are used as reward signals to optimize target localization. Layer-aligned …

**Figure 3.** Comparison of temporal token propagation strategies. (a) Deep-to-shallow propagation transfers temporal tokens from the deep layers of frame t − 1 to the shallow layers of frame t, introducing a semantic mismatch across layers. (b) Our layer-aligned propagation passes temporal tokens from each layer of frame t − 1 to the corresponding layer of frame t, preserving semantic consistency during cross-frame i…

**Figure 4.** Effect of the sequence length T during training. AUC on LaSOT and LaSOT_ext improves as T increases and saturates beyond T ≈ 8, supporting the default choice of T = 8.

**Figure 5.** Qualitative comparison between the reward-driven RELO and the prior-driven tracker in a failure-and-recovery scenario. Frame indices are shown in the upper-left corner. For each frame, the lower-right inset visualizes the localization score map over the search region using the viridis colormap, where yellow indicates high confidence and dark blue indicates low confidence. Green and orange boxes denote the …

**Figure 6.** Additional qualitative comparisons between the reward-driven RELO and the prior-driven tracker across challenging scenarios.

**Figure 7.** Attribute-wise AUC comparison on LaSOT. For each attribute, the two numbers in parentheses indicate the minimum and maximum AUC values achieved by the compared trackers, respectively.
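The two propagation strategies contrasted in the Figure 3 caption can be sketched with a toy encoder. In the sketch below each "layer" is a stand-in callable that accepts an optional temporal token from the previous frame and emits a new one; the deep-to-shallow variant is modelled, as an assumption, by feeding the previous frame's last-layer token into the first layer only. All names are hypothetical and no real Transformer is involved.

```python
def encode_frame(features, layers, prev_tokens=None, layer_aligned=True):
    """Run a stack of layers on one frame.

    prev_tokens holds one temporal token per layer from the previous frame.
    Returns the final features and this frame's per-layer tokens.
    """
    tokens_out = []
    x = features
    for depth, layer in enumerate(layers):
        if prev_tokens is None:
            token_in = None                      # first frame: nothing to propagate
        elif layer_aligned:
            token_in = prev_tokens[depth]        # same depth in frame t-1
        else:
            # Assumed deep-to-shallow variant: deepest token enters layer 0 only.
            token_in = prev_tokens[-1] if depth == 0 else None
        x, token_out = layer(x, token_in)
        tokens_out.append(token_out)
    return x, tokens_out


def make_toy_layer(depth, log):
    """Toy layer that records which depth its incoming token came from."""
    def layer(x, token_in):
        source = None if token_in is None else token_in["depth"]
        log.append((depth, source))
        return x, {"depth": depth}
    return layer


log = []
layers = [make_toy_layer(d, log) for d in range(3)]
_, tokens_t0 = encode_frame("frame0", layers)

log.clear()
encode_frame("frame1", layers, tokens_t0, layer_aligned=True)
aligned_routing = list(log)

log.clear()
encode_frame("frame1", layers, tokens_t0, layer_aligned=False)
deep_to_shallow_routing = list(log)
```

In the aligned case every layer receives a token produced at its own depth, which is the property the caption credits with preserving semantic consistency; the routing itself adds no extra layers, consistent with the claim of negligible overhead.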

Original abstract

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOT_ext without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RELO, which formulates visual object tracking localization as a Markov decision process. It replaces handcrafted spatial priors (e.g., heatmaps) with a reinforcement-learned policy over spatial positions, using rewards that combine frame-level IoU and sequence-level AUC. An additional layer-aligned temporal token propagation module is proposed to maintain semantic consistency across frames. The method reports 57.5% AUC on LaSOT_ext without template updates and claims that reward-driven localization is an effective alternative to prior-driven approaches.

Significance. If the central attribution holds, the work offers a direct optimization path for localization that aligns training rewards with standard tracking metrics (IoU, AUC), potentially reducing reliance on surrogate handcrafted priors. The negligible-overhead temporal propagation is a secondary but practical contribution. Reproducible code or machine-checked elements are not mentioned.

major comments (2)

[Experiments / Ablation studies] Experiments section (results and ablations): The headline 57.5% AUC on LaSOT_ext is reported only for the joint system (RL localization + temporal token propagation). No ablation is shown that retains the RL policy and reward formulation while disabling temporal propagation and retraining; this prevents isolating whether the performance gain is driven by reward-driven localization or by improved cross-frame consistency, directly undermining the claim that “reward-driven localization provides an effective alternative.”
[Method / Reward design] §3.2 (reward formulation): Sequence-level AUC is used as a reward component, yet the paper does not detail how this non-differentiable, sequence-wide metric is computed or approximated inside the per-frame MDP episodes during training; without this, it is unclear whether the reported gains reflect genuine policy improvement or post-hoc metric alignment.

minor comments (2)

[Method] Notation: The definition of the action space over spatial positions should be cross-referenced to the exact grid resolution or sampling strategy used in the MDP.
[Figure 2] Figure clarity: The diagram illustrating layer-aligned temporal token propagation would benefit from explicit arrows showing token flow between frames and layers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our results and method.

Point-by-point responses

Referee: [Experiments / Ablation studies] Experiments section (results and ablations): The headline 57.5% AUC on LaSOT_ext is reported only for the joint system (RL localization + temporal token propagation). No ablation is shown that retains the RL policy and reward formulation while disabling temporal propagation and retraining; this prevents isolating whether the performance gain is driven by reward-driven localization or by improved cross-frame consistency, directly undermining the claim that “reward-driven localization provides an effective alternative.”

Authors: We agree that an explicit ablation isolating the RL localization policy (with its IoU+AUC reward) from the temporal token propagation module would more clearly support our central claim. The current experiments report results for the integrated system because the propagation module was introduced to maintain semantic consistency for the policy across frames. In the revised manuscript we will add a new ablation that disables temporal propagation, retrains the RL policy on the same reward formulation, and reports the resulting AUC on LaSOT_ext to quantify the contribution of reward-driven localization independently. revision: yes
Referee: [Method / Reward design] §3.2 (reward formulation): Sequence-level AUC is used as a reward component, yet the paper does not detail how this non-differentiable, sequence-wide metric is computed or approximated inside the per-frame MDP episodes during training; without this, it is unclear whether the reported gains reflect genuine policy improvement or post-hoc metric alignment.

Authors: We clarify that each training episode processes an entire video sequence: the policy selects actions frame by frame, producing a trajectory of bounding boxes. At the end of the episode the sequence-level AUC is computed from the success curve of the full trajectory using the standard definition. This scalar reward is then used to update the policy via the reinforcement learning objective. We acknowledge that the original manuscript did not provide this level of detail on the timing and scope of the AUC computation. We will expand §3.2 with a precise description of the episode structure and reward assignment procedure. revision: yes
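The episode structure described in this reply can be sketched as a reward-to-go computation. The version below assumes, for illustration only, that per-frame IoU is paid at every step and that the sequence-level AUC is added as a terminal bonus on the last frame; the weight `auc_weight`, the discount `gamma`, and the function name are hypothetical and not taken from the paper.

```python
def episode_returns(frame_ious, sequence_auc, gamma=1.0, auc_weight=1.0):
    """Reward-to-go for each frame of one episode.

    Per-frame reward is the frame IoU; the sequence-level AUC is added to the
    final frame only (assumed placement), then returns are accumulated backwards.
    """
    if not frame_ious:
        return []
    rewards = list(frame_ious)
    rewards[-1] += auc_weight * sequence_auc
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Two-frame toy episode: IoU 0.5 on both frames, sequence AUC 0.4.
toy_returns = episode_returns([0.5, 0.5], sequence_auc=0.4)
```

Because the AUC only exists once the full trajectory is known, every earlier action receives it through the return rather than through a per-frame gradient, which is why the metric's non-differentiability is not an obstacle in this setup.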

Circularity Check

0 steps flagged

No circularity: RL policy trained on external IoU/AUC rewards with independent temporal module

Full rationale

The paper formulates localization as an MDP and trains a policy via reinforcement learning using rewards explicitly defined from standard external metrics (frame-level IoU and sequence-level AUC). These are not internal model quantities or fitted parameters renamed as predictions. The added layer-aligned temporal token propagation is presented as a separate, low-overhead module. No equations, self-citations, or uniqueness theorems are invoked in the abstract or described derivation that reduce the claimed superiority to a tautology or self-referential fit. Performance is reported on external benchmarks, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

From the abstract alone the central claim rests on the domain assumption that localization admits an MDP formulation whose optimal policy can be discovered by RL using metric-based rewards. No explicit free parameters or invented entities are named; any implicit weighting between IoU and AUC terms would constitute an unstated hyperparameter.

axioms (1)

domain assumption: Target localization in tracking can be modeled as a Markov decision process whose actions are spatial position selections.
Stated explicitly in the abstract, which says the method formulates target localization as a Markov decision process.
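One way to write this assumption down, purely as an illustrative formalization and not as the paper's notation, is the following. The grid dimensions H and W, the predicted and ground-truth boxes, the discount γ, and the weight λ between the IoU and AUC terms are all symbols introduced here for the sketch; λ is exactly the kind of implicit hyperparameter the ledger mentions.

```latex
\begin{aligned}
\mathcal{M} &= (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
\mathcal{A} = \{1,\dots,H\} \times \{1,\dots,W\}, \\
a_t &\sim \pi_\theta(\cdot \mid s_t), \qquad
r_t = \mathrm{IoU}(\hat{b}_t, b_t), \\
J(\theta) &= \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=1}^{T} r_t
  + \lambda \, \mathrm{AUC}\big(\hat{b}_{1:T}, b_{1:T}\big) \right].
\end{aligned}
```

Here each action picks one spatial position, the per-frame reward is the overlap of the predicted box with the ground truth, and the objective adds a single sequence-level term over the T frames of the clip.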

pith-pipeline@v0.9.0 · 5701 in / 1325 out tokens · 61919 ms · 2026-05-20T23:11:36.225351+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "the sequence length is set to T = 8. RELO-B256 and RELO-L256 use two templates"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

Paper passage: "replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 18 internal anchors

[1] Sequence to sequence learning with neural networks.
[2] Zhang, Haoyang; Wang, Ying; Dayoub, Feras; Sunderhauf, Niko. CVPR.
[3] Cho, Kyunghyun; van Merrienboer, Bart; G… Learning Phrase Representations using … EMNLP.
[4] Neural machine translation by jointly learning to align and translate.
[5] Masked autoencoders are scalable vision learners.
[6] Improving language understanding by generative pre-training.
[7] Language models are unsupervised multitask learners.
[8] Language models are few-shot learners.
[9] Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J; and others. Exploring the limits of transfer learning with a unified text-to-text …
[10] Transformer Tracking.
[11] Chen, Xin; Peng, Houwen; Wang, Dong; Lu, Huchuan; Hu, Han. CVPR.
[12] Universal Instance Perception as Object Discovery and Retrieval.
[13] Chen, Xin; Yan, Bin; Zhu, Jiawen; Wang, Dong; Lu, Huchuan. High-Performance … TPAMI.
[14] Correlation-Aware Deep Tracking.
[15] Motion-guided Visual Tracking. MIR, 2025.
[16] Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao; Samangooei, Sina; Monteiro, Marianne; Menick, Jacob L; Borgeaud, S... Flamingo: a Visual Language Model for Few-Shot Learning.
[17] Lu, Jiasen; Clark, Christopher; Zellers, Rowan; Mottaghi, Roozbeh; Kembhavi, Aniruddha. Unified-…
[18] Wang, Peng; Yang, An; Men, Rui; Lin, Junyang; Bai, Shuai; Li, Zhikang; Ma, Jianxin; Zhou, Chang; Zhou, Jingren; Yang, Hongxia.
[19] Language Models are General-Purpose Interfaces. arXiv preprint arXiv:2206.06336.
[20] A Generalist Agent. arXiv preprint arXiv:2205.06175.
[21] Chen, Ting; Saxena, Saurabh; Li, Lala; Lin, Tsung-Yi; Fleet, David J; Hinton, Geoffrey E. A Unified Sequence Interface for Vision Tasks.
[22] Pix2seq: A Language Modeling Framework for Object Detection.
[23] Learn to match: Automatic matching network design for visual tracking.
[24] Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark.
[25] Generation and comprehension of unambiguous object descriptions.
[26] Discriminative correlation filter with channel and spatial reliability.
[27] Kristan, Matej; Leonardis, Ale… The Eighth Visual Object Tracking …
[28] The Tenth Visual Object Tracking VOT2022 Challenge Results. ECCVW, 2023.
[29] The ninth visual object tracking vot2021 challenge results. ICCVW.
[30] Dynamical hyperparameter optimization via deep reinforcement learning in tracking.
[31] Bai, Yifan; Zhao, Zeyang; Gong, Yihong; Wei, Xing. CVPR.
[32] Autoregressive Visual Tracking.
[33] Zheng, Yaozong; Zhong, Bineng; Liang, Qihua; Mo, Zhiyi; Zhang, Shengping; Li, Xianxian. O… AAAI.
[34] Target-Aware Tracking with Long-Term Context Attention.
[35] Song, Zikai; Luo, Run; Yu, Junqing; Chen, Yi-Ping Phoebe; Yang, Wei. Compact… AAAI.
[36] Xie, Fei; Chu, Lei; Li, Jiahao; Lu, Yan; Ma, Chao. Video… CVPR.
[37] Learning target-aware representation for visual tracking via informative interactions.
[38] Lin, Liting; Fan, Heng; Zhang, Zhipeng; Wang, Yaowei; Xu, Yong; Ling, Haibin. Tracking Meets … ECCV.
[39] Tian, Yunjie; Xie, Lingxi; Qiu, Jihao; Jiao, Jianbin; Wang, Yaowei; Tian, Qi; Ye, Qixiang. TPAMI.
[40] Zhang, Xiaosong; Tian, Yunjie; Xie, Lingxi; Huang, Wei; Dai, Qi; Ye, Qixiang; Tian, Qi. Hi… ICLR.
[41] Xie, Jinxia; Zhong, Bineng; Mo, Zhiyi; Zhang, Shengping; Shi, Liangtao; Song, Shuxiang; Ji, Rongrong. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal … CVPR.
[42] Explicit Visual Prompts for Visual Object Tracking.
[43] Peng, Liang; Gao, Junyuan; Liu, Xinran; Li, Weihong; Dong, Shaohua; Zhang, Zhipeng; Fan, Heng; Zhang, Libo. NIPS.
[44] Law, Hei; Deng, Jia. Corner… ECCV.
[45] Li, Xin; Huang, Yuqing; He, Zhenyu; Wang, Yaowei; Lu, Huchuan; Yang, Ming-Hsuan. Cite… ICCV.
[46] Wu, Qiangqiang; Yang, Tianyu; Liu, Ziquan; Wu, Baoyuan; Shan, Ying; Chan, Antoni B. Drop… CVPR.
[47] Robust Object Modeling for Visual Tracking.
[48] Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework.
[49] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[50] Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Bi, Xiao; Zhang, Haowei; Zhang, Mingchuan; Li, YK; Wu, Yang; and others.
[51] Less is more: Token context-aware learning for object tracking.
[52] Li, Xiaohai; Zhong, Bineng; Liang, Qihua; Li, Guorong; Mo, Zhiyi; Song, Shuxiang. AAAI.
[53] Cai, Wenrui; Liu, Qingjie; Wang, Yunhong. CVPR.
[54] Autoregressive Sequential Pretraining for Visual Tracking.
[55] End-to-end active object tracking via reinforcement learning.
[56] Correlation filter selection for visual tracking using reinforcement learning. 2018.
[57] Fast template matching and update for video object tracking and segmentation.
[58] Learning policies for adaptive tracking with deep feature cascades.
[59] Siamese regression tracking with reinforced template updating.
[60] Mamba: Linear-time sequence modeling with selective state spaces.
[61] Exploring enhanced contextual information for video-level object tracking.
[62] Online decision based visual tracking via reinforcement learning.
[63] Action-decision networks for visual tracking with deep reinforcement learning.
[64] Real-time 'actor-critic' tracking.
[65] Deep reinforcement learning with iterative shift for visual tracking.
[66] Training language models to follow instructions with human feedback.
[67] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
[68] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954.
[69] Oquab, Maxime; Darcet, Timoth… TMLR.
[70] Chen, Xin; Kang, Ben; Geng, Wanting; Zhu, Jiawen; Liu, Yi; Wang, Dong; Lu, Huchuan. AAAI.
[71] Gao, Shenyuan; Zhou, Chunluan; Ma, Chao; Wang, Xinggang; Yuan, Junsong. ECCV.
[72] Towards Grand Unification of Object Tracking.
[73] Towards Sequence-Level Training for Visual Tracking.
[74] Robust Visual Tracking by Segmentation.
[75] Gao, Shenyuan; Zhou, Chunluan; Zhang, Jun. Generalized Relation Modeling for … CVPR.
[76] Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking.
[77] Xie, Fei; Wang, Chunyu; Wang, Guangting; Yang, Wankou; Zeng, Wenjun. Learning tracking representations via dual-branch fully …
[78] Transforming Model Prediction for Tracking.
[79] Transformer Tracking with Cyclic Shifting Window Attention.
[80] Cui, Yutao; Jiang, Cheng; Wang, Limin; Wu, Gangshan. CVPR.

Showing first 80 references.

[1] [1]

Sequence to sequence learning with neural networks , author=

work page

[2] [2]

Zhang, Haoyang and Wang, Ying and Dayoub, Feras and Sunderhauf, Niko , booktitle=CVPR, pages=

work page

[3] [3]

Learning Phrase Representations using

Cho, Kyunghyun and van Merrienboer, Bart and G. Learning Phrase Representations using. EMNLP , pages=

work page

[4] [4]

Neural machine translation by jointly learning to align and translate , author=

work page

[5] [5]

Masked autoencoders are scalable vision learners , author=

work page

[6] [6]

Improving language understanding by generative pre-training , author=

work page

[7] [7]

Language models are unsupervised multitask learners , author=

work page

[8] [8]

Language models are few-shot learners , author=

work page

[9] [9]

Exploring the limits of transfer learning with a unified text-to-text

Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J and others , journal=. Exploring the limits of transfer learning with a unified text-to-text

work page

[10] [10]

Transformer Tracking , author=

work page

[11] [11]

Chen, Xin and Peng, Houwen and Wang, Dong and Lu, Huchuan and Hu, Han , booktitle=CVPR, pages=

work page

[12] [12]

Universal Instance Perception as Object Discovery and Retrieval , author=

work page

[13] [13]

High-Performance

Chen, Xin and Yan, Bin and Zhu, Jiawen and Wang, Dong and Lu, Huchuan , journal=TPAMI, volume=. High-Performance

work page

[14] [14]

Correlation-Aware Deep Tracking , author=

work page

[15] [15]

MIR , volume=

Motion-guided Visual Tracking , author=. MIR , volume=. 2025 , publisher=

work page 2025

[16] [16]

Flamingo: a Visual Language Model for Few-Shot Learning , year =

Alayrac, Jean-Baptiste and Donahue, Jeff and Luc, Pauline and Miech, Antoine and Barr, Iain and Hasson, Yana and Lenc, Karel and Mensch, Arthur and Millican, Katherine and Reynolds, Malcolm and Ring, Roman and Rutherford, Eliza and Cabi, Serkan and Han, Tengda and Gong, Zhitao and Samangooei, Sina and Monteiro, Marianne and Menick, Jacob L and Borgeaud, S...

work page

[17] [17]

Unified-

Lu, Jiasen and Clark, Christopher and Zellers, Rowan and Mottaghi, Roozbeh and Kembhavi, Aniruddha , journal=. Unified-

work page

[18] [18]

Wang, Peng and Yang, An and Men, Rui and Lin, Junyang and Bai, Shuai and Li, Zhikang and Ma, Jianxin and Zhou, Chang and Zhou, Jingren and Yang, Hongxia , booktitle=

work page

[19] [19]

arXiv preprint arXiv:2206.06336 , year=

Language Models are General-Purpose Interfaces , author=. arXiv preprint arXiv:2206.06336 , year=

work page arXiv

[20] [20]

A Generalist Agent

A generalist agent , author=. arXiv preprint arXiv:2205.06175 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

A Unified Sequence Interface for Vision Tasks , year =

Chen, Ting and Saxena, Saurabh and Li, Lala and Lin, Tsung-Yi and Fleet, David J and Hinton, Geoffrey E , booktitle =. A Unified Sequence Interface for Vision Tasks , year =

work page

[22] [22]

Pix2seq: A Language Modeling Framework for Object Detection , author=

work page

[23] [23]

Learn to match: Automatic matching network design for visual tracking , author=

work page

[24] [24]

Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark , author=

work page

[25] [25]

Generation and comprehension of unambiguous object descriptions , author=

work page

[26] [26]

Discriminative correlation filter with channel and spatial reliability , author=

work page

[27] [27]

The Eighth Visual Object Tracking

Kristan, Matej and Leonardis, Ale. The Eighth Visual Object Tracking

work page

[28] [28]

ECCVW , pages=

The Tenth Visual Object Tracking VOT2022 Challenge Results , author=. ECCVW , pages=. 2023 , organization=

work page 2023

[29] [29]

ICCVW , pages=

The ninth visual object tracking vot2021 challenge results , author=. ICCVW , pages=

work page

[30] [30]

Dynamical hyperparameter optimization via deep reinforcement learning in tracking , author=

work page

[31] [31]

Bai, Yifan and Zhao, Zeyang and Gong, Yihong and Wei, Xing , booktitle=CVPR, pages=

work page

[32] [32]

Autoregressive Visual Tracking , author=

work page

[33] [33]

Zheng, Yaozong and Zhong, Bineng and Liang, Qihua and Mo, Zhiyi and Zhang, Shengping and Li, Xianxian , booktitle=AAAI, pages=. O

work page

[34] [34]

Target-Aware Tracking with Long-Term Context Attention , author=

work page

[35] [35]

Song, Zikai and Luo, Run and Yu, Junqing and Chen, Yi-Ping Phoebe and Yang, Wei , booktitle=AAAI, pages=. Compact

work page

[36] [36]

Xie, Fei and Chu, Lei and Li, Jiahao and Lu, Yan and Ma, Chao , booktitle=CVPR, pages=. Video

work page

[37] [37]

Learning target-aware representation for visual tracking via informative interactions , author=

work page

[38] [38]

Tracking Meets

Lin, Liting and Fan, Heng and Zhang, Zhipeng and Wang, Yaowei and Xu, Yong and Ling, Haibin , booktitle=ECCV, pages=. Tracking Meets

work page

[39] [39]

Tian, Yunjie and Xie, Lingxi and Qiu, Jihao and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang , journal=TPAMI, volume=

work page

[40] [40]

Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi , booktitle=ICLR, year=. Hi

work page

[41] Xie, Jinxia and Zhong, Bineng and Mo, Zhiyi and Zhang, Shengping and Shi, Liangtao and Song, Shuxiang and Ji, Rongrong. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal… CVPR.

[42] Explicit Visual Prompts for Visual Object Tracking.

[43] Peng, Liang and Gao, Junyuan and Liu, Xinran and Li, Weihong and Dong, Shaohua and Zhang, Zhipeng and Fan, Heng and Zhang, Libo. NIPS.

[44] Law, Hei and Deng, Jia. Corner… ECCV.

[45] Li, Xin and Huang, Yuqing and He, Zhenyu and Wang, Yaowei and Lu, Huchuan and Yang, Ming-Hsuan. Cite… ICCV.

[46] Wu, Qiangqiang and Yang, Tianyu and Liu, Ziquan and Wu, Baoyuan and Shan, Ying and Chan, Antoni B. Drop… CVPR.

[47] Robust Object Modeling for Visual Tracking.

[48] Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework.

[49] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

[50] Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others.

[51] Less is more: Token context-aware learning for object tracking.

[52] Li, Xiaohai and Zhong, Bineng and Liang, Qihua and Li, Guorong and Mo, Zhiyi and Song, Shuxiang. AAAI.

[53] Cai, Wenrui and Liu, Qingjie and Wang, Yunhong. CVPR.

[54] Autoregressive Sequential Pretraining for Visual Tracking.

[55] End-to-end active object tracking via reinforcement learning.

[56] Correlation filter selection for visual tracking using reinforcement learning. 2018.

[57] Fast template matching and update for video object tracking and segmentation.

[58] Learning policies for adaptive tracking with deep feature cascades.

[59] Siamese regression tracking with reinforced template updating.

[60] Mamba: Linear-time sequence modeling with selective state spaces.

[61] Exploring enhanced contextual information for video-level object tracking.

[62] Online decision based visual tracking via reinforcement learning.

[63] Action-decision networks for visual tracking with deep reinforcement learning.

[64] Real-time 'actor-critic' tracking.

[65] Deep reinforcement learning with iterative shift for visual tracking.

[66] Training language models to follow instructions with human feedback.

[67] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

[68] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954.

[69] Oquab, Maxime and Darcet, Timoth… TMLR.

[70] Chen, Xin and Kang, Ben and Geng, Wanting and Zhu, Jiawen and Liu, Yi and Wang, Dong and Lu, Huchuan. AAAI.

[71] Gao, Shenyuan and Zhou, Chunluan and Ma, Chao and Wang, Xinggang and Yuan, Junsong. ECCV.

[72] Towards Grand Unification of Object Tracking.

[73] Towards Sequence-Level Training for Visual Tracking.

[74] Robust Visual Tracking by Segmentation.

[75] Gao, Shenyuan and Zhou, Chunluan and Zhang, Jun. Generalized Relation Modeling for… CVPR.

[76] Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking.

[77] Xie, Fei and Wang, Chunyu and Wang, Guangting and Yang, Wankou and Zeng, Wenjun. Learning tracking representations via dual-branch fully…

[78] Transforming Model Prediction for Tracking.

[79] Transformer Tracking with Cyclic Shifting Window Attention.

[80] Cui, Yutao and Jiang, Cheng and Wang, Limin and Wu, Gangshan. CVPR.