Multi-Adapter RGBT Tracking

Aihua Zheng; Andong Lu; Chenglong Li; Jin Tang; Zhengzheng Tu

arxiv: 1907.07485 · v1 · pith:TXU2SG65new · submitted 2019-07-17 · 💻 cs.CV

Chenglong Li, Andong Lu, Aihua Zheng, Zhengzheng Tu, Jin Tang

Pith reviewed 2026-05-24 20:30 UTC · model grok-4.3

keywords: RGBT tracking, multi-adapter network, modality fusion, feature learning, visual tracking, deep convolutional network

The pith

A multi-adapter network jointly learns shared, modality-specific and instance-aware features for RGBT tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MANet, a convolutional architecture that performs three kinds of feature extraction in one end-to-end model for tracking objects with both RGB and thermal images. It argues that existing methods miss shared cues across modalities and instance-specific appearance changes, so the new network adds a generality adapter for common representations, a modality adapter for complementary differences, and an instance adapter for object-specific and temporal properties. A sympathetic reader would expect this joint learning to produce more robust fusion than weighting schemes that treat modalities separately. The design also places the generality and modality adapters in parallel to keep computation low enough for real-time use. Experiments on two RGBT benchmarks are presented as evidence that the combined adapters outperform prior RGB and RGBT trackers.

Core claim

MANet jointly performs modality-shared, modality-specific and instance-aware feature learning through three adapters in an end-to-end trained framework, with a parallel structure between the generality and modality adapters to meet real-time demands.

What carries the argument

The Multi-Adapter convolutional Network (MANet) whose generality adapter extracts shared object representations, modality adapter encodes modality-specific information, and instance adapter models appearance and temporal variations of a tracked object.
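To make the division of labour among the three adapters concrete, here is a minimal sketch, not the authors' implementation. It assumes plain linear maps in place of convolutional layers, a ReLU after the sum of the two parallel branches, and a two-score (target, background) instance head; the class name `MultiAdapterSketch` and all dimensions are hypothetical. The addition and concatenation steps follow the "+" and "C" operations named in the pipeline figure caption.

```python
import random

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

class MultiAdapterSketch:
    """Toy stand-in for the three adapters (all shapes are assumptions)."""

    def __init__(self, in_dim, out_dim, seed=0):
        self.rng = random.Random(seed)
        self.out_dim = out_dim
        # Generality adapter: one set of weights shared by both modalities.
        self.generality = random_matrix(out_dim, in_dim, self.rng)
        # Modality adapters: one set of weights per modality.
        self.modality = {m: random_matrix(out_dim, in_dim, self.rng)
                         for m in ("rgb", "thermal")}
        # Instance adapters: one head per tracked object, added on demand.
        self.instance = {}

    def add_instance(self, name):
        self.instance[name] = random_matrix(2, 2 * self.out_dim, self.rng)

    def encode(self, x, modality):
        shared = linear(self.generality, x)
        specific = linear(self.modality[modality], x)
        # Parallel branches are summed element-wise ("+").
        return relu([s + p for s, p in zip(shared, specific)])

    def score(self, rgb, thermal, name):
        # Modality outputs are concatenated ("C") before the instance head.
        fused = self.encode(rgb, "rgb") + self.encode(thermal, "thermal")
        return linear(self.instance[name], fused)

net = MultiAdapterSketch(in_dim=4, out_dim=3)
net.add_instance("object_1")
scores = net.score([0.2, 0.5, 0.1, 0.9], [0.7, 0.3, 0.4, 0.0], "object_1")
print(len(scores))  # 2
```

In this sketch the shared weights see inputs from both modalities, while each modality adapter and each instance head sees only its own data, which is the separation the summary above describes.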

If this is right

Modality-shared cues become usable for fusion without separate weighting steps.
Instance-aware modeling captures object-specific changes that improve long-term tracking stability.
Parallel adapter layout reduces computational cost while preserving the three feature types.
End-to-end optimization allows the network to balance shared and specific information automatically.
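A rough way to see why a shared branch plus small adapters can be cheaper than two full modality-specific streams is to count learnable parameters. The sketch below is illustrative only: the kernel sizes and channel counts are hypothetical and are not taken from the paper, and it counts parameters rather than runtime, since the shared branch still has to be evaluated once per modality.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Number of learnable parameters in a 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

# Hypothetical layer shape, chosen only for illustration.
K_SHARED, K_ADAPTER, C_IN, C_OUT = 3, 1, 96, 256

# Option A: two independent full-size branches, one per modality.
two_stream = 2 * conv_params(K_SHARED, C_IN, C_OUT)

# Option B: one shared full-size branch plus a small adapter per modality.
parallel = (conv_params(K_SHARED, C_IN, C_OUT)
            + 2 * conv_params(K_ADAPTER, C_IN, C_OUT))

print(two_stream, parallel, round(parallel / two_stream, 3))
```

Whether the saving carries over to wall-clock speed depends on the actual layer shapes and on how the branches are scheduled, which this count does not model.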

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same adapter structure could be tested on other paired modalities such as RGB-depth or RGB-event tracking.
The instance adapter might transfer to single-modality trackers to handle appearance variation without thermal data.

Load-bearing premise

The three adapters can be trained together end-to-end without one type of learning interfering with or diminishing the others.

What would settle it

An ablation experiment on the RGBT tracking benchmarks in which removing any one adapter produces a measurable drop in tracking accuracy or success rate compared with the full MANet.
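The check described above can be written down as a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the variant labels are invented for illustration, and the scores are placeholders, not results from the paper.

```python
def ablation_drops(full_score, variant_scores):
    """Return how far each ablated variant falls below the full model."""
    return {name: full_score - score for name, score in variant_scores.items()}

def every_adapter_contributes(full_score, variant_scores, min_drop=0.0):
    """True if removing any single adapter lowers the score by more than min_drop."""
    drops = ablation_drops(full_score, variant_scores)
    return all(drop > min_drop for drop in drops.values())

# Placeholder numbers for illustration only, NOT results from the paper.
full = 0.70
variants = {"no_generality": 0.66, "no_modality": 0.67, "no_instance": 0.65}
print(every_adapter_contributes(full, variants, min_drop=0.01))
```

The `min_drop` margin is where "measurable" has to be made precise; a sensible value depends on the run-to-run variance of the tracker, which the abstract does not report.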

Figures

Figures reproduced from arXiv: 1907.07485 by Aihua Zheng, Andong Lu, Chenglong Li, Jin Tang, Zhengzheng Tu.

**Figure 1.** Pipeline of MANet. Herein, + and C denote the operations of addition and concatenation respectively. ReLU and LRN refer …

**Figure 2.** PR and SR curves of different tracking results on GTOT

**Figure 3.** PR and SR curves of different tracking results on …

**Figure 4.** SR evaluation results on various challenges compared with the state-of-the-art methods on RGBT234.

**Figure 5.** Visual examples of our tracker compared with four state-of-the-art baseline methods.

**Figure 6.** Comparison results of MANet and its variants on GTOT
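The PR and SR curves named in the captions are the precision rate and success rate used in tracking benchmarks. A minimal sketch of how such values are typically computed follows, assuming boxes are given as `(x, y, width, height)` tuples; the thresholds are parameters here because this page does not state which values each benchmark uses.

```python
import math

def center_error(a, b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_rate(preds, gts, dist_threshold):
    """Fraction of frames whose center error is within dist_threshold pixels."""
    hits = sum(center_error(p, g) <= dist_threshold for p, g in zip(preds, gts))
    return hits / len(gts)

def success_rate(preds, gts, overlap_threshold):
    """Fraction of frames whose overlap exceeds overlap_threshold."""
    hits = sum(iou(p, g) > overlap_threshold for p, g in zip(preds, gts))
    return hits / len(gts)

gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (20, 20, 10, 10)]
print(precision_rate(preds, gts, 5), success_rate(preds, gts, 0.5))  # 0.5 0.5
```

Sweeping the threshold and plotting the resulting rate gives one curve per tracker; a single SR score is commonly summarized as the area under that curve.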

Abstract

The task of RGBT tracking aims to take the complementary advantages from visible spectrum and thermal infrared data to achieve robust visual tracking, and has received more and more attention in recent years. Existing works focus on modality-specific information integration by introducing modality weights to achieve adaptive fusion or learning robust feature representations of different modalities. Although these methods could effectively deploy the modality-specific properties, they ignore the potential value of modality-shared cues as well as instance-aware information, which are crucial for effective fusion of different modalities in RGBT tracking. In this paper, we propose a novel Multi-Adapter convolutional Network (MANet) to jointly perform modality-shared, modality-specific and instance-aware feature learning in an end-to-end trained deep framework for RGBT tracking. We design three kinds of adapters within our network. Specifically, the generality adapter extracts shared object representations, the modality adapter encodes modality-specific information to deploy the complementary advantages of the modalities, and the instance adapter models the appearance properties and temporal variations of a certain object. Moreover, to reduce computational complexity for the real-time demands of visual tracking, we design a parallel structure of the generality adapter and the modality adapter. Extensive experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker against other state-of-the-art RGB and RGBT tracking algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Multi-Adapter convolutional Network (MANet) for RGBT tracking. It introduces three adapters within an end-to-end trained framework: a generality adapter to extract shared object representations, a modality adapter to encode modality-specific information, and an instance adapter to model appearance properties and temporal variations of a specific object. A parallel structure between the generality and modality adapters is used to reduce computational complexity, and extensive experiments on two RGBT tracking benchmarks are claimed to demonstrate superior performance over state-of-the-art RGB and RGBT trackers.

Significance. If the experimental results and ablations hold, the work could advance RGBT tracking by explicitly addressing modality-shared cues and instance-aware information that prior methods overlook, while maintaining real-time feasibility through the parallel design. The end-to-end joint optimization of the three adapter types is a potentially useful architectural contribution if validated.

major comments (2)

[Proposed method] Proposed method section (adapter design and joint optimization): the central claim that the generality, modality, and instance adapters can be jointly optimized end-to-end to capture shared cues, modality-specific properties, and instance-aware variations without interference is not supported by any analysis, loss-term decomposition, or gradient-flow argument; this assumption is load-bearing for the novelty of the three-adapter design.
[Experiments] Experiments section: the abstract states superior benchmark performance, yet no quantitative tables, ablation results isolating each adapter's contribution, or statistical significance tests are referenced in the provided description, preventing verification that the joint learning actually delivers the claimed complementary advantages.

minor comments (1)

Notation for the three adapters could be made more consistent (e.g., explicit symbols for each adapter output) to improve readability of the network diagram and equations.
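One possible notation that would address both the consistency comment and the interference question is sketched below. All symbols are assumptions introduced here for illustration, not the paper's own: $x_m$ is the input of modality $m$, $G$ is the generality adapter with shared parameters $\theta_G$, $M_m$ is the modality adapter with parameters $\theta_m$, $\sigma$ is the nonlinearity, and $\mathcal{L}$ is the tracking loss applied after concatenation.

```latex
\begin{align}
  a_m &= G(x_m;\theta_G) + M_m(x_m;\theta_m), \qquad m \in \{\mathrm{RGB}, \mathrm{T}\}, \\
  h_m &= \sigma(a_m), \qquad z = [\,h_{\mathrm{RGB}} \,;\, h_{\mathrm{T}}\,], \\
  \frac{\partial \mathcal{L}}{\partial \theta_G}
      &= \sum_{m} \frac{\partial \mathcal{L}}{\partial h_m}\,
         \frac{\partial h_m}{\partial a_m}\,
         \frac{\partial G(x_m;\theta_G)}{\partial \theta_G}, \\
  \frac{\partial \mathcal{L}}{\partial \theta_m}
      &= \frac{\partial \mathcal{L}}{\partial h_m}\,
         \frac{\partial h_m}{\partial a_m}\,
         \frac{\partial M_m(x_m;\theta_m)}{\partial \theta_m}.
\end{align}
```

Under this single-layer reading, the shared parameters accumulate gradient from both modalities while each modality adapter receives gradient only from its own input. That shows the parameter paths are separate, but it does not by itself establish non-interference, because both branches are driven by the same upstream signal $\partial \mathcal{L} / \partial a_m$.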

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below, providing clarifications and indicating where revisions will strengthen the manuscript.

Referee: [Proposed method] Proposed method section (adapter design and joint optimization): the central claim that the generality, modality, and instance adapters can be jointly optimized end-to-end to capture shared cues, modality-specific properties, and instance-aware variations without interference is not supported by any analysis, loss-term decomposition, or gradient-flow argument; this assumption is load-bearing for the novelty of the three-adapter design.

Authors: We agree that the original manuscript lacks an explicit loss decomposition or gradient-flow analysis to formally demonstrate non-interference among the three adapters. The design relies on architectural separation (parallel generic/modality branches plus per-instance adapter) and standard end-to-end training with a tracking loss, which empirically yields complementary features as shown by the ablations. To address this directly, we will add a dedicated paragraph in the revised Method section explaining the optimization dynamics and the role of the parallel structure in mitigating interference.

Revision: yes

Referee: [Experiments] Experiments section: the abstract states superior benchmark performance, yet no quantitative tables, ablation results isolating each adapter's contribution, or statistical significance tests are referenced in the provided description, preventing verification that the joint learning actually delivers the claimed complementary advantages.

Authors: The full manuscript contains quantitative tables (Section 4.1) reporting precision and success rates on RGBT234 and GTOT against both RGB and RGBT trackers, plus ablation studies (Section 4.3) that isolate the contribution of each adapter type and show cumulative gains when all three are combined. We will add statistical significance tests (e.g., paired t-tests on the success rates) in the revision to further substantiate the complementary advantages.

Revision: partial
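The paired t-test mentioned in the rebuttal can be sketched with the standard library alone. The example assumes one success-rate value per test sequence for each of two trackers; the function name is hypothetical and the numbers are placeholders, not measurements from the paper.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder per-sequence success rates, NOT results from the paper.
tracker_a = [0.71, 0.65, 0.80, 0.58, 0.69]
tracker_b = [0.68, 0.66, 0.74, 0.55, 0.63]
t, dof = paired_t_statistic(tracker_a, tracker_b)
print(round(t, 3), dof)
```

Turning `t` into a p-value requires the Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` computes both in one call when SciPy is available.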

Circularity Check

0 steps flagged

No significant circularity

The paper proposes a new convolutional network architecture (MANet) with three adapter types for joint modality-shared, modality-specific, and instance-aware feature learning in RGBT tracking. The central contribution is the network design itself plus empirical validation on two benchmarks; no derivation chain, equations, fitted parameters renamed as predictions, or uniqueness theorems are present. No self-citation load-bearing steps or ansatz smuggling occur. The architecture choices are presented as design decisions, not as outputs forced by prior self-referential results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim depends on the effectiveness of three newly introduced adapter components whose parameters are learned during end-to-end training on benchmark data. The key domain assumption is that standard deep learning optimization will successfully disentangle shared, modality-specific, and instance-aware representations without additional constraints.

free parameters (1)

adapter network parameters
All convolutional weights and adapter-specific parameters are fitted during training on the RGBT tracking benchmarks; no fixed values are stated in the abstract.

axioms (1)

domain assumption: End-to-end training of the multi-adapter network suffices to learn the desired modality-shared, modality-specific, and instance-aware features.
Invoked in the description of the end-to-end trained deep framework for RGBT tracking.

invented entities (3)

generality adapter (no independent evidence)
purpose: extract shared object representations
New component introduced to capture modality-shared cues.

modality adapter (no independent evidence)
purpose: encode modality-specific information
New component introduced to deploy complementary advantages of RGB and thermal data.

instance adapter (no independent evidence)
purpose: model appearance properties and temporal variations of a specific object
New component introduced to capture instance-aware information.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

[1] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional siamese networks for object tracking. In Proceedings of IEEE European Conference on Computer Vision, 2016.

[2] B. Chen, D. Wang, P. Li, S. Wang, and H. Lu. Real-time actor-critic tracking. In Proceedings of IEEE Conference on European Conference on Computer Vision, 2018.

[3] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[4] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

[5] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of IEEE European Conference on Computer Vision, 2016.

[6] R. Gade and T. B. Moeslund. Thermal cameras and applications: a survey. Machine Vision and Applications, 25(1):245–262, 2014.

[7] B. Han, J. Sim, and H. Adam. Branchout: Regularization for online ensemble tracking with convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[8] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In Proceedings of IEEE International Conference on Computer Vision, 2011.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[10] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning, 2015.

[12] I. Jung, J. Son, M. Baek, and B. Han. Real-time mdnet. In Proceedings of IEEE European Conference on Computer Vision, 2018.

[13] I. Jung, J. Son, M. Baek, and B. Han. Real-time mdnet. In Proceedings of IEEE Conference on European Conference on Computer Vision, 2018.

[14] N. Ketkar. Stochastic gradient descent. Optimization, 2014.

[15] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim. Sowp: Spatially ordered and weighted patch descriptor for visual tracking. In Proceedings of IEEE International Conference on Computer Vision, 2015.

[16] X. Lan, M. Ye, S. Zhang, and P. C. Yuen. Robust collaborative discriminative learning for rgb-infrared tracking. In Proceedings of the AAAI Conference on Artificial Intelligence,

[17] C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12):5743–5756, 2016.

[18] C. Li, X. Liang, Y. Lu, N. Zhao, and J. Tang. Rgb-t object tracking: Benchmark and baseline. arXiv: 1805.08982,

[19] C. Li, X. Wang, L. Zhang, J. Tang, H. Wu, and L. Lin. Weld: Weighted low-rank decomposition for robust grayscale-thermal foreground detection. IEEE Transactions on Circuits and Systems for Video Technology, 25(12):5743–5756, 2017.

[20] C. Li, X. Wu, N. Zhao, X. Cao, and J. Tang. Fusing two-stream convolutional neural networks for rgb-t object tracking. IEEE Transactions on Information Theory, 2018.

[21] C. Li, S. Xiang, W. Xiao, Z. Lei, and T. Jin. Grayscale-thermal object tracking via multitask laplacian sparse representation. IEEE Transactions on Systems Man and Cybernetics Systems, 47(4):673–681, 2017.

[22] C. Li, N. Zhao, Y. Lu, C. Zhu, and J. Tang. Weighted sparse representation regularized graph learning for rgb-t object tracking. In Proceedings of ACM International Conference on Multimedia, 2017.

[23] C. Li, C. Zhu, Y. Huang, J. Tang, and L. Wang. Cross-modal ranking with soft consistency and noisy labels for robust rgb-t tracking. In Proceedings of European Conference on Computer Vision, 2018.

[24] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[25] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,

[26] E. Park and A. C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In Proceedings of IEEE European Conference on Computer Vision, 2018.

[27] S. Pu, Y. Song, C. Ma, H. Zhang, and M. H. Yang. Deep attentive tracking via reciprocative learning. In Proceedings of IEEE Conference on Neural Information Processing Systems, 2018.

[28] S. A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In Proceedings of IEEE Conference on Neural Information Processing Systems, 2017.

[29] S. A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of International Conference on Learning Representations,

[31] A. Wu, W.-S. Zheng, H. Yu, S. Gong, and J. Lai. Rgb-infrared cross-modality person re-identification. In Proceedings of IEEE International Conference on Computer Vision,

[32] D. Xu, W. Ouyang, E. Ricci, X. Wang, and N. Sebe. Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[33] S. Yun and et al. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[34] J. Zhang, S. Ma, and S. Sclaroff. MEEM: robust tracking via multiple experts using entropy minimization. In Proceedings of IEEE European Conference on Computer Vision, 2014.

[35] Z. Zhipeng, P. Houwen, and W. Qiang. Deeper and wider siamese networks for real-time visual tracking. arXiv: 1901.01660, 2019.

[36] Y. Zhu, C. Li, Y. Lu, L. Lin, B. Luo, and J. Tang. Fanet: Quality-aware feature aggregation network for rgb-t tracking. arXiv:1811.09855, 2018.
