A Strong Feature Representation for Siamese Network Tracker
Pith reviewed 2026-05-24 20:03 UTC · model grok-4.3
The pith
A fine-tuned VGG16 backbone plus AlexNet branch and channel attention lets a Siamese tracker reach high accuracy on OTB and VOT benchmarks at 41 FPS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The SiamPF tracker builds a stronger feature representation in four steps: it uses a fine-tuned, modified VGG16 as the backbone; adds an AlexNet-like branch after the third convolutional layer and merges it with the backbone response map; inserts a channel attention block to adaptively weight features; and modifies the APCE to reduce interference in the response map. The paper reports excellent results on OTB-2013, OTB-2015, VOT2015 and VOT2017 while running at 41 FPS on a GTX 1080Ti.
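The response-map merge can be illustrated with a minimal sketch. It assumes single-channel feature maps, a plain sliding-window cross-correlation of the kind used in fully-convolutional Siamese trackers, and a weighted sum as the merge rule. The fusion weight `alpha` and all function names are illustrative assumptions, not details taken from the paper.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (valid mode) and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def merge_responses(backbone_resp, branch_resp, alpha=0.5):
    """Blend two same-sized response maps; alpha is a hypothetical fusion weight."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(backbone_resp, branch_resp)
    ]


def argmax_2d(response):
    """Return (row, col) of the highest response value."""
    best = max(
        (value, i, j)
        for i, row in enumerate(response)
        for j, value in enumerate(row)
    )
    return best[1], best[2]
```

In a real tracker each map would come from a different feature extractor (here, the backbone and the added branch), and the peak of the merged map would give the estimated target position in the search region.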
What carries the argument
The merged response map from the fine-tuned VGG16 backbone and the added AlexNet-like branch, refined by channel attention and modified APCE.
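APCE (average peak-to-correlation energy) scores how sharply a response map peaks: the squared gap between its maximum and minimum, divided by the mean squared deviation of all values from the minimum. The sketch below computes this standard form as commonly defined in the correlation-filter tracking literature. The paper's modification is not described on this page, so it is not reproduced here; the function name is an illustrative choice.

```python
def apce(response):
    """Average peak-to-correlation energy of a 2-D response map (standard form)."""
    values = [v for row in response for v in row]
    f_max, f_min = max(values), min(values)
    # Mean squared deviation of every response value from the minimum.
    energy = sum((v - f_min) ** 2 for v in values) / len(values)
    if energy == 0.0:
        return 0.0  # flat map: there is no peak to measure
    return (f_max - f_min) ** 2 / energy
```

A single sharp peak gives a high APCE, while a map with several strong side lobes gives a lower one, which is why the measure is used as a confidence signal for the response map.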
If this is right
- Siamese trackers can match the accuracy of non-Siamese methods on OTB and VOT benchmarks when trained only on ILSVRC2015-VID.
- The real-time speed of 41 FPS on a GTX 1080Ti is preserved despite the added components.
- Channel attention and modified APCE can be combined with deeper backbones to focus the tracker on target features without extra data.
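For readers unfamiliar with channel attention, here is a minimal sketch of one common design, a squeeze-and-excitation style gate. This is an assumption for illustration: the page does not specify the layout of the paper's block, and the weight matrices `w1` and `w2` stand in for parameters that would be learned during training.

```python
import math


def channel_attention(features, w1, w2):
    """Reweight channels: global average pool -> FC + ReLU -> FC + sigmoid.

    features: list of C channels, each a 2-D list of floats.
    w1: (hidden x C) weights, w2: (C x hidden) weights; both hypothetical.
    """
    # Squeeze: one scalar descriptor per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features
    ]
    # Excitation: bottleneck MLP producing one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * v for v in row] for row in ch] for ch, gate in zip(features, gates)
    ]
    return scaled, gates
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a small gate are suppressed, which is the sense in which the block "adaptively weights" features.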
Where Pith is reading between the lines
- The same backbone fusion pattern could be tested on other Siamese tasks such as verification to see if accuracy gains appear without new training regimes.
- If the gains rely on the specific benchmarks used, the advantage might narrow or disappear on newer or more challenging video sets.
- The work implies that post-convolution feature merging plus attention can serve as a lightweight upgrade path for existing real-time trackers.
Load-bearing premise
That this exact combination of VGG16 fine-tuning, added AlexNet branch, channel attention, and modified APCE will strengthen features enough to close the accuracy gap without overfitting or slowing the tracker on the listed benchmarks.
What would settle it
Evaluating the full SiamPF pipeline versus an ablated version that removes the AlexNet-like branch on a fresh tracking dataset and checking whether accuracy drops below the claimed level while speed stays the same.
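A comparison of this kind needs a common accuracy metric. The sketch below computes an overlap-based success score in the style of the OTB success plot: the fraction of frames whose IoU exceeds a threshold, averaged over evenly spaced thresholds. The `(x, y, w, h)` box format, the 21-step threshold grid and the function names are assumptions for illustration, not the benchmark's official code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(predicted, ground_truth, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    thresholds = [k / (steps - 1) for k in range(steps)]
    rates = [
        sum(1 for o in overlaps if o > t) / len(overlaps) for t in thresholds
    ]
    return sum(rates) / len(rates)
```

Running `success_auc` on the per-frame boxes of the full tracker and of the variant without the AlexNet-like branch, over the same sequences, would give the two numbers the test calls for.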
Original abstract
Object tracking has important application in assistive technologies for personalized monitoring. Recent trackers choosing AlexNet as their backbone to extract features have gained great success. However, AlexNet is too shallow to form a strong feature representation, the tracker based on the Siamese network have an accuracy gap compared with state-of-the-art algorithms. To solve this problem, this paper proposes a tracker called SiamPF. Firstly, the modified pre-trained VGG16 network is fine-tuned as the backbone. Secondly, an AlexNet-like branch is added after the third convolutional layer and merged with the response map of the backbone network to form a preliminary strong feature representation. And then, a channel attention block is designed to adaptively select the contribution features. Finally, the APCE is modified to process the response map to reduce interference and focus the tracker on the target. Our SiamPF only used ILSVRC2015-VID for training, but it achieved excellent performance on OTB-2013 / OTB-2015 / VOT2015 / VOT2017, while maintaining the real-time performance of 41FPS on the GTX 1080Ti.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SiamPF, a Siamese network tracker that replaces the standard AlexNet backbone with a fine-tuned VGG16, adds an AlexNet-like branch after conv3 whose response map is merged with the backbone, inserts a channel attention block to weight features, and modifies the APCE criterion on the final response map. It claims this architecture produces a stronger feature representation that delivers excellent results on OTB-2013, OTB-2015, VOT2015 and VOT2017 while using only ILSVRC2015-VID for training and running at 41 FPS on a GTX 1080Ti.
Significance. If the performance numbers can be reproduced and the gains attributed to the proposed components, the work would show that modest, targeted modifications to the feature extractor can narrow the accuracy gap between real-time Siamese trackers and slower state-of-the-art methods without extra training data or speed penalties, which is relevant for applications such as assistive monitoring.
major comments (2)
- [Experiments] Experiments section: the reported benchmark scores are presented without any description of the training protocol, data splits, hyper-parameters, baseline re-implementations, or statistical measures (error bars, multiple seeds). This absence prevents verification of the central claim that the architecture achieves the stated performance.
- [Method / Experiments] Method and Experiments sections: no component-wise ablation tables or figures are supplied that quantify the contribution of the VGG16 backbone, the added AlexNet-like branch, the channel attention block, or the modified APCE. Without these deltas the attribution of gains specifically to the proposed feature-representation design remains unsecured.
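The statistical measures and per-component deltas requested above could be reported with a small helper along the following lines. This is a sketch only: the function names and the dictionary layout are hypothetical, and no scores from the paper are assumed.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def ablation_deltas(full_scores, ablated_scores):
    """Mean gain of the full model over each ablated variant.

    ablated_scores maps a variant name (hypothetical) to its per-seed scores.
    """
    full_mean, _ = summarize_runs(full_scores)
    return {
        name: full_mean - summarize_runs(scores)[0]
        for name, scores in ablated_scores.items()
    }
```

Reporting each delta next to the seed-to-seed spread makes it possible to judge whether a component's gain exceeds run-to-run noise.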
minor comments (1)
- [Abstract] Abstract, line 3: 'the tracker based on the Siamese network have an accuracy gap' contains a subject-verb agreement error.
Simulated Author's Rebuttal
We thank the referee for the constructive comments regarding the need for greater experimental detail and component analysis. We will revise the manuscript to address these points and improve reproducibility and attribution of results.
Point-by-point responses
- Referee: [Experiments] Experiments section: the reported benchmark scores are presented without any description of the training protocol, data splits, hyper-parameters, baseline re-implementations, or statistical measures (error bars, multiple seeds). This absence prevents verification of the central claim that the architecture achieves the stated performance.
  Authors: We agree that additional details are required for full verification. In the revised manuscript we will expand the Experiments section with a complete description of the training protocol on ILSVRC2015-VID, the exact data splits and augmentation used, all hyper-parameters, the procedure for obtaining or re-implementing baseline results, and any available statistical measures (including multiple-run results where feasible).
  Revision: yes
- Referee: [Method / Experiments] Method and Experiments sections: no component-wise ablation tables or figures are supplied that quantify the contribution of the VGG16 backbone, the added AlexNet-like branch, the channel attention block, or the modified APCE. Without these deltas the attribution of gains specifically to the proposed feature-representation design remains unsecured.
  Authors: We concur that explicit ablations would strengthen the paper. The revised version will include new ablation tables and/or figures that isolate the performance contribution of each component (VGG16 backbone, AlexNet-style branch, channel attention module, and modified APCE) relative to a plain Siamese baseline, thereby clarifying the source of the reported gains.
  Revision: yes
Circularity Check
No circularity: purely empirical architecture proposal with no derivations or fitted predictions
Full rationale
The paper describes an empirical Siamese tracker design (modified VGG16 backbone, added AlexNet-like branch, channel attention, modified APCE) trained on ILSVRC2015-VID and evaluated on standard benchmarks. No equations, first-principles derivations, parameter fits presented as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text. All performance claims rest on experimental results rather than any reduction of outputs to inputs by construction. This is the expected non-finding for an applied CV architecture paper.