Siamese Network Based on MLP and Multi-head Cross Attention for Visual Object Tracking

Li, Piaoyang; Lan, Shiyong; Sun, Shipeng; Wang, Wenwu; Gao, Yongyang; Yang, Yongyu; Yu, Guangyu

doi:10.1007/978-3-031-44204-9_35

Piaoyang Li¹, Shiyong Lan¹, Shipeng Sun¹, Wenwu Wang², Yongyang Gao¹, Yongyu Yang¹ & Guangyu Yu¹

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14263)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Visual object tracking is an important prerequisite in many applications. However, the performance of a tracking system often depends on the quality of the target's feature representation and on whether the system can identify the best match of the target template in the search area. To address these challenges, we propose a new method based on a Multi-Layer Perceptron (MLP) and multi-head cross attention. First, a new MLP-based module is designed to enhance the input features by refining the internal association between their spatial and channel dimensions. Second, an improved head network is constructed for predicting the location of the target, in which multi-head cross attention is used to find the optimal matching between the template and the search area. Experiments on four datasets show that the proposed method offers competitive tracking performance compared with several recent baseline methods. The code will be available at https://github.com/SYLan2019/MLP-MHCA.
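To make the matching step more concrete, the following is a minimal NumPy sketch of generic multi-head cross attention, in which search-area tokens act as queries and template tokens supply keys and values. It is not the authors' implementation: the function name `multi_head_cross_attention`, the projection matrices `Wq`, `Wk`, `Wv`, `Wo`, the token counts, and the feature width are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(search, template, params, num_heads):
    """Search tokens (queries) attend to template tokens (keys and values).

    search:   array of shape (n_search, d_model)
    template: array of shape (n_template, d_model)
    params:   dict of hypothetical projections Wq, Wk, Wv, Wo,
              each of shape (d_model, d_model)
    Returns the attended search features (n_search, d_model) and the
    attention weights (num_heads, n_search, n_template).
    """
    n_s, d_model = search.shape
    n_t = template.shape[0]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(x, n):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(search @ params["Wq"], n_s)
    k = split_heads(template @ params["Wk"], n_t)
    v = split_heads(template @ params["Wv"], n_t)

    # Scaled dot-product scores between every search and template token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of template values, then merge the heads back together.
    heads = weights @ v
    merged = heads.transpose(1, 0, 2).reshape(n_s, d_model)
    return merged @ params["Wo"], weights


# Illustrative sizes only (assumed): 25 search tokens, 9 template tokens.
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
params = {
    name: rng.standard_normal((d_model, d_model)) * 0.1
    for name in ("Wq", "Wk", "Wv", "Wo")
}
search = rng.standard_normal((25, d_model))
template = rng.standard_normal((9, d_model))

out, attn = multi_head_cross_attention(search, template, params, num_heads)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over template tokens, so a sharply peaked row indicates a search location that matches one template position strongly. How the paper's head network consumes such weights is not specified in the abstract.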

This work was funded by the 2035 Innovation Pilot Program of Sichuan University, China.

Author information

Authors and Affiliations

College of Computer Science, Sichuan University, Chengdu, China
Piaoyang Li, Shiyong Lan, Shipeng Sun, Yongyang Gao, Yongyu Yang & Guangyu Yu
University of Surrey, Guildford, GU2 7XH, UK
Wenwu Wang

Corresponding author

Correspondence to Shiyong Lan.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Lancaster University, Lancaster, UK
Plamen Angelov
Teesside University, Middlesbrough, UK
Chrisina Jayne

About this paper

Cite this paper

Li, P. et al. (2023). Siamese Network Based on MLP and Multi-head Cross Attention for Visual Object Tracking. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14263. Springer, Cham. https://doi.org/10.1007/978-3-031-44204-9_35

DOI: https://doi.org/10.1007/978-3-031-44204-9_35
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44203-2
Online ISBN: 978-3-031-44204-9
eBook Packages: Computer Science, Computer Science (R0)
