Abstract: To improve the tracking accuracy of moving targets in video under interference factors such as deformation, scale variation, and similar targets, a Siamese network model with combined attention is proposed. First, a lightweight network, MobileNetV3, is adopted as the backbone to extract object features. Then, to strengthen the model's attention to the key features of the target, a structure that integrates combined channel and spatial attention into the Siamese network is proposed. Finally, the cross-correlation results of the feature vectors from the attention module and the non-attention module are weighted and fused to produce the response map, from which the tracking result is obtained. Experimental results show that the proposed algorithm tracks well on the OTB50 and OTB100 datasets: the average accuracy and success rate over the two datasets reach 78.5% and 58.3%, respectively. Moreover, the algorithm still tracks well when multiple interference factors, such as deformation, scale variation, and similar targets, are present, which indicates good robustness.
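The final fusion step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the hand-set channel weights standing in for a learned attention module, and the fusion weight of 0.5 are all assumptions made for illustration; the abstract does not give the actual weighting scheme.

```python
import numpy as np

def cross_correlate(template, search):
    """Slide a (C, h, w) template over a (C, H, W) search feature map.

    Returns an (H-h+1, W-w+1) response map of unnormalized correlations.
    """
    _, h, w = template.shape
    _, H, W = search.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[:, i:i + h, j:j + w])
    return out

def fuse_responses(r_att, r_plain, weight=0.5):
    """Weighted fusion of the two response maps (weight is a hypothetical value)."""
    return weight * r_att + (1.0 - weight) * r_plain

def peak_location(response):
    """Row and column of the maximum response, taken as the target position."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(k) for k in idx)

# Toy features: the target occupies rows 3..5 and columns 4..6 of the search map.
search = np.zeros((2, 8, 8))
search[:, 3:6, 4:7] = 1.0
template = np.ones((2, 3, 3))

# Assumed stand-in for channel attention: fixed per-channel weights.
channel_weights = np.array([2.0, 0.5]).reshape(2, 1, 1)

r_plain = cross_correlate(template, search)
r_att = cross_correlate(template * channel_weights, search * channel_weights)
fused = fuse_responses(r_att, r_plain, weight=0.5)
peak = peak_location(fused)
```

In this toy case both branches peak at the same position, so the fused map does too. In a real tracker the two maps can disagree, and the fusion weight then controls how much the attention branch influences the final position.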