Abstract: To address the challenges of low resolution, severe occlusion, and significant variations in person pose and shape, this paper proposes a new method for person re-identification (ReID) in surveillance videos based on multimodal information fusion, using YOLOv9 as the backbone network in combination with the multimodal model CLIP (Contrastive Language-Image Pre-training). The method comprises two stages. In the first stage, a ReID-YOLO network is constructed to enhance person feature detection under challenging conditions. A receptive-field enhancement module and deformable convolution are introduced to improve feature extraction for persons with diverse poses and shapes, and a spatially enhanced attention mechanism is employed to model relationships among person features and recover occluded information. In addition, a loss function based on normalized Gaussian distance is designed to increase sensitivity to low-resolution person features. Together, these strategies improve the accuracy and robustness of person feature detection in surveillance videos affected by low resolution, pose variation, shape deformation, and occlusion. In the second stage, the multimodal model CLIP is introduced to improve overall accuracy and scene generalization. By leveraging CLIP's image-text alignment capability, person targets extracted in the first stage are predicted using the discriminative features provided by ReID-YOLO. This fusion strategy mitigates CLIP's excessive reliance on global scene information while compensating for the limited scene awareness and semantic parsing capability of YOLO-based networks. Experimental results, including evaluations under challenging conditions such as low resolution, ablation studies, and cross-identity scenarios, demonstrate that the proposed method achieves outstanding performance in video-based person re-identification.
It outperforms YOLO-series networks and seven other state-of-the-art video re-identification models, showing considerable promise for practical application.