Surveillance video person re-identification under multi-modal information fusion
DOI:
Authors:
Affiliations: 1. School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin 541004, China; 2. Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, China

CLC classification: TP391.41; TH74

Funding: National Natural Science Foundation of China (42361071, 42261061); Guangxi Key Research and Development Program (Guike FN2504240020); Ningbo Science and Technology Program (2024Z016)




    Abstract:

    To address the difficulties that surveillance-video person re-identification faces with low-resolution small targets, large pose and shape variations, and occlusion, this paper proposes a new multimodal-information-fusion method, using YOLOv9 as the backbone network combined with the multimodal pre-trained model CLIP (contrastive language-image pre-training). The method comprises two stages. In the first stage, a ReID-YOLO (re-identification with YOLO) network is constructed to enhance person feature detection under challenging conditions: a receptive-field enhancement module and deformable convolution improve feature extraction for persons with diverse poses and shapes; a spatially enhanced attention mechanism models the relationships among person features to restore occluded information; and a normalized Gaussian distance-based loss increases sensitivity to low-resolution person features. Together, these designs improve the accuracy and robustness of person feature detection in surveillance video affected by low resolution, pose variation, shape deformation, and occlusion. In the second stage, CLIP is introduced to improve overall accuracy and scene generalization: leveraging CLIP's image-text alignment, the identities of the persons detected in the first stage are predicted from the discriminative visual features provided by ReID-YOLO. This fusion strategy mitigates CLIP's over-reliance on global scene information while compensating for the limited scene awareness and target semantic parsing of YOLO-based networks. Experiments under low-resolution and identity-overlap conditions, together with ablation studies, show that the proposed method delivers excellent video person re-identification performance, outperforming YOLO-series networks and seven other mainstream video person re-identification models, and holds considerable promise for practical applications.
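The "normalized Gaussian distance-based loss" mentioned in the abstract follows the general idea, used in tiny-object detection, of modeling bounding boxes as 2-D Gaussians and comparing them with a normalized Wasserstein distance, which stays informative even when small boxes barely overlap. A minimal sketch of that idea (the closed form and the normalizing constant `c` are common choices from the literature, not details confirmed by this paper):

```python
import math

def nwd(box1, box2, c=12.8):
    """Normalized Gaussian Wasserstein distance between two boxes.

    Each box is (cx, cy, w, h) and is modeled as a 2-D Gaussian
    N([cx, cy], diag(w^2/4, h^2/4)). The squared 2-Wasserstein distance
    between two such Gaussians has the closed form below; c is a
    dataset-dependent normalizer (treated here as a tunable assumption).
    Returns a similarity in (0, 1]; 1 means identical boxes.
    """
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    dist_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
               + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    return math.exp(-math.sqrt(dist_sq) / c)

def nwd_loss(pred, target):
    # Loss shrinks to 0 as the predicted box approaches the target box,
    # and, unlike IoU, degrades smoothly for small non-overlapping boxes.
    return 1.0 - nwd(pred, target)
```

Because the measure decays smoothly with center and size differences rather than collapsing to zero at non-overlap, it gives low-resolution small targets a usable gradient signal.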
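The second-stage identity prediction can be pictured as CLIP-style image-text matching: a detected person crop's embedding is compared against per-identity text embeddings by cosine similarity. A minimal NumPy sketch assuming precomputed embeddings (the function name, shapes, and temperature are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def clip_identity_predict(image_feat, text_feats, temperature=0.01):
    """Predict an identity by CLIP-style image-text alignment.

    image_feat: (d,) embedding of one detected person crop (supplied by
    the detector stage in the paper's pipeline; here just a vector).
    text_feats: (n, d) embeddings of n identity descriptions.
    Returns the index of the best-matching identity and the softmax
    distribution over all identities.
    """
    # L2-normalize so the dot product is cosine similarity, as in CLIP.
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # scaled cosine similarities
    # Numerically stable softmax over the n candidate identities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

The small temperature sharpens the distribution, mirroring the learned logit scale in CLIP; the detector-side features keep the matching focused on the person rather than the global scene.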

Cite this article:

Wu Jun, Chen Hui, Xu Gang, Zhao Xuemei, Chen Ruixing. Surveillance video person re-identification under multi-modal information fusion[J]. Chinese Journal of Scientific Instrument, 2026, 47(1): 270-286.


Online publication date: 2026-03-30