Abstract: Infrared and visible images exhibit complementary characteristics, making their fusion well suited to achieving high accuracy and robustness in target detection for applications such as autonomous driving. However, existing multimodal object detection algorithms often have large models and long inference times, making them unsuitable for deployment on edge devices, and direct fusion methods fail to fully exploit the strengths of the different modalities. To address these challenges, we propose a fusion object detection algorithm that integrates a gradient operator and an attention mechanism. The gradient operator is used to design a customized convolutional layer that captures image texture. In the infrared branch, coordinate attention is incorporated to enhance target localization. In addition, a weight generation network is introduced to adaptively balance the features of the two modalities. The algorithm is modular and lightweight, making it well suited to edge deployment. Experiments on benchmark datasets demonstrate that the proposed method achieves mAP@0.50 and mAP@0.5:0.95 scores that are 6.3% and 7.2% higher, respectively, than single-modal detection with visible images, and 11.3% and 9.8% higher than infrared detection. The inference frame rate reaches 22.7 FPS, meeting real-time processing requirements.

Keywords: object detection; dual-modal; f
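The abstract names two concrete components: a gradient-operator convolutional layer for texture extraction and a weight generation network for adaptive cross-modal fusion. As a rough illustration only, the sketch below implements both in PyTorch under stated assumptions: a fixed Sobel kernel is assumed for the gradient operator, and a sigmoid-gated per-channel blend is assumed for the weight generation network. All class names, tensor shapes, and the gating design are hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SobelTextureConv(nn.Module):
    """Customized convolutional layer built from a gradient operator.

    Sobel kernels are registered as non-trainable buffers and applied
    depthwise, so each input channel yields a horizontal and a vertical
    gradient response; the gradient magnitude serves as a texture map.
    """

    def __init__(self, channels: int):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.],
                           [-2., 0., 2.],
                           [-1., 0., 1.]])
        gy = gx.t()
        # Tile the (gx, gy) pair once per channel: shape (2*C, 1, 3, 3).
        kernel = torch.stack([gx, gy]).unsqueeze(1).repeat(channels, 1, 1, 1)
        self.register_buffer("kernel", kernel)
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise conv: group i produces gx and gy for input channel i.
        g = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        gx, gy = g[:, 0::2], g[:, 1::2]
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # gradient magnitude


class AdaptiveFusion(nn.Module):
    """Weight generation network (assumed form).

    Predicts per-channel weights in (0, 1) from the concatenated
    visible/infrared features and blends the two modalities with them.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_vis: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([f_vis, f_ir], dim=1))
        return w * f_vis + (1.0 - w) * f_ir


# Usage sketch on dummy feature maps.
x_vis = torch.randn(1, 16, 64, 64)
x_ir = torch.randn(1, 16, 64, 64)
texture = SobelTextureConv(16)(x_vis)      # (1, 16, 64, 64) texture map
fused = AdaptiveFusion(16)(x_vis, x_ir)    # (1, 16, 64, 64) fused features
```

Both modules are parameter-light (the Sobel layer is entirely fixed, and the gate is a single 1x1 convolution), which is consistent with the abstract's emphasis on a modular, lightweight design for edge deployment.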