Automotive Engineering ›› 2024, Vol. 46 ›› Issue (12): 2290-2302. doi: 10.19562/j.chinasae.qcgc.2024.12.015


SFW-YOLOv8 Complex Scene Video Vehicle Detection Model

Qin Zhu1,2, Shenyang Han2, Mingru Zeng2, Pinghong Lai3, Chuimao Wu2, Weiyi Hu2

  1. School of Public Policy and Management, Nanchang University, Nanchang 330036
    2. School of Information Engineering, Nanchang University, Nanchang 330036
    3. Jiangxi Provincial People's Hospital, Nanchang 330038
  • Received: 2024-04-26 Revised: 2024-06-12 Online: 2024-12-25 Published: 2024-12-20
  • Contact: Mingru Zeng E-mail: zeng_mr@163.com
  • Supported by: National Natural Science Foundation of China (72164027)



Abstract:

To address the difficulty that video vehicle detection models have in extracting rich target features in complex traffic monitoring scenarios, this paper makes full use of the spatio-temporal feature information of video images and constructs a new spatio-temporal feature fusion module, SF-Module, which applies the multi-head self-attention mechanism of the Transformer to extract and fuse the spatio-temporal features of the current and historical frames of video vehicle images, enriching the feature information of the targets. On this basis, the SF-Module is integrated into the neck network of YOLOv8 to mine the spatio-temporal features of video image sequences. Meanwhile, the WIoU loss function is introduced as the bounding-box regression loss to reduce the harmful gradients produced by low-quality annotation boxes, yielding the SFW-YOLOv8 video vehicle detection model. Finally, the SFW-YOLOv8 complex scene video vehicle detection model is evaluated on the UA-DETRAC dataset, with simulated rain and fog augmentation applied to part of the images to improve the generalization of the model. The experimental results show that SFW-YOLOv8 achieves mAP50 and mAP50:95 values of 79.1% and 63.6%, which are 1.7% and 3.3% higher than those of the YOLOv8 model, respectively, with an inference speed of 11 ms per frame, demonstrating excellent detection performance.
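The core idea of the SF-Module, querying current-frame features against both current and historical frames so each location can draw on temporal context, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name is made up, and random matrices stand in for learned Q/K/V projections.

```python
import numpy as np

def multi_head_attention_fuse(cur, hist, num_heads=4, seed=0):
    """Fuse current-frame tokens with historical-frame tokens via
    multi-head attention: queries come from the current frame, while
    keys/values span both frames, so each current token can attend to
    temporal context. cur, hist: (N, d) arrays of N tokens of width d."""
    n, d = cur.shape
    dh = d // num_heads                      # per-head channel width
    rng = np.random.default_rng(seed)
    # Random projections stand in for the learned Q/K/V weight matrices.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    kv = np.concatenate([cur, hist], axis=0)              # (2N, d) key/value pool
    q = (cur @ wq).reshape(n, num_heads, dh)
    k = (kv @ wk).reshape(2 * n, num_heads, dh)
    v = (kv @ wv).reshape(2 * n, num_heads, dh)

    out = np.empty_like(q)
    for h in range(num_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(dh)        # (N, 2N) scaled dot products
        scores -= scores.max(axis=1, keepdims=True)       # stabilize the softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)           # softmax over both frames
        out[:, h] = attn @ v[:, h]                        # weighted temporal mix
    return out.reshape(n, d)                              # fused (N, d) features
```

In the actual model the projections are trained end to end and the fused feature map is fed back into the YOLOv8 neck; here the sketch only shows the attention arithmetic.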

Key words: vehicle target detection, spatio-temporal feature fusion, Transformer, YOLOv8, attention mechanism
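The WIoU bounding-box regression loss mentioned above can be sketched from its published v1 form, in which a distance-based attention term scales the plain IoU loss so that center misalignment is penalized relative to the smallest enclosing box. This is an illustrative sketch only; the paper may use a later WIoU variant (v3 adds a dynamic, non-monotonic focusing factor on top of this).

```python
import math

def iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def wiou_v1_loss(pred, gt):
    """WIoU v1: the IoU loss (1 - IoU) scaled by exp(center distance^2 /
    enclosing-box diagonal^2), which grows as the predicted and
    ground-truth box centres drift apart."""
    cxp, cyp = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cxg, cyg = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])   # enclosing-box width
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])   # enclosing-box height
    # In the original formulation the enclosing-box term is detached from
    # gradient computation; plain floats carry no gradients here anyway.
    r = math.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2) / (wg ** 2 + hg ** 2))
    return r * (1.0 - iou(pred, gt))
```

A perfectly aligned prediction gives zero loss, while a shifted box is penalized more heavily than under the plain IoU loss alone.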