Automotive Engineering (汽车工程) ›› 2023, Vol. 45 ›› Issue (5): 759-767. doi: 10.19562/j.chinasae.qcgc.2023.05.005

Special topic: Intelligent and Connected Vehicle Technology - Perception & HMI & Evaluation, 2023


Driver's Attention Prediction Based on Multi-Level Temporal-Spatial Fusion Network

Lisheng Jin, Bingdong Ji, Baicang Guo

  1. School of Vehicle and Energy, Yanshan University, Qinhuangdao 066004
  • Received: 2022-10-16 Revised: 2022-11-25 Online: 2023-05-25 Published: 2023-05-26
  • Contact: Baicang Guo E-mail: guobaicang@ysu.edu.cn
  • Funding:
    National Key R&D Program of China (2021YFB3202200); National Natural Science Foundation of China (52072333); Natural Science Foundation of Hebei Province (F2022203054); Science and Technology Research Project of Hebei Province Higher Education Institutions (BJK2023026)



Abstract:

Humanoid driving is an important way to improve vehicle intelligence. Identifying and locating the driver's objects and regions of interest, and thereby quickly and accurately perceiving potential risks in the driving scene or providing the key information required for decision-making, can effectively enhance the functional interpretability and robustness of intelligent vehicles. In this paper, a lightweight multi-level temporal-spatial fusion network is designed on a hierarchical encoder-decoder architecture to establish a lightweight driver attention prediction model. First, MobileNetV2 is used as the backbone of the encoder to extract multi-level spatial features at four scales of the current frame; these features are stored in a memory module and stacked along the time dimension with the multi-level features extracted from historical frames, and the resulting temporal-spatial features across consecutive frames are passed to the decoder. Second, the decoder is designed on a hierarchical decoding structure, in which an inverted-bottleneck 3D convolution module implements the temporal-spatial fusion layer that fuses the temporal-spatial features on each independent branch. Finally, the prediction results that capture information at different scales on the four independent branches are fused to obtain the predicted driver attention map as the model output. The results show that, through encoding and decoding at multiple feature scales, the proposed model effectively exploits the temporal, spatial and scale information between the current and historical frames of a dynamic scene. Tests on the DADA-2000 and TDV datasets show that it outperforms current state-of-the-art models of the same kind on multiple metrics. The model size is 19 MB and the per-frame inference time is 0.02 s, achieving excellent lightweight and real-time performance. In summary, this study addresses the large size and poor real-time performance of driver attention prediction models for dynamic driving scenes in complex traffic environments, and provides theoretical support and application value for research on humanoid perception and decision-making of intelligent vehicles.
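The encode-store-fuse data flow described in the abstract can be sketched schematically. This is a minimal illustration only, not the authors' implementation: the average-pooling "encoder", the mean-based fusion, and all names and shapes (`step`, `SCALES`, `T`, 64×64 frames) are hypothetical stand-ins for the learned MobileNetV2 backbone and the inverted-bottleneck 3D convolution fusion layers of the paper.

```python
from collections import deque
import numpy as np

T = 4                      # historical frames kept in the memory module (assumed)
SCALES = [64, 32, 16, 8]   # spatial sizes of the four encoder feature levels (illustrative)

def encode(frame):
    """Stand-in for the MobileNetV2 encoder: produces features at four scales
    via simple average pooling (the real backbone is a learned CNN)."""
    feats = []
    for s in SCALES:
        k = frame.shape[0] // s
        f = frame[: s * k, : s * k].reshape(s, k, s, k).mean(axis=(1, 3))
        feats.append(f)
    return feats

memory = [deque(maxlen=T) for _ in SCALES]  # one temporal buffer per scale branch

def step(frame):
    """Process one frame: encode, push into memory, fuse over time and scale."""
    for buf, feat in zip(memory, encode(frame)):
        buf.append(feat)
    # per-branch temporal-spatial fusion: stack along time, then reduce
    # (the mean stands in for the paper's inverted-bottleneck 3D convolution)
    branch_preds = [np.stack(buf, axis=0).mean(axis=0) for buf in memory]
    # fuse the four branches: upsample each to the finest scale and average
    out = np.zeros((SCALES[0], SCALES[0]))
    for p in branch_preds:
        r = SCALES[0] // p.shape[0]
        out += np.kron(p, np.ones((r, r))) / len(branch_preds)
    return out

for _ in range(5):                        # feed a short clip of 64x64 frames
    attention = step(np.random.rand(64, 64))
print(attention.shape)                    # (64, 64) saliency-style attention map
```

The memory deques make the temporal stacking explicit: each branch always fuses the current frame with up to T-1 historical frames at its own scale before the branches are merged into a single attention map.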

Key words: automotive engineering, driver's attention prediction, temporal-spatial fusion, humanoid driving, saliency prediction, lightweight model