Administered by China Association for Science and Technology
Sponsored by China Society of Automotive Engineers
Published by AUTO FAN Magazine Co. Ltd.

Automotive Engineering ›› 2023, Vol. 45 ›› Issue (5): 759-767. doi: 10.19562/j.chinasae.qcgc.2023.05.005

Special Issue: Intelligent Connected Vehicle Technology: Perception, HMI & Evaluation (2023)


Driver’s Attention Prediction Based on Multi-Level Temporal-Spatial Fusion Network

Lisheng Jin,Bingdong Ji,Baicang Guo()   

  1. School of Vehicle and Energy, Yanshan University, Qinhuangdao 066004, China
  • Received: 2022-10-16  Revised: 2022-11-25  Online: 2023-05-25  Published: 2023-05-26
  • Contact: Baicang Guo  E-mail: guobaicang@ysu.edu.cn

Abstract:

Humanoid driving is an important way to improve the level of vehicle intelligence. By identifying and locating the targets and regions a driver attends to, an intelligent vehicle can quickly and accurately perceive potential risks in the driving scene or obtain the key information required for decision-making, which effectively enhances its functional understandability and robustness. In this paper, a lightweight multi-level temporal-spatial fusion network is designed based on a hierarchical encoder-decoder architecture, and a lightweight driver attention prediction model is established. Firstly, MobileNetV2 is used as the backbone of the encoder to extract multi-level spatial features of the current frame at four scales; these features are stored in a memory module and stacked along the time dimension with the multi-level features extracted from historical frames. The temporal-spatial features between consecutive frames are thereby obtained and passed to the decoder. Secondly, the decoder is designed with a hierarchical decoding structure, in which an inverted-bottleneck 3D convolution module forms the temporal-spatial fusion layer that fuses the temporal-spatial features on each independent branch. Finally, the predictions from the four independent branches, each capturing information at a different scale, are fused to produce the driver attention map as the model output. The results show that, by encoding and decoding at multiple feature scales, the proposed model effectively exploits the temporal, spatial and scale information between the current and historical frames of a dynamic scene. Tests on the DADA-2000 and TDV datasets show that it outperforms current state-of-the-art models of the same kind on multiple metrics. The model size is 19 MB and the single-frame inference time is 0.02 s, achieving excellent lightweight and real-time performance. In summary, this study addresses the large model size and poor real-time performance of driver attention prediction in dynamic driving scenes under complex traffic environments, and provides theoretical support and application value for research on humanoid perception and decision-making for intelligent vehicles.
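The encode-store-fuse-decode pipeline described above can be sketched as follows. This is a minimal illustrative sketch in NumPy only: the function names, the four scales, the history length, and the pooling/averaging stand-ins for MobileNetV2 features and the inverted-bottleneck 3D convolutions are all assumptions for exposition, not the authors' implementation.

```python
import numpy as np

SCALES = [(64, 64), (32, 32), (16, 16), (8, 8)]  # four encoder scales (illustrative)
T = 4                                            # history length in frames (illustrative)

def encode(frame):
    """Stand-in for the MobileNetV2 encoder: returns one spatial feature
    map per scale (here: simple block-average pooling of the input)."""
    feats = []
    for h, w in SCALES:
        fh, fw = frame.shape[0] // h, frame.shape[1] // w
        feats.append(frame[:h * fh, :w * fw]
                     .reshape(h, fh, w, fw).mean(axis=(1, 3)))
    return feats

def fuse_temporal(memory):
    """Per-scale temporal-spatial fusion: stack the T frames' features
    along a new time axis and reduce it (placeholder for the
    inverted-bottleneck 3D convolution module)."""
    return [np.stack(scale_feats, axis=0).mean(axis=0)
            for scale_feats in zip(*memory)]

def decode(fused):
    """Hierarchical decoding: upsample every branch to the finest scale
    and merge the four branch predictions into one attention map."""
    target_h, target_w = SCALES[0]
    maps = []
    for f in fused:
        rep_h, rep_w = target_h // f.shape[0], target_w // f.shape[1]
        maps.append(np.repeat(np.repeat(f, rep_h, axis=0), rep_w, axis=1))
    out = np.mean(maps, axis=0)
    return out / (out.sum() + 1e-8)  # normalize to a saliency distribution

# Memory module as a simple FIFO of the last T frames' multi-scale features
memory = [encode(np.random.rand(256, 256)) for _ in range(T)]
attention = decode(fuse_temporal(memory))
print(attention.shape)  # (64, 64)
```

In the actual model the temporal reduction is learned rather than a mean, but the data flow (per-frame multi-scale encoding, temporal stacking in memory, per-branch fusion, cross-scale merging) follows the structure the abstract describes.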

Key words: automotive engineering, driver’s attention prediction, temporal-spatial fusion, humanoid driving, saliency prediction, lightweight model