Automotive Engineering ›› 2023, Vol. 45 ›› Issue (9): 1617-1625. doi: 10.19562/j.chinasae.qcgc.2023.09.010

Special Topic: Intelligent and Connected Vehicle Technology: Perception & HMI & Evaluation, 2023


Lightweight Semantic Segmentation Method Based on Local Window Cross Attention

Zuliang Jin1, Hanbing Wei1, Liu Zheng1,2, Lu Lou1, Guofeng Zheng1

  1. School of Electromechanical and Vehicle Engineering, Chongqing Jiaotong University, Chongqing 400074
  2. University of British Columbia Okanagan, Kelowna, BC, Canada
  • Received: 2022-11-28  Revised: 2023-01-03  Online: 2023-09-25  Published: 2023-09-23
  • Corresponding author: Hanbing Wei, E-mail: hbwei@cqjtu.edu.cn
  • Funding: National Natural Science Foundation of China (52172381)


Abstract:

In the environmental perception task of autonomous vehicles, semantic segmentation of lanes, vehicles, and other targets from surround-view cameras in a unified bird's eye view (BEV) coordinate frame has attracted wide attention. To address the problem that inference latency rises linearly with the number of cameras, making real-time semantic segmentation difficult, this paper proposes a lightweight semantic segmentation method based on local window cross attention. An improved EdgeNeXt backbone network is adopted to extract features, and local window cross attention between BEV queries and image features is constructed to perform feature queries across camera perspective views. The fused BEV feature map is then decoded by upsampling residual blocks to obtain the BEV semantic segmentation result. Experimental results on the public nuScenes dataset show that the method achieves a mean IoU of 35.1% in static lane segmentation on the BEV map, 2.2% higher than the well-performing HDMapNet, and its inference speed is 58.2% faster than the fast GKT, reaching a frame rate of 106 FPS.

Key words: bird's eye view (BEV), semantic segmentation, local window, cross attention

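The local window cross attention described in the abstract restricts each BEV query to attend only to image features inside a small local window, so per-query cost depends on the window size rather than the full camera feature map. The following minimal pure-Python sketch illustrates that core computation for a single head; the function names, list-based shapes, and the simplification of using image features as both keys and values are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def window_cross_attention(bev_queries, image_features, window_indices, dim):
    """Each BEV query attends only to the image features whose indices fall
    inside its local window, instead of attending to every image feature.

    bev_queries:    list of query vectors, one per BEV grid cell
    image_features: list of feature vectors from the camera feature maps
    window_indices: window_indices[i] lists the feature indices query i sees
    dim:            feature dimension, used for the 1/sqrt(d) scaling
    """
    outputs = []
    scale = 1.0 / math.sqrt(dim)
    for q, idxs in zip(bev_queries, window_indices):
        keys = [image_features[j] for j in idxs]
        # scaled dot-product scores restricted to the local window
        scores = [scale * sum(qc * kc for qc, kc in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # weighted sum of the window features (used here as both key and value)
        out = [sum(w * k[c] for w, k in zip(weights, keys)) for c in range(dim)]
        outputs.append(out)
    return outputs
```

With a window of w features and dimension d, each query costs O(w·d) rather than O(H·W·d) for global attention over an H×W feature map, which is consistent with the latency reduction the abstract reports; the actual model would batch this over tensors with multiple heads.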