PolarSparse4D: Polar Parametrization for Vision-Based Surround-View Temporal Sparse 3D Object Detection

doi:10.19562/j.chinasae.qcgc.2025.06.018

Abstract

To address the trade-off between accuracy and real-time performance in surround-view, vision-based 3D object detection for autonomous driving, this paper proposes PolarSparse4D, a sparse-query-based temporal 3D object detection method with polar parametrization. The model consists of an image encoder, a 3D anchor box decoder, and an auxiliary quality assessment branch used during training. First, to avoid the detection-range limit imposed by parameter normalization, a feature encoding scheme is designed that decouples the center distance and the azimuth angle of the 3D anchor boxes. Second, an anchor spatial information interaction self-attention module and an anchor temporal information fusion module fuse the spatiotemporal information of the surround-view camera images in the polar coordinate system efficiently and accurately. Finally, an anchor box parameter quality assessment branch improves both detection accuracy and convergence speed. On the nuScenes validation set, the model reaches 41.3% mAP and 52.5% NDS at 19.2 FPS, which demonstrates its accuracy and real-time capability.

Keywords: 3D object detection, surround-view camera, polar parametrization, autonomous driving
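The idea of decoupling center distance from azimuth can be illustrated with a small sketch. This is not the paper's encoder, which is not described in this section: the function names and the choice of a (distance, sin, cos) code are assumptions made purely for illustration. The point it shows is that the azimuth is bounded by construction, while the distance can stay unnormalized and is therefore not tied to a fixed detection range.

```python
import math


def cartesian_to_polar(x, y):
    """Convert a box center (x, y) in the ego frame to (distance, azimuth)."""
    distance = math.hypot(x, y)
    azimuth = math.atan2(y, x)  # radians, in [-pi, pi]
    return distance, azimuth


def encode_center(x, y):
    """Hypothetical decoupled code: raw distance plus (sin, cos) of azimuth.

    The distance is left unnormalized, so no maximum range is baked in.
    The azimuth is represented by a bounded, wrap-around-free pair.
    """
    distance, azimuth = cartesian_to_polar(x, y)
    return [distance, math.sin(azimuth), math.cos(azimuth)]


def decode_center(code):
    """Invert encode_center back to Cartesian coordinates."""
    distance, sin_az, cos_az = code
    azimuth = math.atan2(sin_az, cos_az)
    return distance * math.cos(azimuth), distance * math.sin(azimuth)


code = encode_center(30.0, 40.0)
x, y = decode_center(code)
print(code[0], round(x, 6), round(y, 6))
```

A real model would feed such a code through learned embedding layers; the sketch only makes the geometric decoupling concrete.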

Chao Wei, Shuxin Sui, Luxing Li. PolarSparse4D: Polar Parametrization for Vision-Based Surround-View Temporal Sparse 3D Object Detection[J]. Automotive Engineering (汽车工程), 2025, 47(6): 1198-1206.

Table 1  Weight parameters of the loss function

$\lambda_1$	$\lambda_2$	$\lambda_3$	$\xi_1$	$\xi_2$	$\xi_3$
2.0	0.25	1.0	1.0	1.0	1.0

References

[1]	PHILION J, FIDLER S. Lift, splat, shoot: encoding images from arbitrary camera rigs by implicitly unprojecting to 3D[C]. Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV. Springer International Publishing, 2020: 194-210.
[2]	HUANG J, HUANG G, ZHU Z, et al. BEVDet: high-performance multi-camera 3D object detection in bird-eye-view[J]. arXiv preprint, 2021.
[3]	HUANG J, HUANG G. BEVDet4D: exploit temporal cues in multi-camera 3D object detection[J]. arXiv preprint, 2022.
[4]	XIE E, YU Z, ZHOU D, et al. M²BEV: multi-camera joint 3D detection and segmentation with unified birds-eye view representation[J]. arXiv preprint, 2022.
[5]	HUANG J, HUANG G. BEVPoolv2: a cutting-edge implementation of BEVDet toward deployment[J]. arXiv preprint, 2022.
[6]	LI Y, HUANG B, CHEN Z, et al. Fast-BEV: a fast and strong bird's-eye view perception baseline[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[7]	WANG Y, GUIZILINI V C, ZHANG T, et al. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries[C]. Conference on Robot Learning. PMLR, 2022: 180-191.
[8]	LI Z, WANG W, LI H, et al. BEVFormer: learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers[C]. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 1-18.
[9]	YANG C, CHEN Y, TIAN H, et al. BEVFormer v2: adapting modern image backbones to bird's-eye-view recognition via perspective supervision[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 17830-17839.
[10]	LIN X, LIN T, PEI Z, et al. Sparse4D: multi-view 3D object detection with sparse spatial-temporal fusion[J]. arXiv preprint, 2022.
[11]	LIN X, LIN T, PEI Z, et al. Sparse4D v2: recurrent temporal fusion with sparse model[J]. arXiv preprint, 2023.
[12]	LIN X, PEI Z, LIN T, et al. Sparse4D v3: advancing end-to-end 3D detection and tracking[J]. arXiv preprint, 2023.
[13]	LIU H, TENG Y, LU T, et al. SparseBEV: high-performance sparse 3D object detection from multi-camera videos[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 18580-18590.
[14]	JIANG Y, ZHANG L, MIAO Z, et al. PolarFormer: multi-camera 3D object detection with polar transformer[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(1): 1042-1050.
[15]	CHEN S, WANG X, CHENG T, et al. Polar parametrization for vision-based surround-view 3D detection[J]. arXiv preprint, 2022.
[16]	VASWANI A. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017.
[17]	KUHN H W. The Hungarian method for the assignment problem[J]. Naval Research Logistics Quarterly, 1955, 2(1-2): 83-97.
[18]	LIN T. Focal loss for dense object detection[J]. arXiv preprint, 2017.
[19]	WANG J, LI F, BI H. Gaussian focal loss: learning distribution polarized angle prediction for rotated object detection in aerial images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-13.
[20]	CAESAR H, BANKITI V, LANG A H, et al. nuScenes: a multimodal dataset for autonomous driving[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 11621-11631.
[21]	PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library[J]. Advances in Neural Information Processing Systems, 2019, 32.
[22]	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[23]	LEE Y, PARK J. CenterMask: real-time anchor-free instance segmentation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 13906-13915.
[24]	LOSHCHILOV I. Decoupled weight decay regularization[J]. arXiv preprint, 2017.
[25]	LIU Y, YAN J, JIA F, et al. PETRv2: a unified framework for 3D perception from multi-camera images[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 3262-3272.
[26]	LI Y, BAO H, GE Z, et al. BEVStereo: enhancing depth estimation in multi-view 3D object detection with temporal stereo[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(2): 1486-1494.
[27]	HUANG J, HUANG G. BEVPoolv2: a cutting-edge implementation of BEVDet toward deployment[J]. arXiv preprint, 2022.

Coordinate system	Method	Backbone	Image size	NDS↑	mAP↑	mATE↓	mASE↓	mAOE↓	mAVE↓	mAAE↓	FPS↑
Cartesian	DETR3D [7]	R50	900×1600	37.3%	30.2%	0.811	0.282	0.493	0.979	0.212	6.3
Cartesian	BEVDet [2]	R50	256×704	37.2%	28.6%	0.724	0.278	0.590	0.873	0.247	7.8
Cartesian	BEVDet [2]	R50	384×1056	38.1%	30.4%	0.719	0.272	0.555	0.903	0.257	4.2
Cartesian	PETRv2 [25]	R50	256×704	45.6%	34.9%	0.700	0.275	0.580	0.437	0.187	-
Cartesian	FastBEV [6]	R50	256×704	48.7%	35.4%	0.656	0.281	0.384	0.361	0.217	41.3
Cartesian	BEVStereo [26]	R50	256×704	50.0%	37.2%	0.598	0.270	0.438	0.367	0.190	-
Cartesian	BEVPoolv2 [27]	R50	256×704	52.6%	40.6%	0.572	0.275	0.463	0.275	0.188	16.6
Cartesian	BEVFormerV2 [9]	R50	-	52.9%	42.3%	0.618	0.273	0.413	0.333	0.188	-
Cartesian	Sparse4Dv2§ [11]	R50	256×704	52.1%	41.0%	0.609	0.272	0.449	0.330	0.187	19.9
Polar	PolarDETR-T [15]	R50	900×1600	45.8%	35.4%	0.748	0.277	0.432	0.539	0.197	~6.0
Polar	PolarDETR-T [15]	R101	900×1600	48.8%	38.3%	0.707	0.269	0.344	0.518	0.196	~3.5
Polar	PolarFormer [14]	R101	900×1600	47.0%	41.5%	0.657	0.263	0.405	0.911	0.139	-
Polar	PolarSparse4D	R50	256×704	52.5%	41.3%	0.599	0.270	0.508	0.256	0.185	19.2
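The NDS column can be cross-checked against the other columns. The sketch below uses the nuScenes detection score definition, NDS = (5·mAP + Σ(1 − min(1, mTP))) / 10, where the sum runs over the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE). The function name `nds` is an illustrative choice, and the inputs are the rounded values from the PolarSparse4D row, so agreement is only expected to the reported precision.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    Each error is clipped at 1 and turned into a score (1 - error);
    mAP carries a weight of 5, and the total is divided by 10.
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# PolarSparse4D row: mAP, then mATE, mASE, mAOE, mAVE, mAAE
score = nds(0.413, [0.599, 0.270, 0.508, 0.256, 0.185])
print(f"{score:.3f}")  # 0.525, matching the reported NDS of 52.5%
```

Because mAP is weighted five times as heavily as any single error term, a small mAP gain moves NDS more than a comparable improvement in one error metric.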

Class	AP↑	ATE↓	ASE↓	AOE↓	AVE↓	AAE↓
Car	0.616	0.395	0.144	0.070	0.201	0.196
Truck	0.314	0.607	0.202	0.070	0.219	0.212
Bus	0.403	0.740	0.194	0.086	0.440	0.212
Trailer	0.144	1.006	0.248	0.928	0.174	0.053
Construction vehicle	0.099	1.002	0.502	1.202	0.130	0.436
Pedestrian	0.489	0.575	0.291	0.594	0.339	0.175
Motorcycle	0.421	0.523	0.257	0.690	0.372	0.227
Bicycle	0.398	0.455	0.249	1.115	0.147	0.008
Traffic cone	0.667	0.305	0.316	-	-	-
Barrier	0.575	0.365	0.286	0.146	-	-
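As a consistency check, the overall mAP should be the unweighted mean of the ten per-class AP values above. The sketch below assumes that each per-class AP is already averaged over the nuScenes matching thresholds; the English class labels and the variable names are illustrative.

```python
# Per-class AP values copied from the table above.
per_class_ap = {
    "car": 0.616,
    "truck": 0.314,
    "bus": 0.403,
    "trailer": 0.144,
    "construction_vehicle": 0.099,
    "pedestrian": 0.489,
    "motorcycle": 0.421,
    "bicycle": 0.398,
    "traffic_cone": 0.667,
    "barrier": 0.575,
}

# mAP is the unweighted mean over the ten detection classes.
m_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(f"{m_ap:.1%}")  # 41.3%, matching the mAP reported in the abstract
```

The mean is dominated by the weakest classes (trailer and construction vehicle), which is where most of the remaining headroom lies.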

Decoder	NDS↑	mAP↑	GFLOPs↓	Params↓
NT+5T	52.3%	41.1%	167.995	47.905 M
NT+4T+NT	52.4%	41.1%	167.798	46.854 M
NT+3T+2NT	52.3%	41.0%	167.602	45.804 M
NT+2T+3NT	52.5%	41.2%	167.405	44.753 M
NT+1T+4NT	52.5%	41.3%	167.209	43.702 M

C	Y	Az	NDS↑	mAP↑	GFLOPs↓	Params↓
-	-	-	51.8%	40.8%	167.207	43.697 M
√	-	-	52.0%	40.9%	167.208	43.699 M
√	√	-	52.3%	41.0%	167.208	43.700 M
√	√	√	52.5%	41.3%	167.209	43.702 M
