[1] 王建,许叁征,甘浩,等. 智能汽车纵深防御关键技术及挑战[C]. 2018 中国汽车工程学会年会论文集,2018:287-291.
WANG J, XU S Z, GAN H, et al. Key technologies and challenges of intelligent vehicle in-depth defense[C]. 2018 SAE-China Annual Conference Proceedings, 2018: 287-291.

[2] 肖礼明,张发旺,陈良发,等. 依托多风格强化学习的车辆轨迹跟踪避撞控制[J]. 汽车工程,2024,46(6):945-955.
XIAO L M, ZHANG F W, CHEN L F, et al. Vehicle trajectory tracking and collision avoidance control based on multi-style reinforcement learning[J]. Automotive Engineering, 2024, 46(6): 945-955.
[3] DUAN J, REN Y, ZHANG F, et al. Encoding distributional soft actor-critic for autonomous driving in multi-lane scenarios[J]. IEEE Computational Intelligence Magazine, 2024, 19(2): 96-112.

[4] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing Atari with deep reinforcement learning[J]. arXiv preprint, 2013.

[5] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[J]. arXiv preprint, 2015.

[6] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[J]. arXiv preprint, 2017.
[7] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2016: 1928-1937.

[8] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]. International Conference on Machine Learning. PMLR, 2018: 1861-1870.

[9] DUAN J, GUAN Y, LI S E, et al. Distributional soft actor-critic: off-policy reinforcement learning for addressing value estimation errors[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 33(11): 6584-6598.

[10] DUAN J, WANG W, XIAO L, et al. DSAC-T: distributional soft actor-critic with three refinements[J]. arXiv preprint, 2023.
[11] YANG L, HUANG Z, LEI F, et al. Policy representation via diffusion probability model for reinforcement learning[J]. arXiv preprint, 2023.

[12] KANG B, MA X, DU C, et al. Efficient diffusion policies for offline reinforcement learning[J]. Advances in Neural Information Processing Systems, 2024, 36.

[13] ARENZ O, NEUMANN G, ZHONG M. Efficient gradient-free variational inference using policy search[C]. International Conference on Machine Learning. PMLR, 2018: 234-243.

[14] TANG Y, AGRAWAL S. Boosting trust region policy optimization by normalizing flows policy[J]. arXiv preprint, 2018.

[15] HAARNOJA T, TANG H, ABBEEL P, et al. Reinforcement learning with deep energy-based policies[C]. International Conference on Machine Learning. PMLR, 2017: 1352-1361.
[16] CROITORU F A, HONDRU V, IONESCU R T, et al. Diffusion models in vision: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[17] SONG Y, DURKAN C, MURRAY I, et al. Maximum likelihood training of score-based diffusion models[J]. Advances in Neural Information Processing Systems, 2021, 34: 1415-1428.

[18] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[J]. Advances in Neural Information Processing Systems, 2021, 34: 8780-8794.

[19] SOHL-DICKSTEIN J, WEISS E, MAHESWARANATHAN N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]. International Conference on Machine Learning. PMLR, 2015: 2256-2265.

[20] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.
[21] WANG Z, HUNT J J, ZHOU M. Diffusion policies as an expressive policy class for offline reinforcement learning[J]. arXiv preprint, 2022.

[22] AJAY A, DU Y, GUPTA A, et al. Is conditional generative modeling all you need for decision-making?[J]. arXiv preprint, 2022.

[23] CHEN Y, LI H, ZHAO D. Boosting continuous control with consistency policy[J]. arXiv preprint, 2023.

[24] CHI C, FENG S, DU Y, et al. Diffusion policy: visuomotor policy learning via action diffusion[J]. arXiv preprint, 2023.

[25] CODEVILLA F, SANTANA E, LÓPEZ A M, et al. Exploring the limitations of behavior cloning for autonomous driving[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 9329-9338.
[26] LY A O, AKHLOUFI M. Learning to drive by imitation: an overview of deep behavior cloning methods[J]. IEEE Transactions on Intelligent Vehicles, 2020, 6(2): 195-209.

[27] PSENKA M, ESCONTRELA A, ABBEEL P, et al. Learning a diffusion model policy from rewards via Q-score matching[J]. arXiv preprint, 2023.

[28] XIAO Z, KREIS K, VAHDAT A. Tackling the generative learning trilemma with denoising diffusion GANs[J]. arXiv preprint, 2021.

[29] WANG W, ZHANG Y, GAO J, et al. GOPS: a general optimal control problem solver for autonomous driving and industrial control applications[J]. Communications in Transportation Research, 2023, 3: 100096.

[30] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[J]. arXiv preprint, 2017.