Automotive Engineering ›› 2023, Vol. 45 ›› Issue (4): 527-540. doi: 10.19562/j.chinasae.qcgc.2023.04.001
Special topic: Intelligent and Connected Vehicle Technology - Planning & Decision-Making, 2023
Lisheng Jin, Guangde Han, Xianyi Xie, Baicang Guo, Guofeng Liu, Wentao Zhu
Received: 2022-10-10
Revised: 2022-11-15
Online: 2023-04-25
Published: 2023-04-19
Contact: Guangde Han, E-mail: hangd@stumail.ysu.edu.cn
Abstract:
The development of reinforcement learning has driven progress in autonomous driving decision-making technology, and intelligent decision-making has become a focal issue in the field of autonomous driving. Taking the evolution of reinforcement learning algorithms as its main thread, this paper reviews their in-depth application to decision-making for individual autonomous vehicles. Traditional, classical and frontier reinforcement learning algorithms are summarized and compared in terms of basic principles and theoretical modeling. Autonomous driving decision-making methods are classified by scenario, the influence of the observability of the environment state on modeling is analyzed, and the technical routes for applying typical reinforcement learning algorithms at different levels are elaborated. Research prospects for autonomous driving decision-making methods are then proposed, in the hope of providing a useful reference for research on autonomous driving decision-making schemes.
金立生,韩广德,谢宪毅,郭柏苍,刘国峰,朱文涛. 基于强化学习的自动驾驶决策研究综述[J]. 汽车工程, 2023, 45(4): 527-540.
Lisheng Jin,Guangde Han,Xianyi Xie,Baicang Guo,Guofeng Liu,Wentao Zhu. Review of Autonomous Driving Decision-Making Research Based on Reinforcement Learning[J]. Automotive Engineering, 2023, 45(4): 527-540.
Table 3
Typical MFRL algorithm updates

Algorithm | Year | Type | Main idea | Applicable action space | Main advantages and disadvantages
---|---|---|---|---|---
TRPO | 2015 | Policy-based | Constrains how far the policy parameters may change between the old and new policies | Discrete/continuous | Advantages: high reliability; the policy can improve monotonically. Disadvantages: the algorithm is rather complex
A3C | 2016 | Actor-Critic | Multiple worker threads train in parallel asynchronously | Discrete/continuous | Advantages: runs on multi-threaded CPUs with low resource consumption and reduces the correlation of interaction data. Disadvantages: each agent updates the global network asynchronously and independently, so the differing local policies may make the global update sub-optimal
A2C | 2017 | Actor-Critic | Advantage function | Discrete/continuous | Advantages: the advantage function alleviates the high-variance problem. Disadvantages: correlation in the interaction data degrades convergence
PPO | 2017 | Policy-based | Objective function that limits the magnitude of each policy update (see the sketch after this table) | Discrete/continuous | Advantages: simple to implement and widely applicable. Disadvantages: the Actor and Critic networks typically contain only two fully connected layers, so learning capability tends to degrade in high-dimensional state spaces
TD3 | 2018 | Actor-Critic | Incorporates double Q-learning into DDPG | Continuous | Advantages: mitigates DDPG's overestimation of Q values; training is more stable. Disadvantages: the quality of the randomly sampled training data is uneven, which affects performance
SAC | 2018 | Actor-Critic | Entropy regularization with soft policy iteration | Continuous | Advantages: improved exploration and robustness. Disadvantages: uniform random sampling and random network initialization make training slow and unstable
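To make the "objective that limits the update magnitude" idea of PPO in Table 3 concrete, the minimal PyTorch sketch below implements the standard clipped surrogate objective from the PPO paper [42]; the function name, the dummy batch and the clip coefficient 0.2 are illustrative assumptions rather than code from any of the surveyed works.

```python
import torch

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate objective, negated for minimization."""
    ratio = torch.exp(log_prob_new - log_prob_old)              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the elementwise minimum caps how much a single gradient step can
    # improve the surrogate, i.e. it limits the policy update magnitude.
    return -torch.min(unclipped, clipped).mean()

# Illustrative call on dummy data for a batch of 4 transitions
loss = ppo_clipped_loss(torch.randn(4), torch.randn(4), torch.randn(4))
```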
Table 6
Main frontier directions of RL

Direction | Main idea | Common algorithms
---|---|---
IRL | Infers the reward function in reverse; the agent learns from expert demonstrations how to make decisions in complex problems | Maximum margin planning
HRL | Decomposes a complex problem into sub-problems and solves them in a divide-and-conquer manner, by achieving goals step by step or through multi-level control | Option-Critic, HIRO
Meta RL | Draws on experience from past tasks to adapt quickly to new tasks and obtain the corresponding optimal policy | Meta-RL, Reptile
Offline RL | The agent learns knowledge only from previously collected data, without interacting with the environment (see the sketch after this table) | BEAR
MTDRL | Learns multiple related tasks together on the basis of shared representations | PathNet
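To make the Offline RL setting in Table 6 concrete, the minimal tabular sketch below runs Q-learning backups over a fixed log of transitions and never interacts with the environment; the tiny dataset, the state/action counts and the hyperparameters are made-up illustrations. Algorithms such as BEAR [112] additionally constrain the learned policy to stay close to the actions present in the data, which this naive version omits.

```python
import numpy as np

# Hypothetical logged transitions (state, action, reward, next_state, done);
# in offline RL this fixed dataset is all the agent ever sees.
dataset = [
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 2, False),
    (2, 1, 1.0, 2, True),
]

n_states, n_actions, gamma, lr = 3, 2, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

for _ in range(200):                          # repeated sweeps over the same static data
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])    # Q-learning backup on logged data only

policy = Q.argmax(axis=1)                     # greedy policy extracted without any env.step()
```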
1 | RAHIM M A, RAHMAN M A, RAHMAN M M, et al. Evolution of IoT-enabled connectivity and applications in automotive industry: a review[J]. Vehicular Communications, 2021, 27: 100285. |
2 | MA Y, WANG Z, YANG H, et al. Artificial intelligence applications in the development of autonomous vehicles: a survey[J]. IEEE/CAA Journal of Automatica Sinica, 2020, 7(2): 315-329. |
3 | 高振海, 闫相同, 高菲, 等. 仿驾驶员DDPG汽车纵向自动驾驶决策方法[J]. 汽车工程, 2021, 43(12): 1737-1744. |
GAO Z H, YAN X T, GAO F, et al. A driver-like decision-making method for longitudinal autonomous driving based on DDPG[J]. Automotive Engineering, 2021, 43(12): 1737-1744. | |
4 | 杨殿阁, 黄晋, 江昆, 等. 汽车自动驾驶[M]. 北京: 清华大学出版社, 2022: 348-350. |
YANG D G, HUANG J, JIANG K, et al. Automotive autonomous driving[M]. Beijing: Tsinghua University Press, 2022: 348-350. | |
5 | HUSSONNOIS M, JUN J Y. End-to-end autonomous driving using the Ape-X algorithm in Carla simulation environment[C]. 2022 Thirteenth International Conference on Ubiquitous and Future Networks (ICUFN), IEEE, 2022: 18-23. |
6 | CHOPRA R, ROY S S. End-to-end reinforcement learning for self-driving car[M]. Singapore: Advanced Computing and Intelligent Engineering, 2020: 53-61. |
7 | LI Q, PENG H, LI J, et al. A survey on text classification: from traditional to deep learning[J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2022, 13(2): 1-41. |
8 | ZHU Z, ZHAO H. A survey of deep RL and IL for autonomous driving policy learning[J]. IEEE Transactions on Intelligent Transportation Systems, 2021, 23(9): 14043-14065. |
9 | KAELBLING L P, LITTMAN M L, MOORE A W. Reinforcement learning: a survey[J]. Journal of Artificial Intelligence Research, 1996, 4: 237-285. |
10 | MINSKY M L. Theory of neural-analog reinforcement systems and its application to the brain-model problem[D]. New Jersey: Princeton University, 1954. |
11 | BELLMAN R E. Dynamic programming[M]. Princeton: Princeton University Press, 1957: 86-122. |
12 | HOWARD R A. Dynamic programming and Markov process[M]. Cambridge: MIT Press, 1960: 32-42. |
13 | WALTZ W G, FU K S. A heuristic approach to reinforcement control systems[J]. IEEE Transactions on Automatic Control, 1965, 10(4): 390-398. |
14 | SUTTON R S. Learning to predict by the methods of temporal differences[J]. Machine Learning, 1988, 3(1): 9-44. |
15 | WATKINS C J C H. Learning from delayed rewards[D]. Cambridge: University of Cambridge, 1989. |
16 | RUMMERY G A, NIRANJAN M. On-line q-learning using connectionist systems[R]. Department of Engineering, University of Cambridge, 1994. |
17 | BERTSEKAS D P, TSITSIKLIS J N. Neuro-dynamic programming: an overview[C]. Proceedings of 1995 34th IEEE Conference on Decision and Control. IEEE, 1995, 1: 560-564. |
18 | KOCSIS L, SZEPESVÁRI C. Bandit based monte-carlo planning[C]. European Conference on Machine Learning. Springer, Berlin, Heidelberg, 2006: 282-293. |
19 | KOCSIS L, SZEPESVÁRI C. Bandit based monte-carlo planning[C]. 17th European Conference on Machine Learning (ECML 2006), 2006: 282-293. |
20 | 张伟楠, 沈键, 俞勇. 动手学强化学习[M]. 北京: 人民邮电出版社, 2022: 348-350. |
ZHANG W N, SHEN J, YU Y. Hands-on reinforcement learning[M]. Beijing: Posts & Telecom Press, 2022: 348-350. | |
21 | PYEATT L D, HOWE A E. Learning to race: experiments with a simulated race car[C]. FLAIRS Conference, 1998: 357-361. |
22 | MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533. |
23 | WOLF P, HUBSCHNEIDER C, WEBER M, et al. Learning how to drive in a real world simulation with deep q-networks[C]. 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017: 244-250. |
24 | GLÄSCHER J, DAW N, DAYAN P, et al. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning[J]. Neuron, 2010, 66(4): 585-595. |
25 | GUAN Y, LI S E, DUAN J, et al. Direct and indirect reinforcement learning[J]. International Journal of Intelligent Systems, 2021, 36(8): 4439-4467. |
26 | FARAZI N P, ZOU B, AHAMED T, et al. Deep reinforcement learning in transportation research: a review[J]. Transportation Research Interdisciplinary Perspectives, 2021, 11(12): 100425. |
27 | HASSELT H V, GUEZ A, SILVER D. Deep reinforcement learning with double q-learning[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2016: 2094-2100. |
28 | WANG Z, SCHAUL T, HESSEL M, et al. Dueling network architectures for deep reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2016: 1995-2003. |
29 | SCHAUL T, QUAN J, ANTONOGLOU I, et al. Prioritized experience replay[J]. arXiv Preprint arXiv:, 2015. |
30 | OSBAND I, BLUNDELL C, PRITZEL A, et al. Deep exploration via bootstrapped DQN[C]. Advances in Neural Information Processing Systems 29, 2016: 4026-4034. |
31 | BELLEMARE M G, DABNEY W, MUNOS R. A distributional perspective on reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2017: 449-458. |
32 | FORTUNATO M, AZAR M G, PIOT B, et al. Noisy networks for exploration[J]. arXiv Preprint arXiv:, 2017. |
33 | HESSEL M, MODAYIL J, VAN HASSELT H, et al. Rainbow: combining improvements in deep reinforcement learning[C]. Thirty-second AAAI Conference on Artificial Intelligence, 2018: 3215-3222. |
34 | WANG H, LIU N, ZHANG Y, et al. Deep reinforcement learning: a survey[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(12): 1726-1744. |
35 | LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning[J]. arXiv Preprint arXiv:, 2015. |
36 | WITTEN I H. An adaptive optimal controller for discrete-time Markov environments[J]. Information and Control, 1977, 34(4): 286-295. |
37 | BARTO A G, SUTTON R S, ANDERSON C W. Neuronlike adaptive elements that can solve difficult learning control problems[J]. IEEE Transactions on Systems, Man and Cybernetics, 1983, (5): 834-846. |
38 | KONDA V, TSITSIKLIS J. Actor-critic algorithms[J]. Advances in Neural Information Processing Systems, 1999(12): 1008-1014. |
39 | SCHULMAN J, LEVINE S, ABBEEL P, et al. Trust region policy optimization[C]. International Conference on Machine Learning. PMLR, 2015: 1889-1897. |
40 | MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2016: 1928-1937. |
41 | CHRISTIANO P F, LEIKE J, BROWN T B, et al. Deep reinforcement learning from human preferences[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 4302-4310. |
42 | SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[J]. arXiv Preprint arXiv: , 2017. |
43 | FUJIMOTO S, HOOF H, MEGER D. Addressing function approximation error in actor-critic methods[C]. International Conference on Machine Learning. PMLR, 2018: 1587-1596. |
44 | 唐蕾, 刘广钟. 改进TD3算法在四旋翼无人机避障中的应用[J]. 计算机工程与应用, 2021, 57(11): 254-259. |
TANG L, LIU G Z. Application for improved TD3 algorithm in obstacle avoidance of quad-rotor UAV[J]. Computer Engineering and Applications, 2021, 57(11): 254-259. | |
45 | HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]. International Conference on Machine Learning. PMLR, 2018: 1861-1870. |
46 | 刘庆强, 刘鹏云. 基于优先级经验回放的SAC强化学习算法[J]. 吉林大学学报(信息科学版), 2021, 39(2): 192-199. |
LIU Q Q, LIU P Y. Soft actor critic reinforcement learning with prioritized experience replay[J]. Journal of Jilin University (Information Science Edition), 2021, 39(2): 192-199. | |
47 | XIONG X, WANG J, ZHANG F, et al. Combining deep reinforcement learning and safety based control for autonomous driving[J]. arXiv Preprint arXiv: , 2016. |
48 | SALLAB A E, ABDOU M, PEROT E, et al. End-to-end deep reinforcement learning for lane keeping assist[J]. arXiv Preprint arXiv:, 2016. |
49 | SAXENA D M, BAE S, NAKHAEI A, et al. Driving in dense traffic with model-free reinforcement learning[C]. 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020: 5385-5392. |
50 | HUANG Z, ZHANG J, TIAN R, et al. End-to-end autonomous driving decision based on deep reinforcement learning[C]. 2019 5th International Conference on Control, Automation and Robotics (ICCAR). IEEE, 2019: 658-662. |
51 | 朱冰, 蒋渊德, 赵健, 等. 基于深度强化学习的车辆跟驰控制[J]. 中国公路学报, 2019, 32(6): 53-60. |
ZHU B, JIANG Y D, ZHAO J, et al. A car-following control algorithm based on deep reinforcement learning[J]. China Journal of Highway and Transport, 2019, 32(6): 53-60. | |
52 | GAO Z H, SUN T J, XIAO H W, et al. Decision-making method for vehicle longitudinal automatic driving based on reinforcement Q-learning[J]. International Journal of Advanced Robotic Systems, 2019, 16(3): 1-13. |
53 | VECERIK M, HESTER T, SCHOLZ J, et al. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards[J]. arXiv Preprint arXiv: , 2017. |
54 | LIU H, HUANG Z, WU J, et al. Improved deep reinforcement learning with expert demonstrations for urban autonomous driving[C]. 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022: 921-928. |
55 | LI D, OKHRIN O. DDPG car-following model with real-world human driving experience in CARLA[J]. arXiv Preprint arXiv: , 2021. |
56 | TANG X, HUANG B, LIU T, et al. Highway decision-making and motion planning for autonomous driving via soft actor-critic[J]. IEEE Transactions on Vehicular Technology, 2022, 71(5): 4706-4717. |
57 | LIU X, LI H, WANG J, et al. Personalized automatic driving system based on reinforcement learning technology[C]. 2019 4th International Conference on Mechanical, Control and Computer Engineering (ICMCCE). IEEE, 2019: 373-3733. |
58 | LI G, YANG Y, LI S, et al. Decision making of autonomous vehicles in lane change scenarios: deep reinforcement learning approaches with risk awareness[J]. Transportation Research Part C: Emerging Technologies, 2022, 134: 103452. |
59 | WANG P, CHAN C Y. Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge[C]. 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017: 1-6. |
60 | LIN Y, MCPHEE J, AZAD N L. Anti-jerk on-ramp merging using deep reinforcement learning[C]. 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020: 7-14. |
61 | LI G, LIN S, LI S, et al. Learning automated driving in complex intersection scenarios based on camera sensors: a deep reinforcement learning approach[J]. IEEE Sensors Journal, 2022, 22(5): 4687-4696. |
62 | KARGAR E, KYRKI V. Vision transformer for learning driving policies in complex and dynamic environments[C]. 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022: 1558-1564. |
63 | KAMRAN D, LOPEZ C F, LAUER M, et al. Risk-aware high-level decisions for automated driving at occluded intersections with reinforcement learning[C]. 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020: 1205-1212. |
64 | POLYDOROS A S, NALPANTIDIS L. Survey of model-based reinforcement learning: applications on robotics[J]. Journal of Intelligent & Robotic Systems, 2017, 86(2): 153-173. |
65 | SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489. |
66 | PUCCETTI L, YASSER A, RATHGEBER C, et al. Speed tracking control using model-based reinforcement learning in a real vehicle[C]. 2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021: 1213-1219. |
67 | LEE H, KIM K, KIM N, et al. Energy efficient speed planning of electric vehicles for car-following scenario using model-based reinforcement learning[J]. Applied Energy, 2022, 313:118460. |
68 | ARADI S. Survey of deep reinforcement learning for motion planning of autonomous vehicles[J]. arXiv e-prints, 2020: arXiv: . |
69 | SEILER K M, KURNIAWATI H, SINGH S P N. An online and approximate solver for POMDPs with continuous action space[C]. 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015: 2290-2297. |
70 | BAI H, CAI S, YE N, et al. Intention-aware online POMDP planning for autonomous driving in a crowd[C]. 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015: 454-460. |
71 | HOEL C J, DRIGGS-CAMPBELL K, WOLFF K, et al. Combining planning and deep reinforcement learning in tactical decision making for autonomous driving[J]. IEEE Transactions on Intelligent Vehicles, 2019, 5(2): 294-305. |
72 | SHU K, YU H, CHEN X, et al. Autonomous driving at intersections: a behavior-oriented critical-turning-point approach for decision making[J]. IEEE/ASME Transactions on Mechatronics, 2021, 27(1): 234-244. |
73 | ARORA S, DOSHI P. A survey of inverse reinforcement learning: challenges, methods and progress[J]. Artificial Intelligence, 2021, 297: 103500. |
74 | BARTO A G, MAHADEVAN S. Recent advances in hierarchical reinforcement learning[J]. Discrete Event Dynamic Systems, 2003, 13(1): 41-77. |
75 | SCHWEIGHOFER N, DOYA K. Meta-learning in reinforcement learning[J]. Neural Networks, 2003, 16(1): 5-9. |
76 | LEVINE S, KUMAR A, TUCKER G, et al. Offline reinforcement learning: tutorial, review, and perspectives on open problems[J]. arXiv Preprint arXiv:, 2020. |
77 | WILSON A, FERN A, RAY S, et al. Multi-task reinforcement learning: a hierarchical bayesian approach[C]. Proceedings of the 24th International Conference on Machine Learning, 2007: 1015-1022. |
78 | GUAN Y, REN Y, MA H, et al. Learn collision-free self-driving skills at urban intersections with model-based reinforcement learning[C]. 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021: 3462-3469. |
79 | TAYLOR M E, STONE P. Transfer learning for reinforcement learning domains: a survey[J]. Journal of Machine Learning Research, 2009, 10(7): 1633-1685 |
80 | DONG D, CHEN C, LI H, et al. Quantum reinforcement learning[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2008, 38(5): 1207-1220. |
81 | BELLEMARE M G, DABNEY W, MUNOS R. A distributional perspective on reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2017: 449-458. |
82 | PARISOTTO E, SONG F, RAE J, et al. Stabilizing transformers for reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2020: 7487-7498. |
83 | AMODEI D, OLAH C, STEINHARDT J, et al. Concrete problems in AI safety[J]. arXiv Preprint arXiv:, 2016. |
84 | STRENS M. A Bayesian framework for reinforcement learning[C]. Proc. of the 17th Intl. Conf. on Machine Learning (ICML), 2000: 943-950. |
85 | PUIUTTA E, VEITH E. Explainable reinforcement learning: a survey[C]. International Cross-domain Conference for Machine Learning and Knowledge Extraction. Springer, Cham, 2020: 77-95. |
86 | RATLIFF N D, BAGNELL J A, ZINKEVICH M A. Maximum margin planning[C]. Proceedings of the 23rd International Conference on Machine Learning, 2006: 729-736. |
87 | ZIEBART B D, MAAS A L, BAGNELL J A, et al. Maximum entropy inverse reinforcement learning[C]. Aaai. 2008, 8: 1433-1438. |
88 | BOULARIAS A, KOBER J, PETERS J. Relative entropy inverse reinforcement learning[C]. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011: 182-189. |
89 | HO J, ERMON S. Generative adversarial imitation learning[J]. arXiv Preprint arXiv:, 2016. |
90 | YUAN M, PUN M O, CHEN Y. Hybrid adversarial inverse reinforcement learning[J]. arXiv Preprint arXiv:, 2021. |
91 | BACON P L, HARB J, PRECUP D. The option-critic architecture[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017:1726-1734. |
92 | KULKARNI T D, NARASIMHAN K R, SAEEDI A, et al. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation[C]. Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016: 3682-3690. |
93 | TESSLER C, GIVONY S, ZAHAVY T, et al. A deep hierarchical approach to lifelong learning in minecraft[J]. arXiv Preprint arXiv:, 2016. |
94 | SHI H, SUN Y, LI G. Intermittent control with reinforcement learning[C]. 2017 International Conference on Progress in Informatics and Computing (PIC). IEEE, 2017: 56-60. |
95 | VEZHNEVETS A S, OSINDERO S, SCHAUL T, et al. Feudal networks for hierarchical reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2017: 3540-3549. |
96 | ANDRYCHOWICZ M, WOLSKI F, RAY A, et al. Hindsight experience replay[J]. arXiv Preprint arXiv:, 2017. |
97 | NACHUM O, GU S, LEE H, et al. Data-efficient hierarchical reinforcement learning[C]. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018: 3307-3317. |
98 | LEVY A, KONIDARIS G, PLATT R, et al. Learning multi-level hierarchies with hindsight[J]. arXiv Preprint arXiv:, 2017. |
99 | SUKHBAATAR S, DENTON E, SZLAM A, et al. Learning goal embeddings via self-play for hierarchical reinforcement learning[J]. arXiv Preprint arXiv:, 2018. |
100 | EYSENBACH B, GUPTA A, IBARZ J, et al. Diversity is all you need: learning skills without a reward function[J]. arXiv Preprint arXiv: , 2018. |
101 | SHARMA A, GU S, LEVINE S, et al. Dynamics-aware unsupervised discovery of skills[J]. arXiv Preprint arXiv:, 2019. |
102 | SCHWEIGHOFER N, DOYA K. Meta-learning in reinforcement learning[J]. Neural Networks, 2003, 16(1):5-9. |
103 | DUAN Y, SCHULMAN J, CHEN X, et al. RL$^2$: fast reinforcement learning via slow reinforcement learning[J]. arXiv Preprint arXiv:, 2016. |
104 | MISHRA N, ROHANINEJAD M, CHEN X, et al. A simple neural attentive meta-learner[J]. arXiv Preprint arXiv: , 2017. |
105 | KIRSCH L, VAN STEENKISTE S, SCHMIDHUBER J. Improving generalization in meta reinforcement learning using learned objectives[J]. arXiv Preprint arXiv:, 2019. |
106 | OH J, HESSEL M, CZARNECKI W M, et al. Discovering reinforcement learning algorithms[J]. Advances in Neural Information Processing Systems, 2020, 33: 1060-1070. |
107 | FINN C, ABBEEL P, LEVINE S. Model-agnostic meta-learning for fast adaptation of deep networks[C]. International Conference on Machine Learning. PMLR, 2017: 1126-1135. |
108 | NICHOL A, ACHIAM J, SCHULMAN J. On first-order meta-learning algorithms[J]. arXiv Preprint arXiv: , 2018. |
109 | HOUTHOOFT R, CHEN R Y, ISOLA P, et al. Evolved policy gradients[C]. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018: 5405-5414. |
110 | RAKELLY K, ZHOU A, FINN C, et al. Efficient off-policy meta-reinforcement learning via probabilistic context variables[C]. International Conference on Machine Learning. PMLR, 2019: 5331-5340. |
111 | FAKOOR R, CHAUDHARI P, SOATTO S, et al. Meta-q-learning[J]. arXiv Preprint arXiv: , 2019. |
112 | KUMAR A, FU J, TUCKER G, et al. Stabilizing off-policy Q-learning via bootstrapping error reduction[C]. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 11784-11794. |
113 | WU Y, TUCKER G, NACHUM O. Behavior regularized offline reinforcement learning[J]. arXiv Preprint arXiv:, 2019. |
114 | AGARWAL R, SCHUURMANS D, NOROUZI M. An optimistic perspective on offline reinforcement learning[C]. International Conference on Machine Learning. PMLR, 2020: 104-114. |
115 | KUMAR A, ZHOU A, TUCKER G, et al. Conservative q-learning for offline reinforcement learning[J]. Advances in Neural Information Processing Systems, 2020, 33: 1179-1191. |
116 | KOSTRIKOV I, NAIR A, LEVINE S. Offline reinforcement learning with implicit Q-learning[J]. arXiv Preprint arXiv:, 2021. |
117 | BRANDFONBRENER D, WHITNEY W, RANGANATH R, et al. Offline rl without off-policy evaluation[J]. Advances in Neural Information Processing Systems, 2021, 34: 4933-4946. |
118 | PENG X B, KUMAR A, ZHANG G, et al. Advantage-weighted regression: simple and scalable off-policy reinforcement learning[J]. arXiv Preprint arXiv:, 2019. |
119 | NAIR A, DALAL M, GUPTA A, et al. Accelerating online reinforcement learning with offline datasets[J]. arXiv Preprint arXiv:, 2020. |
120 | KUMAR A, FU J, TUCKER G, et al. Stabilizing off-policy Q-learning via bootstrapping error reduction[J]. arXiv e-prints, 2019: arXiv: . |
121 | KUMAR A, ZHOU A, TUCKER G, et al. Conservative q-learning for offline reinforcement learning[J].Advances in Neural Information Processing Systems, 2020, 33: 1179-1191. |
122 | FERNANDO C, BANARSE D, BLUNDELL C, et al. PathNet: evolution channels gradient descent in super neural networks[J]. arXiv Preprint arXiv:1701.08734, 2017. |
123 | TEH Y W, BAPST V, CZARNECKI W M, et al. Distral: robust multitask reinforcement learning[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 4499-4509. |
124 | HESSEL M, SOYER H, ESPEHOLT L, et al. Multi-task deep reinforcement learning with PopArt[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 3796-3803. |
125 | TEH Y W, BAPST V, CZARNECKI W M, et al. Distral: robust multitask reinforcement learning[J]. arXiv Preprint arXiv:, 2017. |
126 | ESPEHOLT L, SOYER H, MUNOS R, et al. Impala: scalable distributed deep-rl with importance weighted actor-learner architectures[C]. International Conference on Machine Learning. PMLR, 2018: 1407-1416. |
127 | MEREL J, TASSA Y, TB D, et al. Learning human behaviors from motion capture by adversarial imitation[J]. arXiv Preprint arXiv:, 2017. |
128 | KUEFLER A, MORTON J, WHEELER T, et al. Imitating driver behavior with generative adversarial networks[C]. 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017: 204-211. |
129 | YOU C, LU J, FILEV D, et al. Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning[J]. Robotics and Autonomous Systems, 2019, 114: 1-18. |
130 | WANG P, LI H, CHAN C Y. Meta-adversarial inverse reinforcement learning for decision-making tasks[C]. 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 12632-12638. |
131 | LIU J, BOYLE L N, BANERJEE A G. An inverse reinforcement learning approach for customizing automated lane change systems[J]. IEEE Transactions on Vehicular Technology, 2022, 71(9): 9261-9271. |
132 | CHEN J, WANG Z, TOMIZUKA M. Deep hierarchical reinforcement learning for autonomous driving with distinct behaviors[C]. 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018: 1239-1244. |
133 | CHEN Y, DONG C, PALANISAMY P, et al. Attention-based hierarchical deep reinforcement learning for lane change behaviors in autonomous driving [C]. RSJ International Conference on Intelligent Robots and Systems (IROS), 2019: 3697-3703. |
134 | DUAN J, EBEN LI S, GUAN Y, et al. Hierarchical reinforcement learning for self‐driving decision‐making without reliance on labelled driving data[J]. IET Intelligent Transport Systems, 2020, 14(5): 297-305. |
135 | 吕超, 鲁洪良, 于洋, 等. 基于分层强化学习和社会偏好的自主超车决策系统[J]. 中国公路学报, 2022, 35(3): 115-126. |
LV C, LU H L, YU Y, et al. Autonomous overtaking decision making system based on hierarchical reinforcement learning and social preferences[J]. China Journal of Highway and Transport,2022, 35(3): 115-126. | |
136 | QIAO Z, TYREE Z, MUDALIGE P, et al. Hierarchical reinforcement learning method for autonomous vehicle behavior planning[C]. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 6084-6089. |
137 | LUBARS J, GUPTA H, CHINCHALI S, et al. Combining reinforcement learning with model predictive control for on-ramp merging[C]. 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021: 942-947. |
138 | BAI Z, HAO P, SHANGGUAN W, et al. Hybrid reinforcement learning-based eco-driving strategy for connected and automated vehicles at signalized intersections[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(9): 15850-15863. |
139 | SHI T, WANG P, CHENG X, et al. Driving decision and control for autonomous lane change based on deep reinforcement learning[J]. arXiv Preprint arXiv: , 2019. |
140 | NAVEED K B, QIAO Z, DOLAN J M. Trajectory planning for autonomous vehicles using hierarchical reinforcement learning[C]. 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021: 601-606. |
141 | YAVAS U, KUMBASAR T, URE N K. Model-based reinforcement learning for advanced adaptive cruise control: a hybrid car following policy[C]. 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022: 1466-1472. |
142 | JIANG J, REN Y, GUAN Y, et al. Integrated decision and control at multi-lane intersections with mixed traffic flow[C]. Journal of Physics: Conference Series. IOP Publishing, 2022, 2234(1): 012015. |
143 | 李升波. 混合型强化学习及其高级别自动驾驶应用[R]. 北京: 北京智源大会自动驾驶论坛, 2022. |
LI S B. Mixed reinforcement learning and its application in autonomous vehicle[R]. Beijing: BAAI Conference, 2022. | |
144 | 李克强, 常雪阳, 李家文,等. 智能网联汽车云控系统及其实现[J]. 汽车工程, 2020, 42(12): 1595-1605. |
LI K Q, CHANG X Y, LI J W, et al. Cloud control system for intelligent and connected vehicles and its application[J]. Automotive Engineering, 2020, 42(12): 1595-1605. |