Deep Multi Agent Reinforcement Learning for Autonomous Driving

Bhalla, Sushrut

Date

2020-04-29

Advisor

Crowley, Mark

Publisher

University of Waterloo

Abstract

Deep learning and back-propagation have been used successfully to perform centralized training, with communication protocols among multiple agents, in cooperative Multi-Agent Deep Reinforcement Learning (MARL) environments. In this work, I present techniques for centralized training of MARL agents in large-scale environments and compare them against current state-of-the-art techniques. This work uses a model-free Deep Q-Network (DQN) as the baseline model and allows inter-agent communication for cooperative policy learning. I present two novel, scalable, centralized MARL training techniques (MA-MeSN and MA-BoN), which are designed on the principle that the behavior policy and the message (communication) policy have different optimization criteria. Accordingly, these models separate the message learning module from the behavior policy learning module. As the experiments show, this separation leads to faster convergence in complex domains such as autonomous driving simulators and achieves better results than current techniques in the literature. Subsequently, this work presents two novel techniques for achieving decentralized execution of the communication-based cooperative policy. The first uses behavior cloning to transfer an expert cooperative policy to a decentralized agent that does not share messages. In the second, the behavior policy is coupled with a memory module that is local to each agent's model; the independent agents use this memory module to mimic the communication policies of other agents and thus generate an independent behavior policy. This decentralized approach causes minimal degradation of the overall cumulative reward achieved by the centralized policy. A fully decentralized approach also allows us to address the challenges of noise and communication bottlenecks in real-time communication channels.
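Behavior cloning, as used in the first decentralization technique, is ordinary supervised learning. You record (local observation, action) pairs from the centralized, communicating expert, then fit a policy that maps the local observation alone to the expert's action. The sketch below is a minimal illustration under stated assumptions. In place of the neural network a DQN-based agent would use, it substitutes a deliberately simple tabular learner that takes a majority vote per discretized observation. The names `clone_policy` and `act`, the observation encoding, and the toy demonstrations are all hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

Observation = Hashable  # hypothetical: a discretized local observation
Action = int


def clone_policy(demos: List[Tuple[Observation, Action]]) -> Dict[Observation, Action]:
    """Fit a decentralized policy to expert (local observation, action) pairs.

    Each observation maps to the action the expert chose most often there;
    ties are broken by the smallest action index.
    """
    counts: Dict[Observation, Counter] = defaultdict(Counter)
    for obs, action in demos:
        counts[obs][action] += 1
    policy: Dict[Observation, Action] = {}
    for obs, counter in counts.items():
        best = max(counter.values())
        policy[obs] = min(a for a, c in counter.items() if c == best)
    return policy


def act(policy: Dict[Observation, Action], obs: Observation, default_action: Action = 0) -> Action:
    # Decentralized execution: only the agent's own observation is used, no messages.
    return policy.get(obs, default_action)


# Hypothetical demonstrations recorded from a communicating expert.
# Observations are (lane, gap) tuples; actions: 0 = keep, 1 = brake, 2 = change lane.
demos = [((0, "near"), 1), ((0, "near"), 1), ((0, "near"), 2), ((1, "far"), 0)]
student = clone_policy(demos)
print(act(student, (0, "near")))  # 1
print(act(student, (2, "far")))   # 0: unseen observation falls back to the default
```

In a real system the lookup table would be replaced by a network trained with a classification loss on the same expert data. The key property is unchanged: at execution time the student needs no messages from other agents.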
In this work, I theoretically and empirically compare the centralized and decentralized training algorithms with current research in MARL. As part of this thesis, I also developed a large-scale multi-agent testing environment: a new OpenAI Gym environment, suitable for large-scale multi-agent research, that simulates multiple autonomous cars driving cooperatively on a highway in the presence of a bad actor. I compare the performance of the centralized algorithms with existing state-of-the-art algorithms, such as DIAL and IMS, on the basis of cumulative reward achieved per episode and other metrics. MA-MeSN and MA-BoN achieve a cumulative reward at least 263% higher than that achieved by DIAL and IMS. I also present an ablation study of the scalability of MA-BoN and show that the MA-MeSN and MA-BoN algorithms exhibit only a linear increase in inference time and number of trainable parameters as the number of agents grows, compared with a quadratic increase for DIAL.
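The thesis environment itself is not reproduced here. As a rough illustration of the multi-agent interface such an environment exposes, the following stdlib-only sketch follows the classic OpenAI Gym convention. In that convention, `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`; here it is extended to one observation and one reward per cooperative agent. Everything else is a hypothetical assumption rather than the thesis environment: the class name `ToyHighwayEnv`, the single-lane dynamics, the bad actor's random braking, and the reward shaping.

```python
import random
from typing import Dict, List, Tuple

Obs = Tuple[float, float]  # (own speed, gap to the car directly ahead)


class ToyHighwayEnv:
    """Hypothetical single-lane highway with N cooperative cars and one bad actor.

    Index 0 is the bad actor, which leads the platoon and brakes at random.
    Indices 1..N are cooperative agents. Actions: 0 = keep, 1 = accelerate, 2 = brake.
    """

    def __init__(self, n_agents: int = 3, max_steps: int = 50, seed: int = 0):
        self.n_agents = n_agents
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def _observe(self) -> List[Obs]:
        # Each agent sees only its own speed and the gap to the car ahead.
        return [(self.speeds[i], self.positions[i - 1] - self.positions[i])
                for i in range(1, self.n_agents + 1)]

    def reset(self) -> List[Obs]:
        self.t = 0
        # Cars start 10 units apart at speed 1.0; the bad actor is in front.
        self.positions = [10.0 * (self.n_agents - i) for i in range(self.n_agents + 1)]
        self.speeds = [1.0] * (self.n_agents + 1)
        return self._observe()

    def step(self, actions: List[int]) -> Tuple[List[Obs], List[float], bool, Dict]:
        assert len(actions) == self.n_agents, "one action per cooperative agent"
        # Bad actor: stops with probability 0.2, otherwise cruises at 1.0.
        self.speeds[0] = 0.0 if self.rng.random() < 0.2 else 1.0
        for i, a in enumerate(actions, start=1):
            if a == 1:
                self.speeds[i] = min(self.speeds[i] + 0.5, 2.0)
            elif a == 2:
                self.speeds[i] = max(self.speeds[i] - 0.5, 0.0)
        # All cars move simultaneously.
        self.positions = [p + v for p, v in zip(self.positions, self.speeds)]
        self.t += 1
        rewards: List[float] = []
        crashed = False
        for i in range(1, self.n_agents + 1):
            if self.positions[i - 1] - self.positions[i] <= 0:
                rewards.append(-10.0)  # collision with the car ahead
                crashed = True
            else:
                rewards.append(self.speeds[i])  # reward forward progress
        done = crashed or self.t >= self.max_steps
        return self._observe(), rewards, done, {"t": self.t}


# Roll out one episode with a random (non-cooperative) policy.
env = ToyHighwayEnv(n_agents=3, seed=42)
policy_rng = random.Random(1)
obs = env.reset()
done, total = False, 0.0
while not done:
    actions = [policy_rng.choice([0, 1, 2]) for _ in range(env.n_agents)]
    obs, rewards, done, info = env.step(actions)
    total += sum(rewards)
print(f"episode ended at t={info['t']}, cumulative reward={total:.1f}")
```

Summing per-agent rewards over an episode, as in the rollout loop, is one simple way to obtain a cumulative reward per episode of the kind used to compare algorithms.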