The development of autonomous driving is progressing rapidly, and new approaches continually promise improved performance and safety. One promising approach is the use of diffusion models, which have proven to be a powerful generative method for modeling multimodal action distributions in robotics. A research team from Huazhong University of Science & Technology and Horizon Robotics has now introduced DiffusionDrive, a new model that leverages the advantages of truncated diffusion models for autonomous driving.
Conventional end-to-end driving models typically regress a single trajectory, ignoring the inherent uncertainty and multimodality of driving behavior. Existing diffusion models, meanwhile, require numerous denoising steps, which incurs significant computational overhead and limits real-time use in dynamic traffic environments. A further problem is so-called mode collapse, where different random noise samples converge to nearly identical trajectories, limiting the diversity of the generated driving actions.
DiffusionDrive aims to address these challenges with a model that generates diverse, high-quality driving actions in real time. The motivation is to improve how the driving model interacts with the conditional scene context, enabling better trajectory reconstruction and planning in complex environments.
DiffusionDrive employs a novel truncated diffusion policy that incorporates prior multimodal anchors and shortens the diffusion schedule. This allows the model to learn denoising from an anchored Gaussian distribution to a multimodal driving action distribution. An efficient, Transformer-based diffusion decoder enhances interaction with the conditional scene context through a sparse deformable attention mechanism, which optimizes trajectory reconstruction.
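To make the idea concrete, here is a minimal PyTorch sketch of such a decoder. It is not the paper's implementation: the class name, the dimensions, and the use of plain multi-head cross-attention in place of sparse deformable attention are all simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class DiffusionDecoder(nn.Module):
    """Simplified stand-in for DiffusionDrive's Transformer diffusion decoder.

    The paper uses sparse deformable attention over perception features;
    ordinary cross-attention is used here to keep the sketch short.
    """
    def __init__(self, dim=256, horizon=8, max_steps=20):
        super().__init__()
        self.horizon = horizon
        self.traj_embed = nn.Linear(horizon * 2, dim)    # flatten (x, y) waypoints
        self.time_embed = nn.Embedding(max_steps, dim)   # diffusion timestep embedding
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, horizon * 2)          # regress the denoised trajectory

    def forward(self, noisy_traj, t, scene_features):
        # noisy_traj:     (B, M, horizon, 2), one noisy trajectory per anchor mode
        # t:              (B,) current diffusion timestep
        # scene_features: (B, N, dim) conditional scene context tokens
        B, M = noisy_traj.shape[:2]
        q = self.traj_embed(noisy_traj.flatten(2)) + self.time_embed(t)[:, None]
        ctx, _ = self.cross_attn(q, scene_features, scene_features)
        return self.head(q + ctx).view(B, M, self.horizon, 2)
```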
Unlike conventional diffusion policies, which denoise actions from random Gaussian noise conditioned on the scene context, DiffusionDrive starts from established driving patterns that are dynamically adapted to the current traffic situation. Embedding these prior driving patterns in the diffusion policy and truncating the denoising process reduces the number of required steps from 20 to just 2, a significant acceleration that meets the real-time requirements of autonomous driving.
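The truncated reverse process can then be sketched as follows, reusing the decoder above. The noise scales in `sigmas` and the DDIM-style update rule are illustrative assumptions, not the paper's exact schedule; the key point is that sampling starts from a Gaussian centered on the anchors and only two decoder passes are needed.

```python
def truncated_denoise(decoder, anchors, scene_features, sigmas=(0.5, 0.1)):
    """Two-step truncated reverse diffusion around prior trajectory anchors.

    anchors:        (B, M, horizon, 2) prior multimodal trajectory anchors
    scene_features: (B, N, dim) conditional scene context
    sigmas:         illustrative noise scales for the truncated schedule
    """
    # Start from an anchored Gaussian instead of pure random noise.
    traj = anchors + sigmas[0] * torch.randn_like(anchors)
    for step in range(len(sigmas)):
        t = torch.full((traj.shape[0],), step, dtype=torch.long, device=traj.device)
        denoised = decoder(traj, t, scene_features)
        # Re-noise to the next, smaller level (DDIM-style); skip after the last step.
        next_sigma = sigmas[step + 1] if step + 1 < len(sigmas) else 0.0
        traj = denoised + next_sigma * torch.randn_like(denoised)
    return traj  # (B, M, horizon, 2) candidate plans, one per anchor mode

# Toy usage with random tensors standing in for real perception outputs:
decoder = DiffusionDecoder(dim=256, horizon=8)
anchors = torch.randn(1, 20, 8, 2)   # e.g. 20 anchor trajectories from clustering
scene = torch.randn(1, 100, 256)     # hypothetical scene tokens
plans = truncated_denoise(decoder, anchors, scene)  # two decoder calls in total
```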
The model was evaluated on the planning-oriented NAVSIM benchmark and the nuScenes dataset. On NAVSIM, DiffusionDrive with a ResNet-34 backbone achieved a PDMS of 88.1 at a real-time speed of 45 FPS on an NVIDIA 4090. Compared with other state-of-the-art methods, it showed significantly improved planning performance, lower collision rates, and lower L2 errors. On nuScenes, DiffusionDrive ran 1.8 times faster than VAD while achieving a 20.8% lower L2 error and a 63.6% lower collision rate with the same ResNet-50 backbone.
DiffusionDrive represents a promising approach to autonomous driving. By leveraging truncated diffusion, it enables efficient and robust trajectory planning in real time. The experimental results on NAVSIM and nuScenes demonstrate the model's performance and suggest considerable potential for future developments in the field. In particular, the ability to generate diverse and plausible trajectories contributes to safety and robustness in complex traffic situations.