Assessing the Reliability of Vision-Language Models for Autonomous Driving

Are Vision-Language Models (VLMs) Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Vision-Language Models (VLMs) have made remarkable progress in recent years and are increasingly being used across a wide range of fields. In the context of autonomous driving, VLMs promise interpretable driving decisions expressed in natural language, offering a clearer view into the decision-making process of autonomous vehicles and potentially greater trust in the technology. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving has remained largely untested.

To address this gap, the researchers introduced DriveBench, a benchmark for evaluating the reliability of VLMs under diverse conditions. DriveBench comprises 19,200 frames and 20,498 question-answer pairs spanning three question types and four mainstream driving tasks, and it is used to evaluate 12 popular VLMs. The data cover 17 input settings, including clean, corrupted, and text-only inputs, to test the models' robustness against realistic challenges.
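To make the evaluation protocol concrete, the following minimal sketch shows how samples from such a benchmark might be collected and grouped by input setting for later scoring. The DriveSample fields and the query_vlm stub are illustrative assumptions, not DriveBench's actual API.

```python
# Minimal sketch of a DriveBench-style evaluation loop. The sample schema
# and the query_vlm stub are illustrative assumptions, not the benchmark's
# actual interface.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DriveSample:
    frame_path: Optional[str]   # None in the text-only setting
    question: str
    reference_answer: str
    setting: str                # e.g. "clean", "gaussian_noise", "text-only"

def query_vlm(sample: DriveSample) -> str:
    """Placeholder for a call to the VLM under test."""
    # A real harness would send the image (if any) and question to the model.
    return "model answer"

def collect_predictions(samples: list[DriveSample]) -> dict[str, list[tuple[str, str]]]:
    """Group (prediction, reference) pairs by input setting for later scoring."""
    results: dict[str, list[tuple[str, str]]] = {}
    for sample in samples:
        results.setdefault(sample.setting, []).append(
            (query_vlm(sample), sample.reference_answer)
        )
    return results
```

Grouping by setting makes it straightforward to compare, per task, how far performance drops from clean to corrupted to text-only inputs.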

Results of the Study

The results of the study show that VLMs frequently generate plausible answers from general knowledge or textual cues rather than genuine visual grounding, especially when the visual input is degraded or missing. This behavior, masked by dataset imbalances and inadequate evaluation metrics, poses significant risks in safety-critical scenarios such as autonomous driving.

Furthermore, the study shows that VLMs struggle with multimodal reasoning and are highly sensitive to input corruptions, which leads to inconsistent performance. This suggests that current VLMs are not yet robust enough to reliably handle the complex and unpredictable conditions of real-world traffic.
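For illustration, one common visual corruption, additive Gaussian noise, can be applied at graded severities; the five-level scale below is an assumption borrowed from standard corruption benchmarks, not necessarily the exact scheme used in the study.

```python
# Illustrative image corruption: additive Gaussian noise at a graded
# severity. The severity-to-sigma mapping is an assumed example.
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 3) -> np.ndarray:
    """Add zero-mean Gaussian noise to a uint8 RGB image (severity 1-5)."""
    sigma = [8, 16, 24, 32, 48][severity - 1]   # noise std per severity level
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)   # back to valid pixel range
```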

Improved Evaluation Metrics and Future Research

To address these challenges, the study proposed refined evaluation metrics that prioritize robust visual grounding and multimodal understanding. These metrics are intended to enable a more accurate assessment of the actual capabilities of VLMs in the context of autonomous driving.
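In that spirit, one simple probe of visual grounding is to check whether a model's answers change once the image is removed: an answer that survives without any visual input is likely driven by priors or textual cues rather than by the scene. The heuristic below is an illustrative sketch, not the metric proposed in the paper.

```python
# Illustrative grounding probe: compare answers on clean inputs with answers
# to the same questions when the image is withheld. This heuristic is an
# assumption for exposition, not the study's proposed metric.
def grounding_score(answers_clean: list[str], answers_no_image: list[str]) -> float:
    """Fraction of questions whose answer changes when the image is removed.

    Higher is better: a visually grounded model should answer differently,
    or abstain, once it can no longer see the scene.
    """
    if len(answers_clean) != len(answers_no_image) or not answers_clean:
        raise ValueError("expected two equally sized, non-empty answer lists")
    changed = sum(a != b for a, b in zip(answers_clean, answers_no_image))
    return changed / len(answers_clean)
```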

Additionally, the potential of leveraging VLMs' awareness of corruptions to improve their reliability was highlighted. This could be achieved by specifically training the models on corrupted data to increase their robustness against realistic conditions.
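A hypothetical sketch of such corruption-augmented training data is shown below: each clean frame is paired with corrupted copies, each labelled with the corruption's name, so a model can be fine-tuned to recognize and name degraded inputs. The record schema and corruption registry are assumptions, not the study's training recipe.

```python
# Hypothetical corruption-augmented fine-tuning records; the schema and the
# corruption registry are illustrative assumptions.
import numpy as np
from typing import Callable

CorruptionFn = Callable[[np.ndarray], np.ndarray]

def build_training_records(
    frame: np.ndarray,
    question: str,
    answer: str,
    corruptions: dict[str, CorruptionFn],
) -> list[dict]:
    """Pair one clean frame with labelled corrupted copies of itself."""
    records = [
        {"image": frame, "question": question, "answer": answer, "corruption": "none"}
    ]
    for name, corrupt in corruptions.items():
        records.append(
            {"image": corrupt(frame), "question": question,
             "answer": answer, "corruption": name}
        )
    return records
```

The registry could map names such as "gaussian_noise" to functions like the one sketched earlier, letting the same pipeline cover many corruption types.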

Conclusion

The study provides important insights into the reliability of VLMs in the context of autonomous driving. The results show that current models are not yet sufficiently robust and visually grounded to be deployed in safety-critical scenarios. However, the proposed improvements to evaluation metrics and further research on VLMs' awareness of corruptions offer promising approaches for developing more reliable and interpretable decision-making systems for autonomous driving.

Mindverse, as a German provider of AI-powered content solutions, offers a wide range of tools and services for creating texts, images, and research. Furthermore, Mindverse develops customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems. The research results presented here underscore the importance of careful evaluation and further development of AI models, particularly regarding their application in safety-critical areas like autonomous driving.

Bibliography:

- Xie, S. et al. (2025). Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives. arXiv preprint arXiv:2501.04003.
- DriveBench: https://drive-bench.github.io/
- PaperReading: https://paperreading.club/page?id=277163
- ChatPaper: https://www.chatpaper.com/chatpaper/zh-CN/paper/96684
- arXiv Sanity Lite: https://arxiv-sanity-lite.com/?rank=pid&pid=2501.04003
- Li, L. et al. (2024). Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies. arXiv preprint arXiv:2401.12888v2.
- ResearchGate: https://www.researchgate.net/publication/385108014_Large_Language_Models_for_Autonomous_Driving_LLM4AD_Concept_Benchmark_Simulation_and_Real-Vehicle_Experiment
- ResearchGate: https://www.researchgate.net/publication/380653076_Vision_Language_Models_in_Autonomous_Driving_A_Survey_and_Outlook
- Xu, Z. et al. (2021). Reliability of GAN Generated Data to Train and Validate Perception. In WACV (pp. 1018-1027).
- McKinsey: https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/autonomous-drivings-future-convenient-and-connected