Abstract
The advances in vision-language models (VLMs) have led to growing interest in leveraging their strong reasoning capabilities for autonomous driving. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to how human drivers consider alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
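The counterfactual annotation idea described in the abstract can be illustrated with a minimal sketch: roll out a candidate ego trajectory, check it against the drivable area and other agents' predicted motion, and convert the outcome into a language supervision pair. Everything below is a hypothetical illustration, not the paper's actual pipeline; the function names, the 2.0 m proximity threshold, and the 0.5 s waypoint spacing are assumptions.

```python
import numpy as np

def simulate_counterfactual(ego_traj, agent_trajs, drivable_mask, resolution=0.5):
    """Hypothetical check: does a candidate ego trajectory leave the
    drivable area or pass too close to another agent's predicted path?
    ego_traj: list of (x, y) waypoints; agent_trajs: dict of id -> waypoints;
    drivable_mask: boolean BEV grid at `resolution` meters per cell."""
    events = []
    for t, (x, y) in enumerate(ego_traj):
        # Rasterize the ego position into the BEV drivable-area mask.
        i, j = int(y / resolution), int(x / resolution)
        inside = 0 <= i < drivable_mask.shape[0] and 0 <= j < drivable_mask.shape[1]
        if not inside or not drivable_mask[i, j]:
            events.append((t, "leaves the drivable area"))
            break
        # Proximity to every other agent at the same timestep (assumed 2.0 m threshold).
        for agent_id, traj in agent_trajs.items():
            if t < len(traj) and np.hypot(x - traj[t][0], y - traj[t][1]) < 2.0:
                events.append((t, f"collides with agent {agent_id}"))
                break
    return events

def to_qa_pair(action_name, events, dt=0.5):
    """Turn the simulated outcome into a counterfactual Q&A annotation
    (assumes dt seconds between consecutive waypoints)."""
    question = f"What would happen if the ego vehicle {action_name}?"
    if not events:
        answer = f"If the ego vehicle {action_name}, the maneuver is safe."
    else:
        t, outcome = events[0]
        answer = f"If the ego vehicle {action_name}, it {outcome} after about {t * dt:.1f} s."
    return question, answer
```

For example, calling `to_qa_pair("changes to the left lane", events)` on a rollout that intersects a neighboring vehicle would yield a question/answer pair that supervises both the rejected trajectory and its language-based explanation.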
| | |
|---|---|
| Original language | English |
| Pages (from-to) | 22442-22452 |
| Number of pages | 11 |
| Journal | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States, 11 Jun 2025 → 15 Jun 2025 |
Keywords
- VLM
- autonomous driving
- dataset