UniVoxel: A Novel Framework for 3-D Object Detection in Autonomous Vehicles With Multimodal Voxel Representation

Kaiqi Liu; Yuanyuan Deng; Jiaxun Tong; Wei Li

doi:10.1109/JSEN.2025.3589494

Corresponding author: Jiaxun Tong

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Fusing camera and LiDAR information is one of the effective means for achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. The UniVoxel integrates inputs from various modalities into the voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which the low-level voxel features are utilized to adaptively sample multiscale image features, producing unified multimodal voxel features. The multimodal voxel features contain both the geometric and semantic information of the voxels and can ensure that the model focuses on the regions of interest (RoIs). Meanwhile, for maximizing the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels through interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the help of the proposed framework, the precision of the object detection has been improved both on the validation set and the test set.

Original language	English
Pages (from-to)	33142-33152
Number of pages	11
Journal	IEEE Sensors Journal
Volume	25
Issue	17
DOI	http://doi.org/10.1109/JSEN.2025.3589494
Publication status	Published - 2025
Externally published	Yes

Cite this

@article{9585ec8bb1ce4ef2963892a04c94db09,

title = "UniVoxel: A Novel Framework for 3-D Object Detection in Autonomous Vehicles With Multimodal Voxel Representation",

abstract = "Fusing camera and LiDAR information is one of the effective means for achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. The UniVoxel integrates inputs from various modalities into the voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which the low-level voxel features are utilized to adaptively sample multiscale image features, producing unified multimodal voxel features. The multimodal voxel features contain both the geometric and semantic information of the voxels and can ensure that the model focuses on the regions of interest (RoIs). Meanwhile, for maximizing the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels through interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the help of the proposed framework, the precision of the object detection has been improved both on the validation set and the test set.",

keywords = "3-D object detection, deformable attention, multimodal backbone",

author = "Kaiqi Liu and Yuanyuan Deng and Jiaxun Tong and Wei Li",

note = "Publisher Copyright: {\textcopyright} 2001-2012 IEEE.",

year = "2025",

doi = "10.1109/JSEN.2025.3589494",

language = "English",

volume = "25",

pages = "33142--33152",

journal = "IEEE Sensors Journal",

issn = "1530-437X",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "17",

}

TY - JOUR

T1 - UniVoxel

T2 - A Novel Framework for 3-D Object Detection in Autonomous Vehicles With Multimodal Voxel Representation

AU - Liu, Kaiqi

AU - Deng, Yuanyuan

AU - Tong, Jiaxun

AU - Li, Wei

PY - 2025

Y1 - 2025

N2 - Fusing camera and LiDAR information is one of the effective means for achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. The UniVoxel integrates inputs from various modalities into the voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which the low-level voxel features are utilized to adaptively sample multiscale image features, producing unified multimodal voxel features. The multimodal voxel features contain both the geometric and semantic information of the voxels and can ensure that the model focuses on the regions of interest (RoIs). Meanwhile, for maximizing the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels through interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the help of the proposed framework, the precision of the object detection has been improved both on the validation set and the test set.

AB - Fusing camera and LiDAR information is one of the effective means for achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. The UniVoxel integrates inputs from various modalities into the voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which the low-level voxel features are utilized to adaptively sample multiscale image features, producing unified multimodal voxel features. The multimodal voxel features contain both the geometric and semantic information of the voxels and can ensure that the model focuses on the regions of interest (RoIs). Meanwhile, for maximizing the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels through interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the help of the proposed framework, the precision of the object detection has been improved both on the validation set and the test set.

KW - 3-D object detection

KW - deformable attention

KW - multimodal backbone

UR - http://www.scopus.com/pages/publications/105012282802

U2 - 10.1109/JSEN.2025.3589494

DO - 10.1109/JSEN.2025.3589494

M3 - Article

AN - SCOPUS:105012282802

SN - 1530-437X

VL - 25

SP - 33142

EP - 33152

JO - IEEE Sensors Journal

JF - IEEE Sensors Journal

IS - 17

ER -
