TY - JOUR
T1 - UniVoxel
T2 - A Novel Framework for 3-D Object Detection in Autonomous Vehicles With Multimodal Voxel Representation
AU - Liu, Kaiqi
AU - Deng, Yuanyuan
AU - Tong, Jiaxun
AU - Li, Wei
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Fusing camera and LiDAR information is an effective means of achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. UniVoxel integrates inputs from various modalities into a voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which low-level voxel features are used to adaptively sample multiscale image features, producing unified multimodal voxel features. These multimodal voxel features contain both the geometric and semantic information of the voxels and ensure that the model focuses on the regions of interest (RoIs). Meanwhile, to maximize the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels by interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the proposed framework, object detection precision is improved on both the validation set and the test set.
AB - Fusing camera and LiDAR information is an effective means of achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. UniVoxel integrates inputs from various modalities into a voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which low-level voxel features are used to adaptively sample multiscale image features, producing unified multimodal voxel features. These multimodal voxel features contain both the geometric and semantic information of the voxels and ensure that the model focuses on the regions of interest (RoIs). Meanwhile, to maximize the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels by interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the proposed framework, object detection precision is improved on both the validation set and the test set.
KW - 3-D object detection
KW - deformable attention
KW - multimodal backbone
UR - http://www.scopus.com/pages/publications/105012282802
U2 - 10.1109/JSEN.2025.3589494
DO - 10.1109/JSEN.2025.3589494
M3 - Article
AN - SCOPUS:105012282802
SN - 1530-437X
VL - 25
SP - 33142
EP - 33152
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 17
ER -