TY - JOUR
T1 - UniVoxel
T2 - A Novel Framework for 3-D Object Detection in Autonomous Vehicles With Multimodal Voxel Representation
AU - Liu, Kaiqi
AU - Deng, Yuanyuan
AU - Tong, Jiaxun
AU - Li, Wei
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Fusing camera and LiDAR information is an effective means of achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. UniVoxel integrates inputs from various modalities into a voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which low-level voxel features are used to adaptively sample multiscale image features, producing unified multimodal voxel features. These multimodal voxel features contain both the geometric and semantic information of the voxels and ensure that the model focuses on the regions of interest (RoIs). Meanwhile, to maximize the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels by interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the proposed framework, object detection precision is improved on both the validation set and the test set.
AB - Fusing camera and LiDAR information is an effective means of achieving robust 3-D object detection. However, current 3-D multimodal methods typically rely on independent branches to extract features from different sensors separately, leading to underutilization of complementary information. In this article, a multimodal detector named UniVoxel is proposed, which is built on a query-based detection paradigm. UniVoxel integrates inputs from various modalities into a voxel representation for fusion. Specifically, a semantic-guided query generator (SQG) is proposed, in which low-level voxel features are used to adaptively sample multiscale image features, producing unified multimodal voxel features. These multimodal voxel features contain both the geometric and semantic information of the voxels and ensure that the model focuses on the regions of interest (RoIs). Meanwhile, to maximize the utilization of complementary information, a fusion voxel encoder (FVE) is introduced to update the multimodal voxels by interacting with the multiscale semantic information of different cameras. Extensive experiments are conducted on the nuScenes dataset. With the proposed framework, object detection precision is improved on both the validation set and the test set.
KW - 3-D object detection
KW - deformable attention
KW - multimodal backbone
UR - http://www.scopus.com/pages/publications/105012282802
U2 - 10.1109/JSEN.2025.3589494
DO - 10.1109/JSEN.2025.3589494
M3 - Article
AN - SCOPUS:105012282802
SN - 1530-437X
VL - 25
SP - 33142
EP - 33152
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 17
ER -