Online Vectorized HD Map Construction using Geometry

ECCV 2024
Zhixin Zhang1, Yiyuan Zhang2, Xiaohan Ding3, Fusheng Jin1*, Xiangyu Yue2
1Beijing Institute of Technology
2The Chinese University of Hong Kong, 3Tencent AI Lab
arXiv Code PDF

Teaser

Online vectorized HD maps provide timely, compact, and semantically rich
environmental information for autonomous driving vehicles.
Using geometry enhances the construction of online vectorized HD maps,
leading to better robustness to occlusions and ego-vehicle movements.

"Robustness under rainy weather conditions."

"Better performances in more occluded and rotated scenes."

Abstract

The construction of online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task; however, the shapes and relations of instances in urban road systems are still under-explored, such as rectangular shapes, parallelism, perpendicularity, or particular lane widths.

geometry

Moreover, these shape and relation properties remain unchanged when the ego-vehicle moves forward or turns. Previous methods represent instances in absolute coordinates, which are sensitive to translations and rotations under such conditions.

geometry

In our work, we propose GeMap (Geometry Map), which learns Euclidean shapes and relations of map instances end to end, going beyond basic perception. Specifically, we design a geometric loss based on angle and distance clues, which is robust to rigid transformations. We also decouple self-attention to independently handle Euclidean shapes and relations. Our method achieves new state-of-the-art performance on the NuScenes and Argoverse 2 datasets. Remarkably, it reaches 71.8% mAP on the large-scale Argoverse 2 dataset, outperforming MapTR V2 by +4.4% and surpassing the 70% mAP threshold for the first time.

"A short sample driving video."

Video Presentation

GeMap

framework

Illustration of the GeMap framework. First, Perspective View (PV) images are transformed into Bird's-Eye-View (BEV) features; then a geometry-decoupled decoder outputs the vectorized HD map. In each decoder block, queries are first processed by Euclidean shape and relation attention, which focuses on geometric relevance. Finally, predictions are refined in G-Representations by shape and relation constraints.
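To make the decoder structure concrete, below is a minimal sketch of one geometry-decoupled decoder block, assuming flattened instance/point queries and precomputed BEV features. The module names, layer sizes, and argument layout are illustrative assumptions, not the authors' implementation:

# Minimal sketch of one GeMap-style decoder block (illustrative, not the official code).
# Assumes queries shaped (B, N*P, C) for N instances with P points each,
# and BEV features flattened to (B, H*W, C).
import torch
import torch.nn as nn


class GeometryDecoupledBlock(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        # Decoupled self-attentions: intra-instance (shape) and inter-instance (relation).
        self.shape_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.relation_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Cross-attention from map queries to BEV features.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
                                 nn.Linear(4 * embed_dim, embed_dim))
        self.norms = nn.ModuleList([nn.LayerNorm(embed_dim) for _ in range(4)])

    def forward(self, queries, bev_feats, shape_mask, relation_mask):
        # shape_mask / relation_mask: (N*P, N*P) boolean masks, True = attention blocked.
        q = self.norms[0](queries + self.shape_attn(queries, queries, queries,
                                                    attn_mask=shape_mask)[0])
        q = self.norms[1](q + self.relation_attn(q, q, q, attn_mask=relation_mask)[0])
        q = self.norms[2](q + self.cross_attn(q, bev_feats, bev_feats)[0])
        return self.norms[3](q + self.ffn(q))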

Geometric Representation

We introduce a geometric representation that captures the shapes of individual map instances (Euclidean shape clues) and the relations between different instances (Euclidean relation clues).
framework

Euclidean shape clues. The lengths of displacement vectors and the angles between neighboring vectors constitute the shape clues and are used to compute the shape loss. For example, a rectangular shape corresponds to 90-degree angles and pairs of equal-length sides.
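The sketch below shows how these shape clues could be computed for a single polyline instance. Tensor shapes, the cosine parameterization of angles, and the L1 loss are illustrative assumptions rather than the paper's exact formulation:

# Hedged sketch of Euclidean shape clues for one polyline instance.
import torch


def shape_clues(points: torch.Tensor, eps: float = 1e-6):
    """points: (P, 2) ordered vertices of a single map instance."""
    disp = points[1:] - points[:-1]                    # (P-1, 2) displacement vectors
    lengths = disp.norm(dim=-1)                        # (P-1,) edge lengths
    cos_angles = (disp[:-1] * disp[1:]).sum(-1) / (    # (P-2,) cosine of the angle
        lengths[:-1] * lengths[1:] + eps)              # between neighboring vectors
    return lengths, cos_angles


# Lengths and angles are invariant to translation and rotation of the instance,
# so comparing predicted and ground-truth clues gives a shape loss that is
# robust to rigid transformations (a sketch, not the paper's exact loss):
def shape_loss(pred_pts, gt_pts):
    pred_len, pred_ang = shape_clues(pred_pts)
    gt_len, gt_ang = shape_clues(gt_pts)
    return (pred_len - gt_len).abs().mean() + (pred_ang - gt_ang).abs().mean()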

framework

Euclidean relation clues. Distances between pairs of points and angles between pairs of displacement vectors provide the relation clues. These clues are directly connected to Euclidean relations between instances, such as parallelism or perpendicularity.
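A hedged sketch of how such relation clues could be computed between two instances follows; the pairwise layout and cosine parameterization are assumptions for illustration:

# Hedged sketch of Euclidean relation clues between two map instances.
import torch


def relation_clues(points_a: torch.Tensor, points_b: torch.Tensor, eps: float = 1e-6):
    """points_a: (Pa, 2) and points_b: (Pb, 2) vertices of two map instances."""
    # Distances between every pair of points across the two instances.
    dists = torch.cdist(points_a, points_b)                        # (Pa, Pb)

    # Cosines of angles between every pair of displacement vectors.
    da = points_a[1:] - points_a[:-1]                              # (Pa-1, 2)
    db = points_b[1:] - points_b[:-1]                              # (Pb-1, 2)
    cos = (da @ db.T) / (da.norm(dim=-1, keepdim=True)
                         * db.norm(dim=-1)[None, :] + eps)         # (Pa-1, Pb-1)
    return dists, cos

# Parallel lane dividers give cosines near +/-1, while a crossing perpendicular
# to a road boundary gives cosines near 0, exposing parallelism and
# perpendicularity directly in the clues.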

Geometry-Decoupled Attention

The geometry of shape and the geometry of relation pertain to distinct subsets of tokens. For any given map instance, its shape is tied to the tokens representing that instance's points. Capturing shape geometry precisely requires the model to discern token correlations within a single instance while avoiding interference from tokens of unrelated instances. Conversely, modeling relation geometry benefits from isolating token correlations that span different instances rather than those confined within a single instance.

framework

Therefore, we propose Geometry-Decoupled Attention (GDA). GDA decouples the vanilla self-attention module into Euclidean shape attention, which captures intra-instance geometry, and Euclidean relation attention, which captures inter-instance geometry.
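One simple way to realize this decoupling is with attention masks, as sketched below for the block shown earlier. The flattened token ordering (N instances times P points) and the mask construction are assumptions for illustration, not the authors' exact implementation:

# Hedged sketch of attention masks for Geometry-Decoupled Attention.
# True entries are blocked, matching torch.nn.MultiheadAttention's attn_mask.
import torch


def build_gda_masks(num_instances: int, num_points: int):
    n = num_instances * num_points
    instance_id = torch.arange(n) // num_points            # instance index per token
    same_instance = instance_id[:, None] == instance_id[None, :]

    # Shape attention: only tokens of the same instance attend to each other.
    shape_mask = ~same_instance
    # Relation attention: only tokens of different instances attend to each other,
    # keeping the diagonal unblocked so every token can at least attend to itself.
    relation_mask = same_instance & ~torch.eye(n, dtype=torch.bool)
    return shape_mask, relation_mask


# Usage with the decoder block sketched above (hypothetical sizes):
# shape_mask, relation_mask = build_gda_masks(num_instances=50, num_points=20)
# out = block(queries, bev_feats, shape_mask, relation_mask)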


Experiment

nuscenes

Comparison on the NuScenes dataset. Our method reaches new state-of-the-art performance. "EB0", "R50", "PP", and "Sec" denote EfficientNet-B0, ResNet50, PointPillars, and SECOND, respectively. Methods with two backbones use both camera and LiDAR inputs. "Seg. Loss" denotes whether any segmentation-based loss is adopted. The best result is highlighted in bold. We reproduce the different methods on a single RTX 3090 GPU to test FPS for a fair comparison.

av2

Comparison on the Argoverse 2 dataset. Our method demonstrates significant performance improvements over previous methods.

ablation

Ablation study on the NuScenes dataset. For fast ablation, we train the model for only 24 epochs and use simplified objectives.

Visualization

We also visualize some challenging cases, especially scenes that are heavily occluded or rotated. Visualization results under different weather conditions are also provided. The challenging instances are highlighted with orange boxes.
main

Scenario (a) depicts a complex triangular road boundary with an extended tail. Scenarios (b) and (c) depict pedestrian crossings that can only be partially observed. In scenario (d), the BEV map is tilted and lane markings are obscured by vehicles. These challenging cases indicate the superiority and robustness of GeMap.

rainy

Visualization results under challenging rainy weather conditions. Even with noisy reflections on the road and map instances occluded by water drops, our method still provides robust predictions.

cloudy

Visualization results under cloudy weather conditions.

sunny

Visualization results under sunny weather conditions.

BibTeX

If you find our work useful, please cite our paper. The BibTeX entry is provided below:
@article{zhang2023online,
  title={Online Vectorized HD Map Construction using Geometry},
  author={Zhang, Zhixin and Zhang, Yiyuan and Ding, Xiaohan and Jin, Fusheng and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2312.03341},
  year={2023}
}