arxiv: 2603.01558 · v2 · submitted 2026-03-02 · 💻 cs.CV

TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding

Muhammet Esat Kalfaoglu , Halil Ibrahim Ozturk , Ozsel Kilinc , Alptekin Temizel

Pith reviewed 2026-05-15 18:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords: road topology, 3D centerline extraction, BEV mask prediction, dense offset field, height map, geographic data leakage, long-range benchmark

The pith

TopoMaskV3 adds dense offset and height heads to mask-based road topology so the pipeline runs as a standalone 3D predictor and reaches 28.5 OLS on geographically disjoint long-range tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Road topology methods extract centerlines from sensor data for mapping and planning. Earlier mask approaches stayed in 2D and still needed a separate parametric head to fix grid errors. TopoMaskV3 inserts an offset field that nudges each mask pixel to sub-grid accuracy and a height map that supplies elevation directly from the same dense representation. The authors also release new train-test splits that keep geographic regions completely separate and add a long-range evaluation out to 100 m. On this stricter benchmark the updated mask pipeline sets a new record while showing less overfitting than Bezier alternatives.

Core claim

TopoMaskV3 extends the mask pipeline with two dense heads: one predicts a 2D offset field inside each BEV cell to correct discretization, and one predicts a height value per cell for direct 3D centerline recovery. Together they remove the need for a parametric fusion stage. The same work introduces geographically distinct data partitions and a long-range benchmark designed to prevent location-based memorization. On these, the mask representation proves more robust to geographic overfitting than prior Bezier methods, and LiDAR fusion is most beneficial at long range.

What carries the argument

Dense offset field and dense height map heads that operate directly on the BEV mask grid to supply sub-pixel corrections and elevation without a separate parametric branch.
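
As a rough illustration of how such heads could combine, the sketch below lifts active mask cells to metric 3D points by adding a predicted sub-cell offset to each cell center and reading elevation from the height map. The function name `decode_points_3d`, the mask threshold, the offset units (fractions of a cell), and the grid-origin convention are all assumptions made for illustration; the paper's actual post-processing is not described in the text reviewed here.

```python
from typing import List, Tuple

Grid = List[List[float]]
Point3D = Tuple[float, float, float]


def decode_points_3d(
    mask: Grid,
    offset_x: Grid,
    offset_y: Grid,
    height: Grid,
    cell_size: float,
    origin: Tuple[float, float],
    threshold: float = 0.5,
) -> List[Point3D]:
    """Lift active BEV mask cells to metric 3D points (hypothetical helper).

    Assumptions: offsets are expressed in fractions of a cell relative to the
    cell center, `origin` is the metric position of the grid corner at
    (row 0, col 0), and `height` already holds metric elevation.
    """
    points: List[Point3D] = []
    for row, mask_row in enumerate(mask):
        for col, score in enumerate(mask_row):
            if score < threshold:
                continue  # cell is not part of this centerline instance
            # Cell center plus sub-cell correction, scaled to meters.
            x = origin[0] + (col + 0.5 + offset_x[row][col]) * cell_size
            y = origin[1] + (row + 0.5 + offset_y[row][col]) * cell_size
            z = height[row][col]  # elevation read from the dense height map
            points.append((x, y, z))
    return points
```

Without the offset terms every point would snap to a cell center, which is the discretization error the offset field is meant to correct.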

If this is right

Mask representations exhibit lower geographic overfitting than Bezier curve methods on the new splits.
LiDAR fusion improves scores most at long range and shows bigger gains on the original overlapping split.
Standalone 3D centerline extraction becomes possible without hybrid fusion stages.
Geographically disjoint evaluation becomes the required standard for fair road topology benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Autonomous systems could deploy the model in entirely new cities without retraining on local map data.
The same offset-plus-height pattern could be tested on other linear 3D structures such as overhead wires.
Removing the need for parametric fusion may lower compute cost enough for real-time onboard use.
Future datasets should adopt the geographic-split protocol as default practice.

Load-bearing premise

The offset and height heads produce accurate 3D corrections on their own and the new geographic splits remove every form of location-based memorization.
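
One simple sanity check on a geographic split, offered as a minimal sketch rather than the authors' protocol, is to measure what fraction of test positions lie within a chosen radius of any training position. The function name `overlap_fraction`, the use of planar (x, y) coordinates in meters, and the radius value are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Position = Tuple[float, float]


def overlap_fraction(
    train: List[Position], test: List[Position], radius_m: float
) -> float:
    """Fraction of test positions within radius_m of any training position.

    Hypothetical check: positions are assumed to be planar coordinates in
    meters in a shared frame. Brute force, so only suitable for small sets.
    """
    if not test:
        return 0.0
    hits = sum(
        1
        for tx, ty in test
        if any(math.hypot(tx - x, ty - y) <= radius_m for x, y in train)
    )
    return hits / len(test)
```

A result of zero at a given radius rules out positional overlap at that scale, but it says nothing about other regularities, such as road conventions shared across regions.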

What would settle it

Retraining the model without the offset or height heads and measuring whether OLS on the disjoint long-range split falls below the previous best methods.

Figures

Figures reproduced from arXiv: 2603.01558 by Alptekin Temizel, Halil Ibrahim Ozturk, Muhammet Esat Kalfaoglu, Ozsel Kilinc.

**Figure 1.** Quad-Direction Labels Encoding. Each centerline is assigned one of four directional labels: up, down, left, or right, based on majority voting between consecutive points. Ties are resolved using the angle between the start and end points.
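
The labeling rule in this caption can be sketched as follows. This is an illustrative reading, not the authors' code: the helper names, the axis convention (x to the right, y up), the per-segment vote by dominant axis, and the tie-break by the dominant axis of the start-to-end vector are all assumptions.

```python
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]


def dominant_direction(dx: float, dy: float) -> str:
    """Map a 2D displacement to one of four labels by its dominant axis."""
    if abs(dx) >= abs(dy):
        return "right" if dx >= 0 else "left"
    return "up" if dy > 0 else "down"


def quad_direction_label(points: List[Point]) -> str:
    """Majority vote over consecutive segments, with a start-to-end tie-break."""
    if len(points) < 2:
        raise ValueError("a centerline needs at least two points")
    votes = Counter(
        dominant_direction(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Tie between the top labels: fall back on the overall direction
        # from the first point to the last point.
        (xs, ys), (xe, ye) = points[0], points[-1]
        return dominant_direction(xe - xs, ye - ys)
    return ranked[0][0]
```

Choosing the dominant axis of the start-to-end vector is equivalent to bucketing its angle into four 90-degree sectors, which is one plausible interpretation of the angle-based tie-break.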

**Figure 2.** TopoMaskV3 Architecture Overview. The method adopts an instance-query-based design. Bird’s Eye View (BEV) features extracted from multi-camera images are processed by a transformer decoder that predicts: binary masks, quad-direction labels, 2D offsets, and height maps. A quad-direction-aware post-processing step then converts these dense outputs into 3D centerline instances.

**Figure 3.** TopoMaskV3 Decoder Architecture. Each sparse query is decoded by five parallel heads, each predicting a different centerline attribute.

Original abstract

Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (+/-100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TopoMaskV3 adds dense offset and height heads to the mask pipeline, plus geographic splits for a harder benchmark, but the SOTA claim needs more checks on whether those splits actually block leakage.

The core advance is moving the mask-based centerline approach from 2D to native 3D without relying on a separate parametric head. They add a dense offset field to correct sub-grid positions and a dense height map for direct 3D output. That looks like a clean fix for the discretization issues mentioned in prior mask work. They also introduce geographically distinct train/test splits and a long-range benchmark, which is a practical step for autonomous driving evaluation where location-based memorization is a real risk.

The abstract reports 28.5 OLS on the new disjoint split, beating prior methods, and notes that the mask representation holds up better than Bezier under geographic shifts while LiDAR helps more at distance. Those are useful observations if they hold up in the full experiments.

The main soft spot is verification. The abstract claims the splits prevent memorization, but without details on split construction, ablation tables showing the contribution of each new head, or quantitative checks for residual regional patterns, it's difficult to judge how much the gains come from the architecture versus incomplete leakage control. The stress-test point about shared road conventions across coarse geographic boundaries is worth pressing in review.

This paper is aimed at the road topology and BEV perception crowd rather than general computer vision. Readers working on mask versus query methods or on realistic driving benchmarks will find the ideas worth examining. The work is grounded enough in a concrete task and shows clear thinking about evaluation pitfalls, so it deserves a serious referee even if revisions are needed on the experimental rigor.

Referee Report

2 major / 2 minor

Summary. The paper introduces TopoMaskV3, extending mask-based road topology methods to 3D via two new dense prediction heads (offset field for sub-grid correction and height map for direct 3D estimation). It also defines geographically distinct train/test splits and a long-range benchmark to reduce geographic leakage, reporting SOTA performance of 28.5 OLS on the disjoint split while providing analysis of mask robustness versus Bezier curves and LiDAR fusion benefits.

Significance. If the empirical gains are confirmed, the work offers a standalone 3D centerline predictor that avoids parametric fusion and introduces evaluation practices that could reduce memorization risks in geographic datasets; the mask-vs-Bezier robustness findings may guide representation choices in future topology models.

major comments (2)

[Abstract] Abstract and Experiments section: the central claim that the dense offset and height heads produce accurate 3D corrections sufficient for standalone use (without additional parametric fusion) is load-bearing for the 28.5 OLS result, yet the visible text provides no ablation tables, error analysis, or quantitative attribution linking these heads specifically to the reported gain over prior methods.
[Abstract] Abstract: the geographically distinct splits are asserted to eliminate location-based memorization, but no quantitative verification (e.g., performance metrics under stricter feature-matched cross-region testing or comparison of regional pattern similarity) is supplied to rule out residual leakage via shared road topologies or densities, which directly undermines the fairness of the new benchmark and the SOTA claim.

minor comments (2)

Clarify the precise formulation of the OLS metric for 3D predictions and how height/offset errors are incorporated.
Add explicit comparison tables showing results on both the original overlapping split and the new disjoint split to quantify the memorization effect mentioned in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested empirical support.

Referee: [Abstract] Abstract and Experiments section: the central claim that the dense offset and height heads produce accurate 3D corrections sufficient for standalone use (without additional parametric fusion) is load-bearing for the 28.5 OLS result, yet the visible text provides no ablation tables, error analysis, or quantitative attribution linking these heads specifically to the reported gain over prior methods.

Authors: We agree that the manuscript would benefit from explicit ablations to attribute the performance gains. In the revised version we will add ablation tables in the Experiments section that isolate the contribution of the dense offset field and height map heads, including direct comparisons of OLS scores with and without each head. We will also include error analysis quantifying the reduction in discretization artifacts and 3D estimation accuracy provided by these heads relative to the prior TopoMaskV2 baseline. revision: yes
Referee: [Abstract] Abstract: the geographically distinct splits are asserted to eliminate location-based memorization, but no quantitative verification (e.g., performance metrics under stricter feature-matched cross-region testing or comparison of regional pattern similarity) is supplied to rule out residual leakage via shared road topologies or densities, which directly undermines the fairness of the new benchmark and the SOTA claim.

Authors: We acknowledge that stronger quantitative verification of reduced leakage would strengthen the benchmark claims. In revision we will add analysis comparing performance on feature-matched cross-region subsets and report similarity metrics for road topologies and densities across the geographic splits. This will provide direct evidence supporting the fairness of the disjoint benchmark and the reported 28.5 OLS SOTA result. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical evaluation of new heads and splits

The paper advances a mask-based pipeline by adding two dense prediction heads (offset field and height map) and introduces geographically distinct splits plus a long-range benchmark. The 28.5 OLS SOTA claim is presented as the outcome of experimental comparison on these splits. No mathematical derivation chain, equations, or self-referential definitions appear in the provided text. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The central results are externally falsifiable via the reported metrics and splits rather than reducing to fitted parameters or prior self-work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The new heads are architectural additions whose internal hyperparameters are not detailed here.

Reference graph

Works this paper leans on

[1] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[2] Yigit Baran Can, Alexander Liniger, Danda Pani Paudel, and Luc Van Gool. Structured bird’s-eye-view traffic scene understanding from onboard images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15661–15670, 2021.
[3] Shaoyu Chen, Tianheng Cheng, Xinggang Wang, Wenming Meng, Qian Zhang, and Wenyu Liu. Efficient and robust 2d-to-bev representation learning via geometry-guided kernel transformer. arXiv preprint arXiv:2206.04584, 2022.
[4] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
[5] Yanping Fu, Wenbin Liao, Xinyuan Liu, Yike Ma, Feng Dai, Yucheng Zhang, and others. TopoLogic: An Interpretable Pipeline for Lane Topology Reasoning on Driving Scenes. arXiv preprint arXiv:2405.14747, 2024.
[6] Yanping Fu, Xinyuan Liu, Tianyu Li, Yike Ma, Yucheng Zhang, and Feng Dai. TopoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving. arXiv:2505.17771 [cs].
[7] Adam W Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-bev: What really matters for multi-sensor bev perception? In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2759–2765. IEEE, 2023.
[8] Junjie Huang and Guan Huang. Bevpoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111, 2022.
[9] Junjie Huang, Guan Huang, Zheng Zhu, Ye Yun, and Dalong Du. BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View. arXiv preprint arXiv:2112.11790, 2021.
[10] Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, and Alptekin Temizel. TopoBDA: Towards Bezier Deformable Attention for Road Topology Understanding. arXiv preprint arXiv:2412.18951, 2024.
[11] Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, and Alptekin Temizel. TopoMaskV2: Enhanced Instance-Mask-Based Formulation for the Road Topology Problem. arXiv preprint arXiv:2409.11325, 2024.
[12] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13619–13627, 2022.
[13] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023.
[14] Han Li, Zehao Huang, Zitian Wang, Wenge Rong, Naiyan Wang, and Si Liu. Enhancing 3D Lane Detection and Topology Reasoning with 2D Lane Priors, 2024. arXiv:2406.03105 [cs].
[15] Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In 2022 International Conference on Robotics and Automation (ICRA), pages 4628–4634. IEEE, 2022.
[16] Tianyu Li, Li Chen, Huijie Wang, Yang Li, Jiazhi Yang, Xiangwei Geng, Shengyin Jiang, Yuting Wang, Hang Xu, Chunjing Xu, and others. Graph-based topology reasoning for driving scenes. arXiv preprint arXiv:2304.05277, 2023.
[17] Tianyu Li, Peijin Jia, Bangjun Wang, Li Chen, Kun Jiang, Junchi Yan, and Hongyang Li. LaneSegNet: Map Learning with Lane Segment Perception for Autonomous Driving. In ICLR, 2024.
[18] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1477–1485, 2023.
[19] Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, and others. Fast-bev: A fast and strong bird’s-eye view perception baseline. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):8665–8679, 2024. Publisher: IEEE.
[20] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. Publisher: IEEE.
[21] Bencheng Liao, Shaoyu Chen, Bo Jiang, Tianheng Cheng, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction. arXiv preprint arXiv:2303.08815, 2023.
[22] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In International Conference on Learning Representations, 2023.
[23] Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Maptrv2: An end-to-end framework for online vectorized hd map construction. International Journal of Computer Vision, pages 1–23, 2024. Publisher: Springer.
[24] Adam Lilja, Junsheng Fu, Erik Stenborg, and Lars Hammarstrand. Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It, 2024. arXiv:2312.06420 [cs].
[25] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In International Conference on Learning Representations, 2022.
[26] Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End-to-end vectorized hd map learning. In International Conference on Machine Learning, pages 22352–22369. PMLR, 2023.
[27] Katie Z Luo, Xinshuo Weng, Yan Wang, Shuang Wu, Jie Li, Kilian Q Weinberger, Yue Wang, and Marco Pavone. Augmenting Lane Perception and Topology Understanding with Standard Definition Navigation Maps. arXiv preprint arXiv:2311.04079, 2023.
[28] Changsheng Lv, Mengshi Qi, Liang Liu, and Huadong Ma. T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving. arXiv preprint arXiv:2411.18894, 2024.
[29] Zhongxing Ma, Shuang Liang, Yongkun Wen, Weixin Lu, and Guowei Wan. RoadPainter: Points Are Ideal Navigators for Topology transformER. arXiv preprint arXiv:2407.15349, 2024.
[30] TorchVision maintainers and contributors. TorchVision: PyTorch’s Computer Vision library, 2016.
[31] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
[32] Huijie Wang, Tianyu Li, Yang Li, Li Chen, Chonghao Sima, Zhenbo Liu, Bangjun Wang, Peijin Jia, Yuting Wang, Shengyin Jiang, and others. Openlane-v2: A topology reasoning benchmark for unified 3d hd mapping. Advances in Neural Information Processing Systems, 36, 2024.
[33] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3621–3631, 2023.
[34] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, and others. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.
[35] Dongming Wu, Jiahao Chang, Fan Jia, Yingfei Liu, Tiancai Wang, and Jianbing Shen. TopoMLP: An Simple yet Strong Pipeline for Driving Topology Reasoning. ICLR, 2024.
[36] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose M Alvarez. M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation. arXiv preprint arXiv:2204.05088, 2022.
[37] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018. Publisher: Multidisciplinary Digital Publishing Institute.
[38] Sen Yang, Minyue Jiang, Ziwei Fan, Xiaolu Xie, Xiao Tan, Yingying Li, Errui Ding, Liang Wang, and Jingdong Wang. TopoSD: Topology-Enhanced Lane Segment Perception with SDMap Prior. arXiv preprint arXiv:2411.14751.
[39] Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, and Zhen Li. FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models. arXiv preprint arXiv:2507.23325, 2025.
[40] Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Yan, Kun Tang, Xinrui Yan, Chao Zheng, Shuguang Cui, and Zhen Li. TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving. arXiv:2507.00709 [cs].
[41] Junjie Ye, David Paz, Hengyuan Zhang, Yuliang Guo, Xinyu Huang, Henrik I Christensen, Yue Wang, and Liu Ren. SMART: Advancing Scalable Map Priors for Driving Topology Reasoning. arXiv preprint arXiv:2502.04329, 2025.
[42] Tianyuan Yuan, Yicheng Liu, Yue Wang, Yilun Wang, and Hang Zhao. Streammapnet: Streaming mapping network for vectorized online hd map construction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7356–7365, 2024.
[43] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
[44] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13760–13769, 2022.