
arXiv: 2304.10891 · v2 · submitted 2023-04-21 · 💻 cs.LG · cs.AI · cs.CV · cs.RO · cs.SY · eess.SY

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Pith reviewed 2026-05-24 09:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.RO · cs.SY · eess.SY
keywords autonomous driving · transformer models · model compression · quantization · pruning · knowledge distillation · deployment · safety

The pith

Compression strategies for Transformer autonomous driving models must be integrated into system design rather than applied afterward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews representative Transformer-based models for autonomous driving tasks including perception, prediction, and planning. It organizes the models by task role, sensing configuration, and architectural design while examining their computational demands. The central argument is that high-capacity attention architectures create latency, memory, and energy barriers to real-vehicle use, making compression techniques such as quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention necessary from the design stage. A sympathetic reader would care because treating compression as a system-level factor directly shapes deployability, robustness, and safety outcomes instead of leaving them as afterthoughts. The paper concludes by identifying open challenges for standardized, safety-aware, and hardware-conscious evaluation.
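As a concrete illustration of one technique named above, the following is a minimal sketch of symmetric per-tensor int8 quantization. It is not taken from the survey: the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and real deployments typically rely on a framework's quantization toolkit with calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to the int8 range [-127, 127].

    Hypothetical helper for illustration; returns the integer codes and
    the scale needed to map them back to floating point.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Illustrative weights only; they do not come from any model in the survey.
weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale. Whether that error is acceptable for a given driving task is the kind of task-dependent question the survey raises.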

Core claim

Rather than treating compression as an isolated post-processing step, the survey highlights it as a system-level design consideration that directly affects deployability, robustness, and safety of Transformer-based autonomous driving models.

What carries the argument

Deployment-oriented perspective that examines how efficiency constraints reshape model design choices across task roles and sensing configurations.

If this is right

  • Model architectures will be selected and modified with upfront awareness of which compression methods preserve performance on specific driving tasks.
  • Safety and robustness testing will need to evaluate compressed versions on target hardware rather than full-precision models alone.
  • Future system designs will prioritize efficient attention mechanisms and low-rank approximations during initial development.
  • Evaluation benchmarks will incorporate metrics for latency, memory, and energy under realistic vehicle constraints.
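The latency metric mentioned in the last point can be sketched with a small timing harness. This is an assumption-laden illustration, not a method from the paper: `measure_latency` and `dummy_model` are hypothetical names, the workload is a stand-in for a real inference call, and memory and energy would need platform-specific tooling that is not shown here.

```python
import math
import time


def measure_latency(fn, warmup=5, runs=50):
    """Time repeated calls to fn and report median and tail latency in ms."""
    # Warm-up calls are discarded so one-off start-up costs do not skew results.
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()

    def percentile(p):
        # Nearest-rank percentile over the sorted samples.
        rank = max(1, math.ceil(p / 100.0 * len(samples_ms)))
        return samples_ms[rank - 1]

    return {
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
        "max_ms": samples_ms[-1],
    }


def dummy_model():
    """Hypothetical stand-in for a model's forward pass."""
    return sum(i * i for i in range(10_000))


stats = measure_latency(dummy_model)
print(stats)
```

Reporting tail latency alongside the median matters for real-time systems, because a control loop has to meet its deadline on the slowest frames as well as the typical ones.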

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware platforms for vehicles may need accelerators tuned specifically to the compressed attention patterns common in these models.
  • The same system-level view could be tested on non-Transformer architectures to see if the deployability benefits hold more generally.
  • Regulatory requirements for autonomous vehicles might eventually demand documented compression strategies as part of safety certification.

Load-bearing premise

The survey assumes that the representative models and compression strategies selected from the literature are sufficiently complete and unbiased to support general statements about task-dependent applicability and design trade-offs.

What would settle it

A systematic review that adds many previously omitted models and shows compression applicability patterns that contradict the surveyed task-dependent conclusions would falsify the general claims.

Figures

Figures reproduced from arXiv: 2304.10891 by Juan Zhong, Xi Chen, Yuhang Shi, Zukang Xu.

Figure 1: The left panel depicts self-attention (or scaled dot
Figure 2: A timeline diagram illustrating the history and key milestones of attention mechanisms and Transformer architectures research.
Figure 3: The architecture of ViT, the left panel shows the
Figure 4: Transformer inputs and outputs in different encoder and decode structures: (a) Object query from 2D image features; (b)
Figure 5: Layers in ResNet and Swin-Transformer: (a) The ResNet basic unit, known as the "bottleneck," comprises 1x1 and 3x3
Figure 6: A table lists primary operators for deploying an example Transformer model onto the portable hardware. Parameter Category:
Figure 7: Transformer 4D Encoder Structure: BEVformer encoder structure "encoder layer", the same as Swin-Transformer, BEVformer
Original abstract

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
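The memory overhead of attention noted in the abstract can be made tangible with back-of-the-envelope arithmetic. The sketch below is an assumption, not a result from the survey: it counts entries in the attention score matrix for full self-attention (every token attends to every token) and for a windowed variant in which each token attends to a fixed-size neighbourhood. The 200 × 200 token grid, the window of 49 tokens, the 8 heads, and the 4-byte values are all illustrative choices.

```python
def attention_score_entries(num_tokens, window=None):
    """Number of attention scores per head.

    Full self-attention scores every token against every token (N * N).
    With a fixed window, each token is scored against at most `window` tokens.
    """
    if window is None:
        return num_tokens * num_tokens
    return num_tokens * min(window, num_tokens)


def score_memory_mib(entries, bytes_per_value=4, heads=8):
    """Memory for the score matrices across all heads, in MiB."""
    return entries * bytes_per_value * heads / (1024 ** 2)


# Hypothetical grid of 200 x 200 tokens (illustrative, not from the paper).
tokens = 200 * 200
full = attention_score_entries(tokens)
local = attention_score_entries(tokens, window=49)

print(f"full:     {full} scores, {score_memory_mib(full):.1f} MiB")
print(f"windowed: {local} scores, {score_memory_mib(local):.1f} MiB")
```

Under these assumed numbers the full score matrix is several hundred times larger than the windowed one, which is why efficient-attention designs are grouped with compression methods as deployment levers.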

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This survey reviews Transformer-based models for autonomous driving, organizing them by task role (perception, prediction, planning), sensing configuration, and architectural design. It analyzes compression and acceleration techniques including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, with discussion of their benefits, limitations, and task-dependent applicability. The central thesis is that compression should be treated as a system-level design consideration affecting deployability, robustness, and safety rather than a post-processing step, and the paper concludes by identifying open challenges for standardized, safety-aware evaluation.

Significance. If the reviewed models and methods are representative, the survey would usefully synthesize an emerging intersection of Transformers and efficient AD systems, providing researchers with a deployment-oriented lens that connects architectural choices to real-vehicle constraints. The explicit framing of compression as integral to safety and robustness could influence future work on hardware-conscious AD pipelines.

major comments (2)
  1. [Introduction] The manuscript states that it reviews 'representative' Transformer-based AD models and compression strategies but contains no description of the literature search protocol, databases, keywords, inclusion/exclusion criteria, date range, or total paper count (Introduction and §2). This absence is load-bearing for the claims of task-dependent applicability and the system-level safety perspective, because omitted counterexamples (e.g., cases where compression degrades safety metrics) could invalidate the highlighted patterns.
  2. [Compression Strategies] §4 (compression review) asserts task-dependent trade-offs and limitations without citing a systematic selection process or quantitative meta-analysis of the reviewed works. The general statements on robustness and safety therefore rest on an unverified sample; a concrete test would be to report how many papers were screened versus included and whether any safety-critical negative results were excluded.
minor comments (2)
  1. Figure captions and table headers could more explicitly link back to the system-level design claim (e.g., by annotating which compression methods are shown to affect safety metrics).
  2. A small number of citations appear to be from preprints without noting their archival status; adding DOIs or arXiv identifiers would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our survey. We agree that greater transparency regarding the literature selection process will strengthen the paper and support the claims of representativeness and task-dependent applicability. We address each major comment below.

Point-by-point responses
  1. Referee: [Introduction] The manuscript states that it reviews 'representative' Transformer-based AD models and compression strategies but contains no description of the literature search protocol, databases, keywords, inclusion/exclusion criteria, date range, or total paper count (Introduction and §2). This absence is load-bearing for the claims of task-dependent applicability and the system-level safety perspective, because omitted counterexamples (e.g., cases where compression degrades safety metrics) could invalidate the highlighted patterns.

    Authors: We agree that the manuscript would benefit from an explicit description of the literature search process. Although the survey is intended as a representative rather than exhaustive systematic review, the lack of this information does limit assessment of scope and potential omissions. In the revised version we will add a dedicated subsection to §2 that specifies the databases searched, keywords and queries employed, inclusion/exclusion criteria, date range, and approximate counts of papers screened versus included. This addition will directly support the claims of representativeness and allow readers to evaluate the risk of omitted counterexamples. revision: yes

  2. Referee: [Compression Strategies] §4 (compression review) asserts task-dependent trade-offs and limitations without citing a systematic selection process or quantitative meta-analysis of the reviewed works. The general statements on robustness and safety therefore rest on an unverified sample; a concrete test would be to report how many papers were screened versus included and whether any safety-critical negative results were excluded.

    Authors: We agree that §4 would be strengthened by greater transparency on paper selection. While the review is narrative rather than a quantitative meta-analysis, we will revise the section to describe the selection criteria for the compression strategies and papers discussed, report screened versus included counts where records permit, and note any safety-critical negative results that were considered. These changes will provide clearer grounding for the statements on task-dependent trade-offs, robustness, and safety. revision: yes

Circularity Check

0 steps flagged

No circularity: literature survey with no derivations or predictions

full rationale

The paper is a survey that reviews and organizes existing Transformer-based autonomous driving models and compression methods from the literature. It presents no equations, no fitted parameters, no predictions, and no derivation chain. The central claim is a perspective on treating compression as a system-level factor, supported by synthesis of reviewed works rather than any self-referential reduction. No self-citation load-bearing, ansatz smuggling, or renaming of results occurs. The selection of representative models is acknowledged as a potential limitation in the reader's take, but that is a completeness issue, not circularity. This matches the default expectation for non-circular survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey with no new mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5733 in / 1015 out tokens · 27441 ms · 2026-05-24T09:32:27.712912+00:00 · methodology


Reference graph

Works this paper leans on

123 extracted references · 123 canonical work pages · 2 internal anchors

  1. [1]

    Three decades of driver assistance systems: Review and future perspectives,

    K. Bengler, K. Dietmayer, B. Farber, M. Maurer, C. Stiller, and H. Winner, “Three decades of driver assistance systems: Review and future perspectives,” IEEE Intelligent trans- portation systems magazine , vol. 6, no. 4, pp. 6–22, 2014

  2. [2]

    Autonomous cars: Research results, issues, and future challenges,

    R. Hussain and S. Zeadally, “Autonomous cars: Research results, issues, and future challenges,” IEEE Communica- tions Surveys & Tutorials , vol. 21, no. 2, pp. 1275–1313, 2018

  3. [3]

    A survey of deep learning techniques for autonomous driving,

    S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, “A survey of deep learning techniques for autonomous driving,” Journal of Field Robotics, vol. 37, no. 3, pp. 362–386, 2020

  4. [4]

    Autonomous driving in urban environments: approaches, lessons and challenges,

    M. Campbell, M. Egerstedt, J. P. How, and R. M. Murray, “Autonomous driving in urban environments: approaches, lessons and challenges,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 368, no. 1928, pp. 4649–4672, 2010

  5. [5]

    Real-time motion planning methods for autonomous on- road driving: State-of-the-art and future research directions,

    C. Katrakazas, M. Quddus, W.-H. Chen, and L. Deka, “Real-time motion planning methods for autonomous on- road driving: State-of-the-art and future research directions,” Transportation Research Part C: Emerging Technologies , vol. 60, pp. 416–442, 2015

  6. [6]

    A survey of motion planning and control techniques for self-driving urban vehicles,

    B. Paden, M. ˇC´ap, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on intelli- gent vehicles, vol. 1, no. 1, pp. 33–55, 2016

  7. [7]

    Simultane- ous localization and mapping: A survey of current trends in autonomous driving,

    G. Bresson, Z. Alsayed, L. Yu, and S. Glaser, “Simultane- ous localization and mapping: A survey of current trends in autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 2, no. 3, pp. 194–220, 2017

  8. [8]

    Deep learning,

    Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015

  9. [9]

    Recent advances in convolutional neural networks,

    J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Caiet al., “Recent advances in convolutional neural networks,” Pattern recognition, vol. 77, pp. 354–377, 2018

  10. [10]

    A survey of autonomous driving: Common practices and emerging technologies,

    E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A survey of autonomous driving: Common practices and emerging technologies,” IEEE access , vol. 8, pp. 58 443– 58 469, 2020

  11. [11]

    A survey of deep learning applications to autonomous vehicle control,

    S. Kuutti, R. Bowden, Y . Jin, P. Barber, and S. Fallah, “A survey of deep learning applications to autonomous vehicle control,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 2, pp. 712–733, 2020

  12. [12]

    Autonomous driving ar- chitectures: insights of machine learning and deep learning algorithms,

    M. R. Bachute and J. M. Subhedar, “Autonomous driving ar- chitectures: insights of machine learning and deep learning algorithms,” Machine Learning with Applications , vol. 6, p. 100164, 2021

  13. [13]

    A review on autonomous vehicles: Progress, methods and challenges,

    D. Parekh, N. Poddar, A. Rajpurkar, M. Chahal, N. Kumar, G. P. Joshi, and W. Cho, “A review on autonomous vehicles: Progress, methods and challenges,” Electronics, vol. 11, no. 14, p. 2162, 2022

  14. [14]

    Deep reinforcement learning for autonomous driving: A survey,

    B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. P ´erez, “Deep reinforcement learning for autonomous driving: A survey,” IEEE Transac- tions on Intelligent Transportation Systems , vol. 23, no. 6, pp. 4909–4926, 2021

  15. [15]

    Deep learning-based image 3d object detection for autonomous driving,

    S. Y . Alaba and J. E. Ball, “Deep learning-based image 3d object detection for autonomous driving,” IEEE Sensors Journal, 2023

  16. [16]

    Deep learning-based vehicle behavior prediction for autonomous driving applications: A review,

    S. Mozaffari, O. Y . Al-Jarrah, M. Dianati, P. Jennings, and A. Mouzakitis, “Deep learning-based vehicle behavior prediction for autonomous driving applications: A review,” IEEE Transactions on Intelligent Transportation Systems , vol. 23, no. 1, pp. 33–47, 2020

  17. [17]

    A survey on trajectory-prediction methods for autonomous driving,

    Y . Huang, J. Du, Z. Yang, Z. Zhou, L. Zhang, and H. Chen, “A survey on trajectory-prediction methods for autonomous driving,” IEEE Transactions on Intelligent Vehicles , vol. 7, no. 3, pp. 652–674, 2022

  18. [18]

    Deep learning for image and point cloud fusion in autonomous driving: A review,

    Y . Cui, R. Chen, W. Chu, L. Chen, D. Tian, Y . Li, and D. Cao, “Deep learning for image and point cloud fusion in autonomous driving: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 2, pp. 722– 739, 2021

  19. [19]

    Planning and decision-making for autonomous vehicles,

    W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems , vol. 1, pp. 187–210, 2018

  20. [20]

    Decision-making technology for autonomous vehicles: Learning-based meth- ods, applications and future outlook,

    Q. Liu, X. Li, S. Yuan, and Z. Li, “Decision-making technology for autonomous vehicles: Learning-based meth- ods, applications and future outlook,” in 2021 IEEE In- ternational Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 30–37

  21. [21]

    Thorough review analysis of safe control of autonomous vehicles: path planning and navigation techniques,

    S. Abdallaoui, E.-H. Aglzim, A. Chaibet, and A. Krib `eche, “Thorough review analysis of safe control of autonomous vehicles: path planning and navigation techniques,” Ener- gies, vol. 15, no. 4, p. 1358, 2022

  22. [22]

    Explainability of deep vision-based autonomous driving systems: Review and challenges,

    ´E. Zablocki, H. Ben-Younes, P. P ´erez, and M. Cord, “Explainability of deep vision-based autonomous driving systems: Review and challenges,” International Journal of Computer Vision (IJCV 2022) , vol. 130, no. 10, pp. 2425– 2452, 2022

  23. [23]

    A survey on safety-critical driving scenario generation—a methodological perspective,

    W. Ding, C. Xu, M. Arief, H. Lin, B. Li, and D. Zhao, “A survey on safety-critical driving scenario generation—a methodological perspective,” IEEE Transactions on Intelli- gent Transportation Systems, 2023

  24. [24]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” 31st Conference on Neural Information Processing Systems (NIPS 2017) , vol. 30, 2017

  25. [25]

    Neural machine translation by jointly learning to align and translate,

    D. Bahdanau, K. H. Cho, and Y . Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations (ICLR), 2015

  26. [26]

    Effective approaches to attention-based neural machine translation,

    M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in The 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421

  27. [27]

    Bert: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in The 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), 2019, pp. 4171–4186

  28. [28]

    Improving language understanding by generative pre- training,

    A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre- training,” 2018

  29. [29]

    An image is worth 16x16 words: Transformers for image recog- nition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recog- nition at scale,” ICLR, 2021

  30. [30]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision (ICCV) , 2021, pp. 10 012–10 022

  31. [31]

    Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,

    Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,” 2023

  32. [32]

    BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework,

    T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y . Wang, T. Tang, B. Wang, and Z. Tang, “BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework,” in Neural Information Processing Systems (NeurIPS) , 2022

  33. [33]

    Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,

    Y . Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743 , 2022

  34. [34]

    Detr3d: 3d object detection from multi- view images via 3d-to-2d queries,

    Y . Wang, V . C. Guizilini, T. Zhang, Y . Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi- view images via 3d-to-2d queries,” in Conference on Robot Learning, 2021, pp. 180–191

  35. [35]

    Futr3d: A unified sensor fusion framework for 3d detec- tion,

    X. Chen, T. Zhang, Y . Wang, Y . Wang, and H. Zhao, “Futr3d: A unified sensor fusion framework for 3d detec- tion,” arXiv preprint arXiv:2203.10642 , 2022

  36. [36]

    Petr: Position em- bedding transformation for multi-view 3d object detection,

    Y . Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position em- bedding transformation for multi-view 3d object detection,” in Computer Vision–ECCV 2022: 17th European Confer- JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2020 17 ence, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII. Springer, 2022, pp. 531–548

  37. [37]

    Petrv2: A unified framework for 3d perception from multi-camera images,

    Y . Liu, J. Yan, F. Jia, S. Li, Q. Gao, T. Wang, X. Zhang, and J. Sun, “Petrv2: A unified framework for 3d perception from multi-camera images,” arXiv preprint arXiv:2206.01256 , 2022

  38. [38]

    Crossdtr: Cross-view and depth-guided transformers for 3d object detection,

    C.-Y . Tseng, Y .-R. Chen, H.-Y . Lee, T.-H. Wu, W.-C. Chen, and W. Hsu, “Crossdtr: Cross-view and depth-guided transformers for 3d object detection,” The 40th IEEE In- ternational Conference on Robotics and Automation (ICRA 2023, 2023

  39. [39]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,

    Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in Computer Vision–ECCV 2022: 17th European Confer- ence, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX. Springer, 2022, pp. 1–18

  40. [40]

    Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision,

    C. Yang, Y . Chen, H. Tian, C. Tao, X. Zhu, Z. Zhang, G. Huang, H. Li, Y . Qiao, L. Lu et al. , “Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision,” arXiv preprint arXiv:2211.10439, 2022

  41. [41]

    Unifying voxel-based representation with transformer for 3d object detection,

    Y . Li, Y . Chen, X. Qi, Z. Li, J. Sun, and J. Jia, “Unifying voxel-based representation with transformer for 3d object detection,” in 36th Conference on Neural Information Pro- cessing Systems (NeurIPS 2022). , 2022

  42. [42]

    Tri- perspective view for vision-based 3d semantic occupancy prediction,

    Y . Huang, W. Zheng, Y . Zhang, J. Zhou, and J. Lu, “Tri- perspective view for vision-based 3d semantic occupancy prediction,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023

  43. [43]

    V oxformer: Sparse voxel transformer for camera-based 3d semantic scene comple- tion,

    Y . Li, Z. Yu, C. Choy, C. Xiao, J. M. Alvarez, S. Fidler, C. Feng, and A. Anandkumar, “V oxformer: Sparse voxel transformer for camera-based 3d semantic scene comple- tion,” in The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) , 2023

  44. [44]

    Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving,

    Y . Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, “Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving,” in The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) , 2023

  45. [45]

    Motr: End-to-end multiple-object tracking with transformer,

    F. Zeng, B. Dong, Y . Zhang, T. Wang, X. Zhang, and Y . Wei, “Motr: End-to-end multiple-object tracking with transformer,” in Computer Vision–ECCV 2022: 17th Euro- pean Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII. Springer, 2022, pp. 659–675

  46. [46]

    Mutr3d: A multi-camera tracking framework via 3d-to-2d queries,

    T. Zhang, X. Chen, Y . Wang, Y . Wang, and H. Zhao, “Mutr3d: A multi-camera tracking framework via 3d-to-2d queries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4537– 4546

  47. [47]

    Bevseg- former: Bird’s eye view semantic segmentation from arbi- trary camera rigs,

    L. Peng, Z. Chen, Z. Fu, P. Liang, and E. Cheng, “Bevseg- former: Bird’s eye view semantic segmentation from arbi- trary camera rigs,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2023, pp. 5935–5943

  48. [49]

    End-to-end lane shape prediction with transformers,

    R. Liu, Z. Yuan, T. Liu, and Z. Xiong, “End-to-end lane shape prediction with transformers,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3694–3702

  49. [50]

    Curveformer: 3d lane detection by curve propagation with curve queries and attention,

    Y . Bai, Z. Chen, Z. Fu, L. Peng, P. Liang, and E. Cheng, “Curveformer: 3d lane detection by curve propagation with curve queries and attention,” IEEE Conference on Robotics and Automation, ICRA 2023 , 2023

  50. [51]

    Translat- ing images into maps,

    A. Saha, O. Mendez, C. Russell, and R. Bowden, “Translat- ing images into maps,” in 2022 International Conference on Robotics and Automation (ICRA) . IEEE, 2022, pp. 9200– 9206

  51. [52]

    Panoptic segformer: Delving deeper into panoptic segmentation with transformers,

    Z. Li, W. Wang, E. Xie, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, and T. Lu, “Panoptic segformer: Delving deeper into panoptic segmentation with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2022, pp. 1280–1289

  52. [53]

    Struc- tured bird’s-eye-view traffic scene understanding from on- board images,

    Y . B. Can, A. Liniger, D. P. Paudel, and L. Van Gool, “Struc- tured bird’s-eye-view traffic scene understanding from on- board images,” in Proceedings of the IEEE/CVF interna- tional conference on computer vision (ICCV) , 2021, pp. 15 661–15 670

  53. [54]

    Vectormapnet: End-to-end vectorized hd map learning,

    Y . Liu, Y . Wang, Y . Wang, and H. Zhao, “Vectormapnet: End-to-end vectorized hd map learning,” arXiv preprint arXiv:2206.08920, 2022

  54. [55]

    Maptr: Structured modeling and learning for online vectorized hd map construction,

    B. Liao, S. Chen, X. Wang, T. Cheng, Q. Zhang, W. Liu, and C. Huang, “Maptr: Structured modeling and learning for online vectorized hd map construction,” in International Conference on Learning Representations , 2023

  55. [56]

    Vectornet: Encoding hd maps and agent dynamics from vectorized representation,

    J. Gao, C. Sun, H. Zhao, Y . Shen, D. Anguelov, C. Li, and C. Schmid, “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 525–11 533

  56. [57]

    Densetnt: End-to-end trajec- tory prediction from dense goal sets,

    J. Gu, C. Sun, and H. Zhao, “Densetnt: End-to-end trajec- tory prediction from dense goal sets,” in Proceedings of the IEEE/CVF international conference on computer vision (ICCV), 2021, pp. 15 303–15 312

  57. [58]

    Mul- timodal motion prediction with stacked transformers,

    Y . Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Mul- timodal motion prediction with stacked transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 7577–7586

  58. [59]

    Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting,

    Y . Yuan, X. Weng, Y . Ou, and K. M. Kitani, “Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , 2021, pp. 9813– 9823

  59. [60]

    Wayformer: Motion forecasting via simple & efficient attention networks,

    N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp, “Wayformer: Motion forecasting via simple & efficient attention networks,” arXiv preprint arXiv:2207.05844, 2022

  60. [61]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving,

    K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger, “Transfuser: Imitation with transformer-based sensor fusion for autonomous driving,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 2022

  61. [62]

    Neat: Neural attention fields for end-to-end autonomous driving,

    K. Chitta, A. Prakash, and A. Geiger, “Neat: Neural attention fields for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF international conference on computer vision (ICCV), 2021

  62. [63]

    Safety-enhanced autonomous driving using interpretable sensor fusion transformer,

    H. Shao, L. Wang, R. Chen, H. Li, and Y. Liu, “Safety-enhanced autonomous driving using interpretable sensor fusion transformer,” in 6th Conference on Robot Learning (CoRL 2022). PMLR, 2022, pp. 726–737

  63. [64]

    Mmfn: Multi-modal-fusion-net for end-to-end driving,

    Q. Zhang, M. Tang, R. Geng, F. Chen, R. Xin, and L. Wang, “Mmfn: Multi-modal-fusion-net for end-to-end driving,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 8638–8643

  64. [65]

    St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,

    S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVIII. Springer, 2022, pp. 533–549

  65. [66]

    Planning-oriented autonomous driving,

    Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y. Qiao, and H. Li, “Planning-oriented autonomous driving,” 2023

  66. [67]

    End-to-end object detection with transformers,

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229

  67. [68]

    Deformable detr: Deformable transformers for end-to-end object detection,

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” Ninth International Conference on Learning Representations (ICLR 2021), 2020

  68. [69]

    Future transformer for long-term action anticipation,

    D. Gong, J. Lee, M. Kim, S. J. Ha, and M. Cho, “Future transformer for long-term action anticipation,” in The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022, pp. 3052–3061

  69. [70]

    Hdmapnet: An online hd map construction and evaluation framework,

    Q. Li, Y. Wang, Y. Wang, and H. Zhao, “Hdmapnet: An online hd map construction and evaluation framework,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 4628–4634

  70. [71]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255

  71. [72]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755

  72. [73]

    nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631

  73. [74]

    End-to-end lane marker detection via row-wise classification,

    S. Yoo, H. S. Lee, H. Myeong, S. Yun, H. Park, J. Cho, and D. H. Kim, “End-to-end lane marker detection via row-wise classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 1006–1007

  74. [75]

    Persformer: 3d lane detection via perspective transformer and the openlane benchmark,

    L. Chen, C. Sima, Y. Li, Z. Zheng, J. Xu, X. Geng, H. Li, C. He, J. Shi, Y. Qiao, and J. Yan, “Persformer: 3d lane detection via perspective transformer and the openlane benchmark,” in European Conference on Computer Vision (ECCV), 2022

  75. [76]

    Argoverse: 3d tracking and forecasting with rich maps,

    M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8748–8757

  76. [77]

    Carla: An open urban driving simulator,

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning. PMLR, 2017, pp. 1–16

  77. [78]

    Tnt: Target-driven trajectory prediction,

    H. Zhao, J. Gao, T. Lan, C. Sun, B. Sapp, B. Varadarajan, Y. Shen, Y. Shen, Y. Chai, C. Schmid et al., “Tnt: Target-driven trajectory prediction,” in Conference on Robot Learning. PMLR, 2021, pp. 895–904

  78. [79]

    Deformable convolutional networks,

    J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764–773

  79. [80]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  80. [81]

    Post-training quantization for vision transformer,

    Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” 35th Conference on Neural Information Processing Systems (NeurIPS 2021), vol. 34, pp. 28092–28103, 2021

Showing first 80 references.