STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
Pith reviewed 2026-05-21 07:05 UTC · model grok-4.3
The pith
Larger models with more data and compute improve 3D perception accuracy for autonomous driving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models with up to 500 million parameters, built on an extended Sparse Window Transformer that fuses LiDAR, radar, camera, and map-prior inputs, are trained on a 50-million-example dataset. Detection and tracking performance rises with model size, data volume, and compute, and the largest model sets a new state of the art on the Waymo Open Dataset.
What carries the argument
The STELLAR model, which extends the Sparse Window Transformer architecture to jointly process heterogeneous inputs from LiDAR, radar, camera, and map priors.
If this is right
- Perception performance in driving scenes improves predictably as model capacity and training resources grow.
- Multi-modal sensor fusion becomes more effective at larger scales for tasks requiring 3D spatial reasoning.
- Autonomous driving perception systems may benefit more from collecting additional real-world data than from designing new hand-engineered modules.
- Very large models could reduce reliance on task-specific architectural innovations in favor of scale-driven improvements.
Where Pith is reading between the lines
- The same scaling approach could be applied to other robotics perception problems that involve multiple sensors and 3D understanding.
- If the trends continue, practical systems would face higher inference compute demands that may require model compression or specialized hardware.
- Data collection pipelines for autonomous vehicles could become a central competitive advantage if scale remains the dominant driver of accuracy.
Load-bearing premise
The observed performance gains result primarily from increases in model size, data volume, and compute rather than from unmeasured differences in data curation or training procedures.
What would settle it
A smaller model trained with carefully matched data curation and hyper-parameters that matches or exceeds the reported Waymo scores of the largest STELLAR configuration.
Original abstract
Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.
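To make the notion of a log-linear scaling trend concrete, the sketch below fits score ≈ intercept + slope · log10(scale) by ordinary least squares. It is illustrative only: the function names and the three data points are placeholders invented for this example, not code or results from the paper, and the paper's own fitting procedure is not described on this page.

```python
import math


def fit_log_linear(scales, scores):
    """Least-squares fit of score ~= intercept + slope * log10(scale)."""
    if len(scales) != len(scores) or len(scales) < 2:
        raise ValueError("need at least two (scale, score) pairs of equal length")
    xs = [math.log10(s) for s in scales]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("scales must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, scale):
    """Evaluate the fitted trend at a given scale (e.g. a parameter count)."""
    return intercept + slope * math.log10(scale)


# Synthetic placeholder points, NOT measurements from the paper.
param_counts = [1e6, 1e7, 1e8]
scores = [60.0, 65.0, 70.0]

intercept, slope = fit_log_linear(param_counts, scores)
print(f"slope per decade: {slope:.3f}, intercept: {intercept:.3f}")
```

A constant slope per decade of scale is what a log-linear trend means in practice. Residuals that grow at the largest scales would indicate diminishing returns relative to the fitted line.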
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents STELLAR, a Sparse Window Transformer model for multi-modal 3D perception in autonomous driving. Inputs are extended to include LiDAR, radar, camera, and map priors. Models with up to 500 million parameters are trained on a dataset of 50 million driving examples. Large-scale experiments are reported to reveal empirical scaling trends connecting performance to model size, data volume, and compute. The authors claim new state-of-the-art results on the Waymo Open Dataset, outperforming prior methods by a large margin, and conclude that large-scale training is a promising direction for perception models in this domain.
Significance. If the scaling trends are shown to be robustly attributable to scale rather than confounding factors, the work would demonstrate that scaling laws can be successfully applied to the domain-specific challenges of heterogeneous sensor fusion and 3D spatial understanding in autonomous driving. This could shift research emphasis toward larger models and datasets in the field and provide a concrete benchmark for future scaling studies on standard datasets like Waymo.
major comments (1)
- [Large-scale experiments and results sections] The central claim attributes performance gains and SOTA results primarily to large-scale training (model size, data volume, compute). However, the manuscript does not appear to include ablations that hold the data-processing pipeline, sensor-calibration procedure, and fusion architecture fixed while varying only scale factors. Without such controls, the observed trends and Waymo improvements cannot be confidently isolated to scale as opposed to unstated choices in curation or hyperparameters. This is load-bearing for the abstract's attribution of results to 'large-scale training'.
minor comments (1)
- [Abstract] The abstract refers to 'outperforming prior arts by a large margin' without specifying the exact metrics (e.g., mAP or mAPH, the standard Waymo Open Dataset detection metrics) or numerical deltas; providing these values would strengthen the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to strengthen the presentation of our scaling controls.
Referee: [Large-scale experiments and results sections] The central claim attributes performance gains and SOTA results primarily to large-scale training (model size, data volume, compute). However, the manuscript does not appear to include ablations that hold the data-processing pipeline, sensor-calibration procedure, and fusion architecture fixed while varying only scale factors. Without such controls, the observed trends and Waymo improvements cannot be confidently isolated to scale as opposed to unstated choices in curation or hyperparameters. This is load-bearing for the abstract's attribution of results to 'large-scale training'.
Authors: We agree that isolating the contribution of scale is essential for the central claim. Our experiments train variants of the same STELLAR Sparse Window Transformer architecture on the identical data-processing pipeline and sensor-calibration procedure. Model size is varied from smaller configurations to 500M parameters, data volume is varied via controlled subsampling of the 50M-scene corpus, and compute is scaled accordingly, with all other factors (including fusion design, hyperparameters for the base architecture, and curation rules) held fixed. The reported scaling trends and Waymo gains are therefore measured under these controls. To address the concern directly, we will add an explicit ablation subsection in the revised Large-scale experiments section that tabulates these fixed factors and reports the isolated scaling curves. This revision will also update the abstract and results discussion to reference the controls more precisely.
Revision: yes
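The control scheme described in this response, where one scale factor varies while the pipeline, calibration, and fusion design stay fixed, can be sketched as a one-factor-at-a-time run grid. Everything in the sketch is hypothetical: `RunConfig`, its field names, and the sweep values are placeholders for illustration, not the authors' actual configurations.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    # Scale factors that an ablation is allowed to vary.
    params_millions: int
    data_fraction: float
    # Factors held fixed across all runs (placeholder values).
    pipeline_version: str = "fixed-v1"
    calibration: str = "fixed"
    fusion: str = "lidar+radar+camera+map"


def one_factor_sweeps(base, sweeps):
    """Return the base run plus runs that change exactly one field of it."""
    runs = [base]
    for field_name, values in sweeps.items():
        for value in values:
            candidate = replace(base, **{field_name: value})
            if candidate != base:  # skip values identical to the base run
                runs.append(candidate)
    return runs


def changed_fields(base, run):
    """List the field names whose values differ between two configs."""
    return [f.name for f in fields(base)
            if getattr(base, f.name) != getattr(run, f.name)]


# Placeholder sweep values, NOT the configurations used in the paper.
base = RunConfig(params_millions=500, data_fraction=1.0)
sweeps = {
    "params_millions": [50, 150, 500],
    "data_fraction": [0.1, 0.5, 1.0],
}
runs = one_factor_sweeps(base, sweeps)
for run in runs:
    print(run, "changed:", changed_fields(base, run))
```

Checking that every run differs from the base in at most one field is the property the referee asks for. A table of such runs makes it easy to audit that no fixed factor drifted between scale points.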
Circularity Check
No significant circularity; empirical scaling observations on external benchmarks
The paper reports results from training a Sparse Window Transformer extended to LiDAR/radar/camera/map inputs on a 50-million-example dataset, with model sizes up to 500M parameters. It presents observed performance trends versus scale and SOTA numbers on the Waymo Open Dataset. These are direct experimental measurements, not quantities derived from parameters or outputs defined in terms of the reported results themselves. No self-definitional equations, fitted-input predictions, or load-bearing self-citations that reduce the central claim to its own inputs appear in the abstract or described structure. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Sparse Window Transformer architecture is a suitable base for multi-modal 3D driving perception
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We develop our STELLAR model based on Sparse Window Transformer... train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends..."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Figure 3. Model scaling curves... Log-linear fits... diminishing returns... Figure 4. Data scaling curves."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.