STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

Alex Zihao Zhu; Anant Subramanian; Chen Song; Dragomir Anguelov; Govind Thattai; Hao Xiang; Junwen Yao; Mingxing Tan; Tom Hoddes; Weijing Shi

arxiv: 2605.20390 · v1 · pith:E7SOU7PSnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

Pith reviewed 2026-05-21 07:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO

keywords: scaling laws, 3D perception, autonomous driving, multi-modal fusion, Waymo dataset, Sparse Window Transformer, large models

The pith

Larger models with more data and compute improve 3D perception accuracy for autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether scaling laws observed in other AI domains hold for the specialized task of 3D perception in self-driving cars. It introduces the STELLAR model by extending a Sparse Window Transformer to process inputs from LiDAR, radar, camera, and map priors together. Models are trained on 50 million driving examples at scales reaching 500 million parameters. Experiments show steady gains in performance as model size, dataset volume, and compute increase. The largest configuration sets a new state-of-the-art on the Waymo Open Dataset challenge.

Core claim

Training models with up to 500 million parameters on a 50-million-example dataset that fuses LiDAR, radar, camera, and map prior data via an extended Sparse Window Transformer produces measurable scaling trends in which detection and tracking performance rise with greater model size, data volume, and compute, ultimately establishing a new state-of-the-art on the Waymo Open Dataset.

What carries the argument

The STELLAR model, which extends the Sparse Window Transformer architecture to jointly process heterogeneous inputs from LiDAR, radar, camera, and map priors.
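The fusion step can be pictured with a small sketch. It assumes, as the Figure 2 caption describes, that point-like modalities are projected into a shared bird's-eye-view (BEV) grid and concatenated per cell before a windowed backbone sees them. The function names, mean pooling, grid extent, cell size, and window size are all hypothetical choices for illustration, not the paper's implementation; the camera lift-splat-shoot path and the attention layers themselves are omitted.

```python
from collections import defaultdict


def scatter_to_bev(points, cell_size, half_extent):
    """Mean-pool point features into sparse BEV cells.

    points: iterable of (x, y, features), x and y in metres around the ego vehicle.
    Returns {(row, col): pooled_features} for occupied cells only.
    """
    sums = {}
    counts = defaultdict(int)
    for x, y, feat in points:
        if not (-half_extent <= x < half_extent and -half_extent <= y < half_extent):
            continue  # point lies outside the BEV extent
        cell = (int((x + half_extent) // cell_size), int((y + half_extent) // cell_size))
        acc = sums.setdefault(cell, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[cell] += 1
    return {cell: [s / counts[cell] for s in acc] for cell, acc in sums.items()}


def fuse_modalities(bev_maps, dims):
    """Concatenate per-cell features across modalities, zero-filling missing ones."""
    cells = set().union(*(bev.keys() for bev in bev_maps))
    fused = {}
    for cell in cells:
        vec = []
        for bev, dim in zip(bev_maps, dims):
            vec.extend(bev.get(cell, [0.0] * dim))
        fused[cell] = vec
    return fused


def group_into_windows(fused, window):
    """Group occupied cells by the square window they fall in (attention not shown)."""
    windows = defaultdict(list)
    for (row, col), vec in fused.items():
        windows[(row // window, col // window)].append(((row, col), vec))
    return dict(windows)


# Toy inputs (made up): two LiDAR points share a cell, one radar return overlaps it.
lidar = [(1.0, 1.0, [1.0, 0.0]), (1.2, 1.4, [3.0, 2.0]), (-7.5, 3.0, [0.5, 0.5])]
radar = [(1.1, 1.1, [9.0])]
bev_lidar = scatter_to_bev(lidar, cell_size=1.0, half_extent=8.0)
bev_radar = scatter_to_bev(radar, cell_size=1.0, half_extent=8.0)
fused = fuse_modalities([bev_lidar, bev_radar], dims=[2, 1])
windows = group_into_windows(fused, window=4)
print(fused[(9, 9)])    # [2.0, 1.0, 9.0]
print(sorted(windows))  # [(0, 2), (2, 2)]
```

Only occupied cells are stored, which is the property a sparse window backbone exploits: windows with no points cost nothing.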

If this is right

Perception performance in driving scenes improves predictably as model capacity and training resources grow.
Multi-modal sensor fusion becomes more effective at larger scales for tasks requiring 3D spatial reasoning.
Autonomous driving perception systems may benefit more from collecting additional real-world data than from designing new hand-engineered modules.
Very large models could reduce reliance on task-specific architectural innovations in favor of scale-driven improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scaling approach could be applied to other robotics perception problems that involve multiple sensors and 3D understanding.
If the trends continue, practical systems would face higher inference compute demands that may require model compression or specialized hardware.
Data collection pipelines for autonomous vehicles could become a central competitive advantage if scale remains the dominant driver of accuracy.

Load-bearing premise

The observed performance gains result primarily from increases in model size, data volume, and compute rather than from unmeasured differences in data curation or training procedures.

What would settle it

A smaller model trained with carefully matched data curation and hyper-parameters that matches or exceeds the reported Waymo scores of the largest STELLAR configuration.

Figures

Figures reproduced from arXiv: 2605.20390 by Alex Zihao Zhu, Anant Subramanian, Chen Song, Dragomir Anguelov, Govind Thattai, Hao Xiang, Junwen Yao, Mingxing Tan, Tom Hoddes, Weijing Shi, Xin Huang, Yang Fu, Yang Liu, Yingwei Li, Yuliang Zou, Zhaoqi Leng.

**Figure 1.** STELLAR achieves better 3D detection performance through scaling model parameters and multi-task mid-training on high-quality driving data, measured by average L2 APH on the Waymo Open Dataset validation set. The dashed horizontal line represents previous state-of-the-art using up to 4 temporal frames.

**Figure 2.** Overview of STELLAR, a multi-modal perception model. The model projects LiDAR, radar, and surfel inputs directly into a bird's-eye-view (BEV) representation, while camera features are mapped to BEV via a lift-splat-shoot (LSS) transformation. These features are subsequently concatenated and processed by a sparse window transformer backbone. Task-specific heads are applied to the unified BEV features to pro…

**Figure 3.** Model scaling curves. Final loss consistently decreases as model parameter size increases. Log-linear fits are overlaid for each dataset size to illustrate the scaling trend. Accompanying text (Section 6.1, Model Scaling): We first scale STELLAR at different model sizes, by varying the transformer parameters, including hidden dimension size, feed-forward ratio, and number of layers, as sho…

**Figure 4.** Figure 4 illustrates the impact of data scaling on models of varying sizes. We observe a consistent trend that increasing the training example size monotonically reduces the loss for all models, though this benefit exhibits diminishing returns as the loss curves flatten. Similar to our model scaling findings, we do not observe the strong log-linear scaling laws often reported in LLM literature. We attribute this to…

**Figure 5.** Compute scaling curves. Each dot represents a model size, and each line connects the model sizes trained with a given data size. Both larger models and larger datasets lead to lower loss. The efficient frontier curve indicates that, for a fixed compute (FLOPs) budget, it is more effective to train a smaller model on a larger dataset than to train a larger model on a small dataset.

**Figure 6.** Qualitative comparison of the STELLAR-96M (left) and STELLAR-483M (right) models, pre-trained on the full dataset and finetuned on the WOD validation set. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. Compared to the smaller model, the larger model (right) demonstrates superior performance: it successfully detects a pedestrian at the crosswalk (green), yields more acc…

**Figure 7.** Qualitative comparison of the STELLAR-483M model pre-trained on the 12.8M dataset (left) vs. the full dataset (right) and finetuned on WOD. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. The model trained on more examples exhibits better quality detections, especially in long range (highlighted in orange). Adjacent table residue, truncated in the source: Input Modality (LiDAR, Camera, Surfel) vs. Overall L2 APH; ✓ 74.9; ✓ ✓ 75…

**Figure 8.** Temporal context ablation across mid-training and finetuning. The results reveal that Overall L2 APH consistently improves as the number of finetuning frames increases, regardless of the number of mid-training frames (2, 4, or 6). Furthermore, longer context in mid-training offers limited benefit when finetuning uses fewer frames.

**Figure 9.** Qualitative comparison of the STELLAR-96M (left column) and STELLAR-483M (right column) models, pre-trained on the full internal dataset and finetuned on WOD. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. In all three examples, the larger model demonstrates superior performance, achieving higher recall (orange) and predicting more accurate location and size (green), i…

**Figure 10.** Qualitative comparison of the STELLAR-483M model pre-trained on the 12.8M dataset (left) vs. the full dataset (right) and finetuned on WOD. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. The model trained on more examples achieves higher recall (orange) and more accurate location (green) in challenging scenarios, involving sparse points, partial occlusions, and crowde…
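The Figure 3 caption mentions log-linear fits of final loss against parameter count. A minimal sketch of such a fit, assuming the form loss = a + b · log10(N) and ordinary least squares, is given here. The parameter counts and loss values are synthetic placeholders, not numbers from the paper, and the function names are hypothetical. The maximum residual is one simple way to see the kind of flattening the Figure 4 text describes.

```python
import math


def fit_log_linear(sizes, losses):
    """Least-squares fit of loss = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def max_abs_residual(sizes, losses, a, b):
    """Largest gap between observed loss and the fitted line."""
    return max(abs(y - (a + b * math.log10(s))) for s, y in zip(sizes, losses))


# Synthetic example 1: losses that lie exactly on a log-linear trend.
param_counts = [25e6, 100e6, 400e6]
ideal_losses = [2.0 - 0.25 * math.log10(n) for n in param_counts]
a, b = fit_log_linear(param_counts, ideal_losses)
print(round(a, 6), round(b, 6))  # 2.0 -0.25

# Synthetic example 2: losses that flatten out (diminishing returns).
flat_losses = [1.0, 0.8, 0.75]
a_flat, b_flat = fit_log_linear(param_counts, flat_losses)
flat_residual = max_abs_residual(param_counts, flat_losses, a_flat, b_flat)
print(flat_residual > 0.04)  # True: the line no longer describes the curve well
```

A negative slope with small residuals supports a log-linear reading; large residuals at the high end suggest saturation.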

Original abstract

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.
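The link between performance and compute can be made concrete with an efficient-frontier sketch of the kind the Figure 5 caption refers to. It assumes compute is proportional to parameters times training examples, which is a hypothetical proxy and not the paper's FLOPs accounting. All run names and loss values are invented for illustration.

```python
def efficient_frontier(runs):
    """Keep runs that achieve a lower loss than every run with equal or lower compute."""
    frontier = []
    best_loss = float("inf")
    for run in sorted(runs, key=lambda r: (r["compute"], r["loss"])):
        if run["loss"] < best_loss:
            frontier.append(run)
            best_loss = run["loss"]
    return frontier


def make_run(name, params, examples, loss):
    # Hypothetical compute proxy: parameters x examples (arbitrary units).
    return {"name": name, "compute": params * examples, "loss": loss}


# Invented runs: "small" = 100 units of parameters, "large" = 400 units.
runs = [
    make_run("small-1M", 100, 1, 1.00),
    make_run("small-4M", 100, 4, 0.80),
    make_run("large-1M", 400, 1, 0.90),
    make_run("large-4M", 400, 4, 0.70),
    make_run("small-16M", 100, 16, 0.72),
]
frontier = efficient_frontier(runs)
print([r["name"] for r in frontier])  # ['small-1M', 'small-4M', 'large-4M']
```

In this toy data, "small-4M" and "large-1M" cost the same, and the smaller model on more data wins, mirroring the qualitative claim in the caption. Whether that holds at a given budget depends on the measured losses.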

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Scaling a 500M-param multi-modal model on 50M driving examples reaches SOTA on Waymo, though scale attribution needs tighter controls.

The paper's main result is that scaling a multi-modal 3D perception model up to 500 million parameters on 50 million driving examples delivers state-of-the-art performance on the Waymo Open Dataset. They document empirical trends showing better results with more model size, data, and compute.

They do a few things right. Applying the scaling approach to the specific challenges of autonomous driving perception, with its mix of sensors and need for 3D spatial reasoning, fills a gap that 2D or language scaling papers do not cover. Building the model on a Sparse Window Transformer and adding radar and map priors alongside the usual LiDAR and camera inputs shows real attention to the domain. Reporting concrete benchmark wins over prior methods gives readers something tangible to evaluate.

The soft spots center on attribution. The central claim ties the gains to scale, yet the description does not detail ablations that hold data curation, calibration, and fusion architecture constant while changing only the scale factors. Without those, the trends could reflect other differences in how the large run was set up. The lack of reported error bars or split details in the summary also leaves the SOTA margin open to questions about variability.

Readers who work on self-driving perception stacks or who study scaling in applied vision tasks will get the most from this. It is the kind of large empirical study that industry labs might use as a reference point even if they adapt the details. The work deserves a serious referee because the topic matters for safety-critical systems and the scale of the experiments is substantial enough to warrant close examination. I would send it out for review, but with a note to strengthen the controls around what exactly is driving the improvements.

Referee Report

1 major / 1 minor

Summary. The manuscript presents STELLAR, a Sparse Window Transformer model for multi-modal 3D perception in autonomous driving. Inputs are extended to include LiDAR, radar, camera, and map priors. Models with up to 500 million parameters are trained on a dataset of 50 million driving examples. Large-scale experiments are reported to reveal empirical scaling trends connecting performance to model size, data volume, and compute. The authors report new state-of-the-art results on the Waymo Open Dataset, outperforming prior methods by a large margin, and conclude that large-scale training is a promising direction for perception models in this domain.

Significance. If the scaling trends are shown to be robustly attributable to scale rather than confounding factors, the work would demonstrate that scaling laws can be successfully applied to the domain-specific challenges of heterogeneous sensor fusion and 3D spatial understanding in autonomous driving. This could shift research emphasis toward larger models and datasets in the field and provide a concrete benchmark for future scaling studies on standard datasets like Waymo.

major comments (1)

[Large-scale experiments and results sections] The central claim attributes performance gains and SOTA results primarily to large-scale training (model size, data volume, compute). However, the manuscript does not appear to include ablations that hold the data-processing pipeline, sensor-calibration procedure, and fusion architecture fixed while varying only scale factors. Without such controls, the observed trends and Waymo improvements cannot be confidently isolated to scale as opposed to unstated choices in curation or hyperparameters. This is load-bearing for the abstract's attribution of results to 'large-scale training'.

minor comments (1)

[Abstract] The abstract refers to 'outperforming prior arts by a large margin' without specifying the exact metrics (e.g., L2 APH, the Waymo Open Dataset metric used in the figure captions) or numerical deltas; providing these values would strengthen the SOTA claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to strengthen the presentation of our scaling controls.

Referee: [Large-scale experiments and results sections] The central claim attributes performance gains and SOTA results primarily to large-scale training (model size, data volume, compute). However, the manuscript does not appear to include ablations that hold the data-processing pipeline, sensor-calibration procedure, and fusion architecture fixed while varying only scale factors. Without such controls, the observed trends and Waymo improvements cannot be confidently isolated to scale as opposed to unstated choices in curation or hyperparameters. This is load-bearing for the abstract's attribution of results to 'large-scale training'.

Authors: We agree that isolating the contribution of scale is essential for the central claim. Our experiments train variants of the same STELLAR Sparse Window Transformer architecture on the identical data-processing pipeline and sensor-calibration procedure. Model size is varied from smaller configurations to 500M parameters, data volume is varied via controlled subsampling of the 50M-example corpus, and compute is scaled accordingly, with all other factors (including fusion design, hyperparameters for the base architecture, and curation rules) held fixed. The reported scaling trends and Waymo gains are therefore measured under these controls. To address the concern directly, we will add an explicit ablation subsection in the revised Large-scale experiments section that tabulates these fixed factors and reports the isolated scaling curves. This revision will also update the abstract and results discussion to reference the controls more precisely.

revision: yes
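One way to express the controls described in this exchange is an experiment grid in which only model size and data fraction vary, with nested, deterministic subsampling so that every smaller dataset is a subset of the larger one. The sketch below is an assumption about how that could be set up, not the authors' tooling. The model labels and the 12.8M subset size come from the figure captions; treating the full corpus as 50M examples, the hashing scheme, and every value in `FIXED` are hypothetical.

```python
import hashlib
from itertools import product

# Factors held constant across all runs (placeholder values, hypothetical).
FIXED = {
    "architecture": "sparse-window-transformer",
    "modalities": ("lidar", "radar", "camera", "map_prior"),
    "curation_rules": "v1",
    "calibration": "shared",
}

MODEL_SIZES = ("96M", "483M")
DATA_FRACTIONS = (0.256, 1.0)  # 12.8M of an assumed 50M examples, and the full corpus


def in_subset(example_id, fraction):
    """Deterministic membership test: hash the ID and compare with a threshold.

    Because the threshold grows with `fraction`, smaller subsets are nested
    inside larger ones, so data scale is the only thing that changes.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < fraction * 2**64


def build_grid():
    """Cross model sizes with data fractions; every run shares the same FIXED dict."""
    return [
        {"model": model, "data_fraction": fraction, "fixed": FIXED}
        for model, fraction in product(MODEL_SIZES, DATA_FRACTIONS)
    ]


example_ids = [f"example-{i}" for i in range(1000)]
small = {e for e in example_ids if in_subset(e, 0.256)}
full = {e for e in example_ids if in_subset(e, 1.0)}
grid = build_grid()
print(len(grid), small <= full)  # 4 True
```

Tabulating `FIXED` alongside the scaling curves is the kind of explicit control listing the referee asks for.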

Circularity Check

0 steps flagged

No significant circularity; empirical scaling observations on external benchmarks

The paper reports results from training a Sparse Window Transformer extended to LiDAR/radar/camera/map inputs on a 50-million-example dataset, with model sizes up to 500M parameters. It presents observed performance trends versus scale and SOTA numbers on the Waymo Open Dataset. These are direct experimental measurements, not quantities derived from parameters or outputs defined in terms of the reported results themselves. No self-definitional equations, fitted-input predictions, or load-bearing self-citations that reduce the central claim to its own inputs appear in the abstract or described structure. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on the background assumption that transformer-based architectures can be extended to heterogeneous sensor inputs and that performance scales predictably with resources; no new physical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: The Sparse Window Transformer architecture is a suitable base for multi-modal 3D driving perception.
The model is built by extending this architecture to LiDAR, radar, camera, and map inputs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear

We develop our STELLAR model based on Sparse Window Transformer... train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear

Figure 3. Model scaling curves... Log-linear fits... diminishing returns... Figure 4. Data scaling curves.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 10 internal anchors

[1]

2022 , organization=

Sun, Pei and Tan, Mingxing and Wang, Weiyue and Liu, Chenxi and Xia, Fei and Leng, Zhaoqi and Anguelov, Dragomir , booktitle=. 2022 , organization=

work page 2022
[2]

Conference on Robot Learning , pages=

End-to-end multi-view fusion for 3d object detection in lidar point clouds , author=. Conference on Robot Learning , pages=. 2020 , organization=

work page 2020
[3]

Qi, Charles R and Su, Hao and Mo, Kaichun and Guibas, Leonidas J , booktitle=

work page
[4]

Proceedings of the IEEE conference on computer vision and pattern recognition , year=

Scene Reconstruction as Mapping Priors for 3D Detection , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , year=

work page
[5]

Lang, Alex H and Vora, Sourabh and Caesar, Holger and Zhou, Lubing and Yang, Jiong and Beijbom, Oscar , booktitle=

work page
[6]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[7]

European conference on computer vision , pages=

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d , author=. European conference on computer vision , pages=. 2020 , organization=

work page 2020
[8]

Duan, Kaiwen and Bai, Song and Xie, Lingxi and Qi, Honggang and Huang, Qingming and Tian, Qi , booktitle=

work page
[9]

Agro, Ben and Casas, Sergio and Wang, Patrick and Gilles, Thomas and Urtasun, Raquel , booktitle=

work page
[10]

Zhang, Gang and Chen, Junnan and Gao, Guohuan and Li, Jianmin and Liu, Si and Hu, Xiaolin , booktitle=

work page
[11]

Wu, Xiaoyang and Jiang, Li and Wang, Peng-Shuai and Liu, Zhijian and Liu, Xihui and Qiao, Yu and Ouyang, Wanli and He, Tong and Zhao, Hengshuang , booktitle=

work page
[12]

Zeid, Karim Abou and Yilmaz, Kadir and de Geus, Daan and Hermans, Alexander and Adrian, David and Linder, Timm and Leibe, Bastian , journal=

work page
[13]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Uni-to-multi modal knowledge distillation for bidirectional lidar-camera semantic segmentation , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page
[14]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[15]

Scaling Laws for Neural Language Models

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[16]

Training Compute-Optimal Large Language Models

Training compute-optimal large language models , author=. arXiv preprint arXiv:2203.15556 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[18]

Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and Aleman, Florencia Leoni and Almeida, Diogo and Altenschmidt, Janko and Altman, Sam and Anadkat, Shyamal and others , journal=

work page
[19]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[21]

Tian, Xiaoyu and Jiang, Tao and Yun, Longfei and Mao, Yucheng and Yang, Huitong and Wang, Yue and Wang, Yilun and Zhao, Hang , journal=

work page
[22]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[24]

CoCa: Contrastive Captioners are Image-Text Foundation Models

Coca: Contrastive captioners are image-text foundation models , author=. arXiv preprint arXiv:2205.01917 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle=

work page
[26]

2018 , journal=

Improving language understanding by generative pre-training , author=. 2018 , journal=

work page 2018
[27]

2025 , organization=

Huang, Xin and Wolff, Eric M and Vernaza, Paul and Phan-Minh, Tung and Chen, Hongge and Hayden, David S and Edmonds, Mark and Pierce, Brian and Chen, Xinxin and Jacob, Pratik Elias and others , booktitle=. 2025 , organization=

work page 2025
[28]

arXiv preprint arXiv:2506.08228 , year=

Scaling Laws of Motion Forecasting and Planning--A Technical Report , author=. arXiv preprint arXiv:2506.08228 , year=

work page arXiv
[29]

2024 , publisher=

Fan, Lue and Wang, Feng and Wang, Naiyan and Zhang, Zhaoxiang , journal=. 2024 , publisher=

work page 2024
[30]

Zhang, Gang and Junnan, Chen and Gao, Guohuan and Li, Jianmin and Hu, Xiaolin , journal=

work page
[31]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Center-based 3d object detection and tracking , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[32]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

nuscenes: A multimodal dataset for autonomous driving , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[33]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Scalability in perception for autonomous driving: Waymo open dataset , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[34]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Modar: Using motion forecasting for 3d object detection in point cloud sequences , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[35]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[36]

Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

Argoverse 2: Next generation datasets for self-driving perception and forecasting , author=. arXiv preprint arXiv:2301.00493 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Towards learning-based planning: The nuplan benchmark for real-world autonomous driving , author=. 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2024 , organization=

work page 2024
[38]

https://www.huggingface.com/blog/lerobot-goes-to-driving-school , year =

Team Yaak , title =. https://www.huggingface.com/blog/lerobot-goes-to-driving-school , year =

work page
[39]

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=

work page 2020
[40]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel , author=. arXiv preprint arXiv:2304.11277 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

GSPMD: General and Scalable Parallelization for ML Computation Graphs

GSPMD: General and Scalable Parallelization for ML Computation Graphs , author=. arXiv preprint arXiv:2105.04663 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Astrophysics Source Code Library , pages=

JAX: Autograd and XLA , author=. Astrophysics Source Code Library , pages=

work page
[43]

2016 , eprint=

Layer Normalization , author=. 2016 , eprint=

work page 2016
[44]

Proceedings of the 27th international conference on machine learning (ICML-10) , pages=

Rectified linear units improve restricted boltzmann machines , author=. Proceedings of the 27th international conference on machine learning (ICML-10) , pages=

work page
[45]

XLA : Compiling Machine Learning for Peak Performance ,author =

work page
[46]

ICRA , year=

Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation , author=. ICRA , year=

work page
[47]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Embracing single stride 3d object detector with sparse transformer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[48]

2022 , organization=

Zhou, Zixiang and Zhao, Xiangchen and Wang, Yu and Wang, Panqu and Foroosh, Hassan , booktitle=. 2022 , organization=

work page 2022
[49]

Advances in Neural Information Processing Systems , volume=

Fully sparse 3d object detection , author=. Advances in Neural Information Processing Systems , volume=

work page
[50]

2024 , organization=

Liu, Zhe and Hou, Jinghua and Ye, Xiaoqing and Wang, Tong and Wang, Jingdong and Bai, Xiang , booktitle=. 2024 , organization=

work page 2024
[51]

Liu, Zhe and Hou, Jinghua and Wang, Xinyu and Ye, Xiaoqing and Wang, Jingdong and Zhao, Hengshuang and Bai, Xiang , journal=

work page
[52]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Super Sparse 3D Object Detection , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

work page
[53]

He, Chenhang and Li, Ruihuang and Zhang, Yabin and Li, Shuai and Zhang, Lei , booktitle=

work page
[54]

Li, Xin and Ma, Tao and Hou, Yuenan and Shi, Botian and Yang, Yuchen and Liu, Youquan and Wu, Xingjiao and Chen, Qin and Li, Yikang and Qiao, Yu and others , booktitle=

work page
[55]

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=

VADet: Multi-Frame LiDAR 3D Object Detection Using Variable Aggregation , author=. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=. 2025 , organization=

work page 2025
[56]

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Large batch optimization for deep learning: Training bert in 76 minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page internal anchor Pith review arXiv 1904
[57]

European conference on computer vision , pages=

Deep networks with stochastic depth , author=. European conference on computer vision , pages=. 2016 , organization=

work page 2016
[58]

Yang, Zhenpei and Chai, Yuning and Anguelov, Dragomir and Zhou, Yin and Sun, Pei and Erhan, Dumitru and Rafferty, Sean and Kretzschmar, Henrik , booktitle=

work page
[59]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Beyond attention: Breaking the limits of transformer context length with recurrent memory , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[60]

Training Deep Nets with Sublinear Memory Cost

Training deep nets with sublinear memory cost , author=. arXiv preprint arXiv:1604.06174 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

2018 , institution =

Fitting larger networks into memory , author =. 2018 , institution =

work page 2018
[62]

arXiv preprint arXiv:2403.08763 , year=

Simple and scalable strategies to continually pre-train large language models , author=. arXiv preprint arXiv:2403.08763 , year=

work page arXiv
[63]

Advances in Neural Information Processing Systems , volume=

Scaling laws and compute-optimal training beyond fixed training durations , author=. Advances in Neural Information Processing Systems , volume=

work page
[64]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Exploring object-centric temporal modeling for efficient multi-view 3d object detection , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[65] Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[66] Zhou, Yin; Tuzel, Oncel.

[67] Huang, Junjie; Huang, Guan; Zhu, Zheng; Ye, Yun; Du, Dalong.

[68] Li, Zhiqi; Wang, Wenhai; Li, Hongyang; Xie, Enze; Sima, Chonghao; Lu, Tong; Yu, Qiao; Dai, Jifeng. 2024.

[69] Rethinking ImageNet pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[70] Surfels: Surface elements as rendering primitives. Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques.

[71] The impact of depth on compositional generalization in transformer language models. Proceedings of Association for Computational Linguistics (ACL).

[72] On offline evaluation of 3D object detection for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[73] Planning-oriented autonomous driving. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[74] Wozniak, Maciej K; Govindarajan, Hariprasath; Klingner, Marvin; Maurice, Camille; Kiran, B Ravi; Yogamani, Senthil. 2025.

[75] Masked autoencoder for self-supervised pre-training on LiDAR point clouds. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.

[76] Agro, Ben; Sykora, Quinlan; Casas, Sergio; Gilles, Thomas; Urtasun, Raquel.

[77] Ljungbergh, William; Lilja, Adam; Ling, Adam Tonderski; Lindstr. arXiv preprint arXiv:2503.15672, 2025.

[78] Yang, Honghui; Zhang, Sha; Huang, Di; Wu, Xiaoyang; Zhu, Haoyi; He, Tong; Tang, Shixiang; Zhao, Hengshuang; Qiu, Qibo; Lin, Binbin; et al.

