arXiv: 2605.20390 · v1 · pith:E7SOU7PSnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

Pith reviewed 2026-05-21 07:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords scaling laws · 3D perception · autonomous driving · multi-modal fusion · Waymo dataset · Sparse Window Transformer · large models

The pith

Larger models with more data and compute improve 3D perception accuracy for autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether scaling laws observed in other AI domains hold for the specialized task of 3D perception in self-driving cars. It introduces the STELLAR model by extending a Sparse Window Transformer to process inputs from LiDAR, radar, camera, and map priors together. Models are trained on 50 million driving examples at scales reaching 500 million parameters. Experiments show steady gains in performance as model size, dataset volume, and compute increase. The largest configuration sets a new state-of-the-art on the Waymo Open Dataset challenge.

Core claim

Models with up to 500 million parameters, which fuse LiDAR, radar, camera, and map-prior data via an extended Sparse Window Transformer and are trained on a 50-million-example dataset, show measurable scaling trends: detection and tracking performance rises with model size, data volume, and compute. The largest configuration establishes a new state-of-the-art on the Waymo Open Dataset.
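
As a rough illustration of how such a scaling trend can be summarized, the sketch below fits a log-linear curve (loss as a linear function of log10 model size) by ordinary least squares. The data points are synthetic placeholders, not values from the paper, and the helper names `fit_log_linear` and `predict_loss` are hypothetical.

```python
import math


def fit_log_linear(sizes, losses):
    """Least-squares fit of loss ~ intercept + slope * log10(size)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_loss(size, slope, intercept):
    """Evaluate the fitted line at a given model size."""
    return intercept + slope * math.log10(size)


# Synthetic, illustrative values only (NOT taken from the paper).
param_counts = [1e7, 1e8, 1e9]
final_losses = [3.0, 2.5, 2.0]

slope, intercept = fit_log_linear(param_counts, final_losses)
print(f"loss changes by {slope:.3f} per 10x increase in parameters")
```

A fit like this is only a descriptive summary. The figure excerpts reproduced on this page note that the paper does not observe the strong log-linear laws reported for LLMs, so extrapolating such a line beyond the measured range would be speculative.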

What carries the argument

The STELLAR model, which extends the Sparse Window Transformer architecture to jointly process heterogeneous inputs from LiDAR, radar, camera, and map priors.
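
A heavily simplified sketch of the fusion idea, assuming each modality has already been reduced to points carrying a single scalar feature: points are binned into bird's-eye-view (BEV) cells, and the per-cell features from each modality are concatenated, with zeros where a modality has no data. The cell size, the scalar features, and the function names are illustrative assumptions; the camera lift-splat-shoot step and the transformer backbone are omitted entirely.

```python
import math
from collections import defaultdict


def scatter_to_bev(points, cell_size=0.5):
    """Bin (x, y, feature) points into BEV cells; return mean feature per occupied cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, feature in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        sums[cell] += feature
        counts[cell] += 1
    return {cell: [sums[cell] / counts[cell]] for cell in sums}


def concat_modalities(bev_maps, dims):
    """Concatenate per-cell features across modalities, zero-filling missing ones."""
    cells = set()
    for bev in bev_maps:
        cells.update(bev.keys())
    fused = {}
    for cell in cells:
        vector = []
        for bev, dim in zip(bev_maps, dims):
            vector.extend(bev.get(cell, [0.0] * dim))
        fused[cell] = vector
    return fused


# Toy inputs (hypothetical): two LiDAR points share a cell, one radar point joins them.
lidar = [(0.1, 0.1, 1.0), (0.2, 0.3, 3.0), (1.2, 0.1, 5.0)]
radar = [(0.4, 0.4, 10.0)]
fused = concat_modalities([scatter_to_bev(lidar), scatter_to_bev(radar)], dims=[1, 1])
```

Only occupied cells are stored, which loosely mirrors why a sparse backbone is a natural fit for this kind of input.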

If this is right

  • Perception performance in driving scenes improves predictably as model capacity and training resources grow.
  • Multi-modal sensor fusion becomes more effective at larger scales for tasks requiring 3D spatial reasoning.
  • Autonomous driving perception systems may benefit more from collecting additional real-world data than from designing new hand-engineered modules.
  • Very large models could reduce reliance on task-specific architectural innovations in favor of scale-driven improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling approach could be applied to other robotics perception problems that involve multiple sensors and 3D understanding.
  • If the trends continue, practical systems would face higher inference compute demands that may require model compression or specialized hardware.
  • Data collection pipelines for autonomous vehicles could become a central competitive advantage if scale remains the dominant driver of accuracy.

Load-bearing premise

The observed performance gains result primarily from increases in model size, data volume, and compute rather than from unmeasured differences in data curation or training procedures.

What would settle it

A smaller model, trained with carefully matched data curation and hyperparameters, that matches or exceeds the reported Waymo scores of the largest STELLAR configuration.

Figures

Figures reproduced from arXiv: 2605.20390 by Alex Zihao Zhu, Anant Subramanian, Chen Song, Dragomir Anguelov, Govind Thattai, Hao Xiang, Junwen Yao, Mingxing Tan, Tom Hoddes, Weijing Shi, Xin Huang, Yang Fu, Yang Liu, Yingwei Li, Yuliang Zou, Zhaoqi Leng.

Figure 1: STELLAR achieves better 3D detection performance through scaling model parameters and multi-task mid-training on high quality driving data, measured by average L2 APH on the Waymo Open Dataset validation set. The dashed horizontal line represents previous state-of-the-art using up to 4 temporal frames.
Figure 2: Overview of STELLAR, a multi-modal perception model. The model projects LiDAR, radar, and surfel inputs directly into a bird’s-eye-view (BEV) representation, while camera features are mapped to BEV via a lift-splat-shoot (LSS) transformation. These features are subsequently concatenated and processed by a sparse window transformer backbone. Task-specific heads are applied to the unified BEV features to pro…
Figure 3: Model scaling curves. Final loss consistently decreases as model parameter size increases. Log-linear fits are overlaid for each dataset size to illustrate the scaling trend.
Figure 4 illustrates the impact of data scaling on models of varying sizes. We observe a consistent trend that increasing the training example size monotonically reduces the loss for all models, though this benefit exhibits diminishing returns as the loss curves flatten. Similar to our model scaling findings, we do not observe the strong log-linear scaling laws often reported in LLM literature. We attribute this to…
Figure 5: Compute scaling curves. Each dot represents a model size, and each line represents various model sizes training with a given data size. Both large models and larger datasets lead to lower loss. The efficient frontier curve indicates that for a fixed compute FLOPs budget, it is more effective to train a smaller model on a larger dataset than to train a larger model on a small dataset.
Figure 6: Qualitative comparison of the STELLAR-96M (left) and STELLAR-483M (right) models, pre-trained on the full dataset and finetuned on the WOD validation set. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. Compared to the smaller model, the larger model (right) demonstrates superior performance: it successfully detects a pedestrian at the crosswalk (green), yields more acc…
Figure 7: Qualitative comparison of the STELLAR-483M model pre-trained on the 12.8M dataset (left) vs. the full dataset (right) and finetuned on WOD. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. The model trained on more examples exhibits better quality detections, especially in long range (highlighted in orange).
Figure 8: Temporal context ablation across mid-training and fine-tuning. The results reveal that Overall L2 APH consistently improves as the number of finetuning frames increases, regardless of the mid-training frames at (2, 4, 6). Furthermore, longer context in mid-training offers limited benefit when finetuning uses fewer frames.
Figure 9: Qualitative comparison of the STELLAR-96M (left column) and STELLAR-483M (right column) models, pre-trained on the full internal dataset and finetuned on WOD. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. In all three examples, the larger model demonstrates superior performance, achieving higher recall (orange) and predicting more accurate location and size (green), i…
Figure 10: Qualitative comparison of the STELLAR-483M model pre-trained on the 12.8M dataset (left) vs. the full dataset (right) and finetuned on WOD. Ground truth boxes are shown in blue, and predictions (confidence > 0.2) are in red. The model trained on more examples achieves higher recall (orange) and more accurate location (green) in challenging scenarios, involving sparse points, partial occlusions, and crowde…
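
The compute-scaling observation in the Figure 5 caption can be illustrated with a small sketch that extracts an efficient frontier from a set of training runs. It assumes training compute is proportional to parameters times examples, which is a common proxy but not the paper's stated FLOPs accounting. The runs listed are synthetic and chosen only to exercise the code; `Run`, `compute_proxy`, and `efficient_frontier` are hypothetical names.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    params: float    # number of model parameters
    examples: float  # number of training examples seen
    loss: float      # final loss of the run


def compute_proxy(run: Run) -> float:
    # Assumption: cost scales with params * examples; real FLOPs accounting differs.
    return run.params * run.examples


def efficient_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs whose loss beats every run of equal or lower compute."""
    frontier = []
    best_loss = float("inf")
    for run in sorted(runs, key=lambda r: (compute_proxy(r), r.loss)):
        if run.loss < best_loss:
            frontier.append(run)
            best_loss = run.loss
    return frontier


# Synthetic runs (NOT from the paper). The second and third use equal compute.
runs = [
    Run(params=1e8, examples=1e6, loss=3.0),
    Run(params=5e8, examples=2e6, loss=2.6),  # larger model, less data
    Run(params=1e8, examples=1e7, loss=2.4),  # smaller model, more data
    Run(params=5e8, examples=1e7, loss=2.2),
]
frontier = efficient_frontier(runs)
```

In this toy setup the larger-model, less-data run drops off the frontier because a smaller model trained on more data reaches a lower loss at the same proxy compute, which is the shape of the claim in the caption.
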
Original abstract

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents STELLAR, a Sparse Window Transformer model for multi-modal 3D perception in autonomous driving. Inputs are extended to include LiDAR, radar, camera, and map priors. Models with up to 500 million parameters are trained on a dataset of 50 million driving examples. Large-scale experiments are reported to reveal empirical scaling trends connecting performance to model size, data volume, and compute. The authors report new state-of-the-art results on the Waymo Open Dataset, outperforming prior methods by a large margin, and conclude that large-scale training is a promising direction for perception models in this domain.

Significance. If the scaling trends are shown to be robustly attributable to scale rather than confounding factors, the work would demonstrate that scaling laws can be successfully applied to the domain-specific challenges of heterogeneous sensor fusion and 3D spatial understanding in autonomous driving. This could shift research emphasis toward larger models and datasets in the field and provide a concrete benchmark for future scaling studies on standard datasets like Waymo.

major comments (1)
  1. [Large-scale experiments and results sections] The central claim attributes performance gains and SOTA results primarily to large-scale training (model size, data volume, compute). However, the manuscript does not appear to include ablations that hold the data-processing pipeline, sensor-calibration procedure, and fusion architecture fixed while varying only scale factors. Without such controls, the observed trends and Waymo improvements cannot be confidently isolated to scale as opposed to unstated choices in curation or hyperparameters. This is load-bearing for the abstract's attribution of results to 'large-scale training'.
minor comments (1)
  1. [Abstract] The abstract refers to 'outperforming prior arts by a large margin' without specifying the exact metrics (e.g., L2 APH, the metric reported in Figure 1) or numerical deltas; providing these values would strengthen the SOTA claim.
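
One way to make the control requested in the major comment checkable is to generate every run from a single base configuration and verify that only the scale fields differ. The sketch below does that with a frozen dataclass; all field names and values are hypothetical and are not taken from the paper.

```python
from dataclasses import asdict, dataclass, replace
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    # Scale factors: the only fields allowed to vary across the grid.
    model_params: int
    train_examples: int
    # Held-fixed factors (hypothetical placeholder values).
    data_pipeline: str = "pipeline-v1"
    calibration: str = "calib-v1"
    fusion: str = "bev-concat"
    curation_rules: str = "rules-v1"


SCALE_FIELDS = {"model_params", "train_examples"}


def build_grid(base, model_sizes, data_sizes):
    """Cross model sizes with data sizes, copying all other fields from base."""
    return [
        replace(base, model_params=m, train_examples=d)
        for m, d in product(model_sizes, data_sizes)
    ]


def only_scale_varies(configs):
    """True if every non-scale field is identical across all configs."""
    def fixed_part(cfg):
        return {k: v for k, v in asdict(cfg).items() if k not in SCALE_FIELDS}

    reference = fixed_part(configs[0])
    return all(fixed_part(cfg) == reference for cfg in configs)


base = RunConfig(model_params=0, train_examples=0)
grid = build_grid(base, model_sizes=[10, 20], data_sizes=[1, 2, 3])
```

Publishing the output of a check like `only_scale_varies(grid)` alongside the scaling curves would let readers confirm which factors were held constant.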

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to strengthen the presentation of our scaling controls.

Point-by-point responses
  1. Referee: [Large-scale experiments and results sections] The central claim attributes performance gains and SOTA results primarily to large-scale training (model size, data volume, compute). However, the manuscript does not appear to include ablations that hold the data-processing pipeline, sensor-calibration procedure, and fusion architecture fixed while varying only scale factors. Without such controls, the observed trends and Waymo improvements cannot be confidently isolated to scale as opposed to unstated choices in curation or hyperparameters. This is load-bearing for the abstract's attribution of results to 'large-scale training'.

    Authors: We agree that isolating the contribution of scale is essential for the central claim. Our experiments train variants of the same STELLAR Sparse Window Transformer architecture on the identical data-processing pipeline and sensor-calibration procedure. Model size is varied from smaller configurations up to 500M parameters, data volume is varied via controlled subsampling of the 50M-example corpus, and compute is scaled accordingly, with all other factors (including fusion design, hyperparameters for the base architecture, and curation rules) held fixed. The reported scaling trends and Waymo gains are therefore measured under these controls. To address the concern directly, we will add an explicit ablation subsection to the revised Large-scale experiments section that tabulates these fixed factors and reports the isolated scaling curves. We will also update the abstract and results discussion to reference the controls more precisely.

    Revision: yes
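
The rebuttal does not say how the controlled subsampling is performed. One common approach, shown here purely as an assumption, is hash-based selection: it is deterministic and yields nested subsets (every example in the 25% subset also appears in the 50% subset), so data-scale comparisons are not confounded by different random draws. The salt string and the example ID format are hypothetical.

```python
import hashlib


def keep_example(example_id: str, fraction: float, salt: str = "subsample-v1") -> bool:
    """Deterministically decide whether an example belongs to the given fraction."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    # Map the first 4 bytes to a value in [0, 1); exact in floating point.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < fraction


def subsample(example_ids, fraction):
    """Return the examples selected at this fraction, preserving input order."""
    return [e for e in example_ids if keep_example(e, fraction)]


# Hypothetical IDs standing in for driving examples.
example_ids = [f"example-{i}" for i in range(1000)]
quarter = subsample(example_ids, 0.25)
half = subsample(example_ids, 0.5)
```

Because each example's bucket value is fixed, growing the fraction only ever adds examples, which keeps smaller training sets as strict subsets of larger ones.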

Circularity Check

0 steps flagged

No significant circularity; empirical scaling observations on external benchmarks

full rationale

The paper reports results from training a Sparse Window Transformer extended to LiDAR/radar/camera/map inputs on a 50-million-example dataset, with model sizes up to 500M parameters. It presents observed performance trends versus scale and SOTA numbers on the Waymo Open Dataset. These are direct experimental measurements, not quantities derived from parameters or outputs defined in terms of the reported results themselves. No self-definitional equations, fitted-input predictions, or load-bearing self-citations that reduce the central claim to its own inputs appear in the abstract or described structure. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on the background assumption that transformer-based architectures can be extended to heterogeneous sensor inputs and that performance scales predictably with resources; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: Sparse Window Transformer architecture is a suitable base for multi-modal 3D driving perception
    The model is built by extending this architecture to LiDAR, radar, camera, and map inputs.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 10 internal anchors

  1. Sun, Pei; Tan, Mingxing; Wang, Weiyue; Liu, Chenxi; Xia, Fei; Leng, Zhaoqi; Anguelov, Dragomir. 2022.
  2. End-to-end multi-view fusion for 3d object detection in lidar point clouds. Conference on Robot Learning, 2020.
  3. Qi, Charles R; Su, Hao; Mo, Kaichun; Guibas, Leonidas J.
  4. Scene Reconstruction as Mapping Priors for 3D Detection. Proceedings of the IEEE conference on computer vision and pattern recognition.
  5. Lang, Alex H; Vora, Sourabh; Caesar, Holger; Zhou, Lubing; Yang, Jiong; Beijbom, Oscar.
  6. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.
  7. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. European conference on computer vision, 2020.
  8. Duan, Kaiwen; Bai, Song; Xie, Lingxi; Qi, Honggang; Huang, Qingming; Tian, Qi.
  9. Agro, Ben; Casas, Sergio; Wang, Patrick; Gilles, Thomas; Urtasun, Raquel.
  10. Zhang, Gang; Chen, Junnan; Gao, Guohuan; Li, Jianmin; Liu, Si; Hu, Xiaolin.
  11. Wu, Xiaoyang; Jiang, Li; Wang, Peng-Shuai; Liu, Zhijian; Liu, Xihui; Qiao, Yu; Ouyang, Wanli; He, Tong; Zhao, Hengshuang.
  12. Zeid, Karim Abou; Yilmaz, Kadir; de Geus, Daan; Hermans, Alexander; Adrian, David; Linder, Timm; Leibe, Bastian.
  13. Uni-to-multi modal knowledge distillation for bidirectional lidar-camera semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  14. Language models are few-shot learners. Advances in neural information processing systems.
  15. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
  16. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.
  17. Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF international conference on computer vision.
  18. Achiam, Josh; Adler, Steven; Agarwal, Sandhini; Ahmad, Lama; Akkaya, Ilge; Aleman, Florencia Leoni; Almeida, Diogo; Altenschmidt, Janko; Altman, Sam; Anadkat, Shyamal; and others.
  19. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.
  20. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  21. Tian, Xiaoyu; Jiang, Tao; Yun, Longfei; Mao, Yucheng; Yang, Huitong; Wang, Yue; Wang, Yilun; Zhao, Hang.
  22. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  23. Learning transferable visual models from natural language supervision. International conference on machine learning, 2021.
  24. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv preprint arXiv:2205.01917.
  25. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina.
  26. Improving language understanding by generative pre-training. 2018.
  27. Huang, Xin; Wolff, Eric M; Vernaza, Paul; Phan-Minh, Tung; Chen, Hongge; Hayden, David S; Edmonds, Mark; Pierce, Brian; Chen, Xinxin; Jacob, Pratik Elias; and others. 2025.
  28. Scaling Laws of Motion Forecasting and Planning--A Technical Report. arXiv preprint arXiv:2506.08228.
  29. Fan, Lue; Wang, Feng; Wang, Naiyan; Zhang, Zhaoxiang. 2024.
  30. Zhang, Gang; Junnan, Chen; Gao, Guohuan; Li, Jianmin; Hu, Xiaolin.
  31. Center-based 3d object detection and tracking. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  32. nuscenes: A multimodal dataset for autonomous driving. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  33. Scalability in perception for autonomous driving: Waymo open dataset. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  34. Modar: Using motion forecasting for 3d object detection in point cloud sequences. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  35. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  36. Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting. arXiv preprint arXiv:2301.00493.
  37. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.
  38. Team Yaak. https://www.huggingface.com/blog/lerobot-goes-to-driving-school
  39. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
  40. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277.
  41. GSPMD: General and Scalable Parallelization for ML Computation Graphs. arXiv preprint arXiv:2105.04663.
  42. JAX: Autograd and XLA. Astrophysics Source Code Library.
  43. Layer Normalization. 2016.
  44. Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th international conference on machine learning (ICML-10).
  45. XLA: Compiling Machine Learning for Peak Performance.
  46. Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. ICRA.
  47. Embracing single stride 3d object detector with sparse transformer. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  48. Zhou, Zixiang; Zhao, Xiangchen; Wang, Yu; Wang, Panqu; Foroosh, Hassan. 2022.
  49. Fully sparse 3d object detection. Advances in Neural Information Processing Systems.
  50. Liu, Zhe; Hou, Jinghua; Ye, Xiaoqing; Wang, Tong; Wang, Jingdong; Bai, Xiang. 2024.
  51. Liu, Zhe; Hou, Jinghua; Wang, Xinyu; Ye, Xiaoqing; Wang, Jingdong; Zhao, Hengshuang; Bai, Xiang.
  52. Super Sparse 3D Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  53. He, Chenhang; Li, Ruihuang; Zhang, Yabin; Li, Shuai; Zhang, Lei.
  54. Li, Xin; Ma, Tao; Hou, Yuenan; Shi, Botian; Yang, Yuchen; Liu, Youquan; Wu, Xingjiao; Chen, Qin; Li, Yikang; Qiao, Yu; and others.
  55. VADet: Multi-Frame LiDAR 3D Object Detection Using Variable Aggregation. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
  56. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.
  57. Deep networks with stochastic depth. European conference on computer vision, 2016.
  58. Yang, Zhenpei; Chai, Yuning; Anguelov, Dragomir; Zhou, Yin; Sun, Pei; Erhan, Dumitru; Rafferty, Sean; Kretzschmar, Henrik.
  59. Beyond attention: Breaking the limits of transformer context length with recurrent memory. Proceedings of the AAAI Conference on Artificial Intelligence.
  60. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174.
  61. Fitting larger networks into memory. 2018.
  62. Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763.
  63. Scaling laws and compute-optimal training beyond fixed training durations. Advances in Neural Information Processing Systems.
  64. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. Proceedings of the IEEE/CVF international conference on computer vision.
  65. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.
  66. Zhou, Yin; Tuzel, Oncel.
  67. Huang, Junjie; Huang, Guan; Zhu, Zheng; Ye, Yun; Du, Dalong.
  68. Li, Zhiqi; Wang, Wenhai; Li, Hongyang; Xie, Enze; Sima, Chonghao; Lu, Tong; Yu, Qiao; Dai, Jifeng. 2024.
  69. Rethinking imagenet pre-training. Proceedings of the IEEE/CVF international conference on computer vision.
  70. Surfels: Surface elements as rendering primitives. Proceedings of the 27th annual conference on Computer graphics and interactive techniques.
  71. The impact of depth on compositional generalization in transformer language models. Proceedings of Association for Computational Linguistics (ACL).
  72. On offline evaluation of 3d object detection for autonomous driving. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  73. Planning-oriented autonomous driving. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
  74. Wozniak, Maciej K; Govindarajan, Hariprasath; Klingner, Marvin; Maurice, Camille; Kiran, B Ravi; Yogamani, Senthil. 2025.
  75. Masked autoencoder for self-supervised pre-training on lidar point clouds. Proceedings of the IEEE/CVF winter conference on applications of computer vision.
  76. Agro, Ben; Sykora, Quinlan; Casas, Sergio; Gilles, Thomas; Urtasun, Raquel.
  77. Ljungbergh, William; Lilja, Adam; Ling, Adam Tonderski; Lindstr. arXiv preprint arXiv:2503.15672.
  78. Yang, Honghui; Zhang, Sha; Huang, Di; Wu, Xiaoyang; Zhu, Haoyi; He, Tong; Tang, Shixiang; Zhao, Hengshuang; Qiu, Qibo; Lin, Binbin; and others.