pith. machine review for the scientific record.

arxiv: 2605.10388 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.RO

Recognition: 2 theorem links · Lean Theorem

Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords temporal sampling frequency · end-to-end driving · trajectory prediction · capacity-aware · autonomous driving · model size · Waymo · nuScenes

The pith

The temporal sampling frequency of training data affects end-to-end driving trajectory prediction performance in a way that depends on model capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the assumption that training end-to-end autonomous driving models on the highest available frame rate always yields better results. It creates controlled variants of standard datasets by temporally subsampling camera frames along each trajectory and retrains the same models at different frequencies under fixed protocols. From a capacity-aware viewpoint, dense sampling can add redundant visual content that burdens finite-capacity models, while sparse sampling may omit useful cues. Experiments across three datasets reveal that smaller models often reach peak accuracy at lower or intermediate frequencies, whereas a larger VLA-style model improves with the densest sampling. The findings indicate that sampling frequency should be treated as a tunable variable rather than fixed at its maximum.

Core claim

Treating temporal sampling frequency as an explicit training-set design variable, the authors construct frequency-sweep datasets from high-frequency sources and show model-dependent responses: smaller E2E models exhibit non-monotonic or plateau trends, with their best 3-second ADE at lower or intermediate rates, while AutoVLA achieves its lowest ADE and FDE at the highest frequency on Waymo, nuScenes, and PAVE. Iteration-matched controls suggest that unequal update counts are not the sole explanation.
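For reference, ADE and FDE are plain displacement metrics. A minimal sketch, assuming predicted and ground-truth (x, y) waypoints sampled at the same future timestamps; the function name and shapes are illustrative, not the paper's code:

    import numpy as np

    def ade_fde(pred_xy, gt_xy):
        # Average / Final Displacement Error over a prediction horizon.
        # pred_xy, gt_xy: arrays of shape (T, 2) holding (x, y) waypoints
        # at the same T future timestamps, e.g. a 3-second horizon.
        errs = np.linalg.norm(pred_xy - gt_xy, axis=-1)  # per-step L2 distance
        return errs.mean(), errs[-1]                     # (ADE, FDE)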

What carries the argument

Frequency-sweep training sets generated by temporal subsampling of camera frames along trajectories, used to isolate how sampling density trades off missing cues against redundant visual noise in models of varying capacity.
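A minimal sketch of what such temporal subsampling could look like, assuming uniformly timestamped camera frames and a target frequency below the native rate; the stride logic and names are hypothetical rather than the authors' pipeline:

    import numpy as np

    def subsample_frames(timestamps_s, native_hz, target_hz):
        # Pick frame indices that approximate a lower sampling frequency.
        # timestamps_s: sorted camera-frame timestamps in seconds.
        if target_hz >= native_hz:
            return np.arange(len(timestamps_s))          # keep every frame
        stride_s = 1.0 / target_hz
        kept, next_t = [], timestamps_s[0]
        for i, t in enumerate(timestamps_s):
            if t + 1e-9 >= next_t:                       # first frame at/after each tick
                kept.append(i)
                next_t += stride_s
        return np.asarray(kept)

    # Example: thin a 10 Hz, 2 s window down to a 2 Hz training variant.
    ts = np.arange(0.0, 2.0, 0.1)                        # 20 frames at 10 Hz
    print(subsample_frames(ts, native_hz=10, target_hz=2))   # [ 0  5 10 15]

Per the paper's Figure 3, the future ego-trajectory targets stay fixed across variants; only the density of selected camera frames changes.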

If this is right

  • Sampling frequency must be reported and tuned for each model-dataset combination instead of defaulting to the maximum available rate.
  • Smaller models can achieve better accuracy and efficiency by training on sparser frame sequences.
  • Iteration-matched controls suggest that frequency effects persist beyond differences in total training updates.
  • Dataset creators should consider releasing multiple temporal resolutions to support capacity-aware training.
  • Reproducibility standards in end-to-end driving research should include explicit specification of the sampling frequency used during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The capacity-aware trade-off may extend to other video-based prediction tasks such as action recognition or robotic motion forecasting.
  • Future models might include internal mechanisms to dynamically ignore redundant frames and reduce effective capacity burden.
  • Studying which specific visual cues are lost or gained across frequencies could guide targeted data augmentation or sensor design.
  • Similar frequency-tuning considerations could apply to other high-volume temporal data domains like surveillance or medical imaging sequences.

Load-bearing premise

That temporal subsampling of frames isolates the pure effect of sampling frequency without altering the underlying data distribution or introducing artifacts that affect model training.

What would settle it

Retraining the same models on low-frequency data that has been deliberately padded with redundant or noisy frames to match the visual burden of dense sampling, then checking whether its advantage over high-frequency training disappears.
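A hypothetical version of that control, assuming frames are arrays of normalized pixel values: pad a sparse sequence with noise-jittered duplicates until it matches the frame count of the dense variant, adding visual burden without new driving-relevant content. The names and noise model here are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)

    def pad_with_redundant_frames(frames, target_len, noise_std=0.01):
        # frames: array of shape (T, H, W, C), values in [0, 1].
        # Duplicates get small Gaussian noise so they are redundant
        # rather than informative.
        out = list(frames)
        while len(out) < target_len:
            src = frames[rng.integers(len(frames))]
            out.append(np.clip(src + rng.normal(0.0, noise_std, src.shape), 0.0, 1.0))
        return np.stack(out)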

Figures

Figures reproduced from arXiv: 2605.10388 by Jiaxiang Li, Ke Ma, Tao Liu, Xiangyu Li, Yumao Liu.

Figure 1: Capacity-aware view of temporal sampling frequency. Sparse temporal sampling may miss driving-relevant information. Dense temporal sampling can improve coverage of driving-relevant information, but it can also introduce redundant visual content and driving-irrelevant off-manifold noise. With finite model capacity, the best temporal sampling frequency may therefore vary across datasets and models. Our contr… view at source ↗

Figure 2: Toy illustration of capacity-aware frequency responses. (a) Noise-corrupted temporal BEV images are used as inputs to the toy trajectory-prediction models. (b) Frequency responses are shown across model capacities. Model width W controls capacity: W = 16, 48, and 64 correspond to approximately 0.3M, 2.6M, and 4.7M parameters, respectively. The best frequency changes from 7 Hz for W = 16 to 9 Hz for W = 48 … view at source ↗

Figure 3: Experimental pipeline. Starting from E2E datasets with high native camera-frame sampling frequency, we construct frequency-sweep training sets by temporal subsampling. This changes the density of selected camera frames. The training-sample format is kept fixed. The future ego-trajectory targets are also kept fixed. Each model is trained under a fixed protocol. Each model is evaluated on the same validation… view at source ↗

Figure 4: Architecture overview. E2EDriver, BEV-E2EDriver, and Tiny-SSR are smaller E2E trajectory-prediction models with different scene representations and trajectory decoders. AutoVLA is a larger VLA-style model that predicts future ego-trajectory targets through an autoregressive action-token interface. Within each frequency sweep, the model architecture, training-sample format, command representation, and predi… view at source ↗

Figure 5: 3-second ADE frequency-response across all evaluated models. The curves summarize E2EDriver, BEV-E2EDriver, Tiny-SSR, and AutoVLA across the evaluated temporal sampling frequencies. The smaller E2E models show non-monotonic or near-plateau frequency-response, whereas AutoVLA tends to improve toward the highest evaluated frequency. These results show that the best temporal sampling frequency depends on both… view at source ↗
read the original abstract

End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper treats temporal sampling frequency as an explicit design variable for end-to-end driving trajectory prediction. Starting from high-frequency datasets (Waymo, nuScenes, PAVE), it constructs frequency-swept training sets by temporally subsampling camera frames along trajectories, then trains and evaluates the same models (three smaller E2E models and larger AutoVLA) under a fixed protocol. Results show model- and dataset-dependent responses: smaller models frequently exhibit non-monotonic or plateaued trends with best 3-second ADE at lower or intermediate frequencies, while AutoVLA achieves its best ADE and FDE at the highest frequency on all datasets. Iteration-matched controls are invoked to argue that the advantage of sparser sampling for smaller models is not solely due to unequal update counts.

Significance. If the frequency-response patterns survive controls that also equalize total frame count and temporal diversity, the work would usefully demonstrate that sampling rate is a tunable capacity-aware hyperparameter rather than a default-to-maximum choice. This could affect dataset construction practices and training efficiency for resource-constrained E2E driving models.

major comments (2)
  1. [Experimental controls and iteration matching] The iteration-matched controls (abstract and experimental description) equalize training update counts but leave total data volume and number of distinct temporal contexts strictly smaller at lower frequencies. Because each trajectory contributes fewer frames, the effective dataset size and visual diversity decrease; this alternative explanation for intermediate-frequency gains is not ruled out by the reported controls and directly affects the capacity-burden interpretation.
  2. [Evaluation protocol] Evaluation uses 3-second ADE/FDE computed on the original dense trajectory points (abstract). When training data are subsampled, the supervision signal density changes while the test metric remains dense; this mismatch could itself produce apparent optima at intermediate frequencies and should be quantified or controlled.
minor comments (2)
  1. [Results] The manuscript would benefit from explicit reporting of the exact frequency values tested, the number of trajectories per frequency, and any statistical tests (e.g., confidence intervals or significance) on the observed non-monotonic trends.
  2. [Dataset construction] Clarify whether subsampling is performed uniformly or with any anti-aliasing / interpolation step, and whether the same random seed and data augmentation are used across frequency variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful review of our manuscript. Their comments have helped us identify areas where additional controls can strengthen our claims regarding the role of temporal sampling frequency in end-to-end driving models. We provide detailed responses to each major comment below.

read point-by-point responses
  1. Referee: [Experimental controls and iteration matching] The iteration-matched controls (abstract and experimental description) equalize training update counts but leave total data volume and number of distinct temporal contexts strictly smaller at lower frequencies. Because each trajectory contributes fewer frames, the effective dataset size and visual diversity decrease; this alternative explanation for intermediate-frequency gains is not ruled out by the reported controls and directly affects the capacity-burden interpretation.

    Authors: We thank the referee for highlighting this important aspect of our experimental design. The iteration-matched controls were intended to isolate the effect of sampling frequency from the number of gradient updates. However, we agree that they do not fully control for the total volume of training data or the number of unique temporal contexts, as lower-frequency datasets contain fewer frames per trajectory. This could indeed provide an alternative explanation for the observed performance gains at intermediate frequencies for smaller models. To address this, we will revise the manuscript to include additional control experiments that equalize the total frame count across frequencies. For instance, we plan to subsample the number of trajectories used at higher frequencies or augment lower-frequency sets by repeating frames to match the data volume of higher-frequency sets. We will report these results and discuss how they impact the capacity-burden interpretation. We believe these additions will strengthen the paper's conclusions without altering the core findings. revision: yes

  2. Referee: [Evaluation protocol] Evaluation uses 3-second ADE/FDE computed on the original dense trajectory points (abstract). When training data are subsampled, the supervision signal density changes while the test metric remains dense; this mismatch could itself produce apparent optima at intermediate frequencies and should be quantified or controlled.

    Authors: The evaluation protocol computes 3-second ADE and FDE on the dense ground-truth trajectories to reflect the practical goal of accurate full-trajectory prediction in autonomous driving. While the training supervision is indeed sparser at lower frequencies, this setup allows us to study how input sampling rate affects the model's ability to predict dense outputs. We acknowledge that this train-test mismatch in supervision density could influence the location of performance optima. To quantify this effect, we will add in the revised manuscript an analysis where we also compute the metrics restricted to the temporally sampled points used during training. This will help determine whether the intermediate-frequency advantages persist under a matched-density evaluation. We maintain that the dense evaluation is the most relevant for the application, but the additional control will address the referee's concern. revision: yes
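Both controls proposed in the responses above are easy to specify precisely. A minimal, hypothetical sketch, assuming a fixed history window and dense (x, y) ground truth: one helper equalizes total frame counts across frequency variants by trimming trajectories, the other restricts ADE to the supervised timestamps. All names and numbers are illustrative:

    import numpy as np

    def trajectories_for_matched_frames(frame_budget, frames_per_traj):
        # Trajectories each frequency variant may keep so that every
        # variant contains the same total frame count.
        # frames_per_traj maps sampling frequency (Hz) -> frames per trajectory.
        return {hz: frame_budget // n for hz, n in frames_per_traj.items()}

    def matched_density_ade(pred_xy, gt_xy, kept_idx):
        # ADE restricted to the future timestamps that were actually
        # supervised during training; pred_xy, gt_xy have shape (T, 2).
        d = np.linalg.norm(pred_xy[kept_idx] - gt_xy[kept_idx], axis=-1)
        return d.mean()

    # e.g. a 2 s history window: 10 Hz -> 20 frames/traj, 2 Hz -> 4 frames/traj
    print(trajectories_for_matched_frames(40_000, {10: 20, 5: 10, 2: 4}))
    # -> {10: 2000, 5: 4000, 2: 10000}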

Circularity Check

0 steps flagged

No circularity in empirical frequency-sweep study

full rationale

The paper performs direct empirical comparisons: it constructs frequency-swept training sets by temporal subsampling of existing trajectories, trains the same models under a fixed protocol, and reports ADE/FDE on held-out data. No equations, derivations, or predictions are presented that reduce by construction to fitted parameters, self-citations, or ansatzes. Iteration-matched controls are introduced precisely to isolate frequency from update-count effects, and results are reported as model- and dataset-dependent observations rather than universal claims. The work therefore rests on direct comparisons against external benchmarks and contains none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on experimental results from training and evaluating models on subsampled datasets. No free parameters are introduced in the sense of fitted constants for a derivation, and no new entities or ad-hoc axioms are postulated beyond standard assumptions in machine learning training.

pith-pipeline@v0.9.0 · 5574 in / 1199 out tokens · 63938 ms · 2026-05-12T04:15:58.257403+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 6 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, et al. Qwen2.5-vl technical report...

  2. [2]

    Nearly-Tight VC-Dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks

    Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019

  3. [3]

    End to End Learning for Self-Driving Cars

    Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016

  4. [4]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020

  5. [5]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

  6. [6]

    End-to-End Autonomous Driving: Challenges and Frontiers

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024. doi: 10.1109/TPAMI.2024.3435937

  7. [7]

    Recent Advancements in End-to-End Autonomous Driving Using Deep Learning: A Survey

    Pranav Singh Chib and Pravendra Singh. Recent advancements in end-to-end autonomous driving using deep learning: A survey. IEEE Transactions on Intelligent Vehicles, 9(1):103–118, 2023

  8. [8]

    End-to-end driving via conditional imitation learning

    Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4693–4700. IEEE, 2018

  9. [9]

    Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

    Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9710–9719, 2021

  10. [10]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019

  11. [11]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  12. [12]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In CVPR, 2023

  13. [13]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024

  14. [14]

    VAD: Vectorized Scene Representation for Efficient Autonomous Driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In ICCV, 2023

  15. [15]

    Navigation-guided sparse scene representation for end-to-end autonomous driving

    Peidong Li and Dixiao Cui. Navigation-guided sparse scene representation for end-to-end autonomous driving. In ICLR, 2025

  16. [16]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025

  17. [17]

    PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation

    Xiangyu Li, Chen Wang, Yumao Liu, Dengbo He, Jiahao Zhang, and Ke Ma. PAVE: An end-to-end dataset for production autonomous vehicle evaluation. arXiv preprint arXiv:2511.14185, 2025

  18. [18]

    BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022

  19. [19]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  20. [20]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017

  21. [21]

    e2edriver

    Adeel Mufti. e2edriver. https://github.com/AdeelMufti/e2edriver, 2026. GitHub repository. Accessed: 2026-03-04

  22. [22]

    Efficient 3d video engine using frame redundancy

    Gao Peng, Bo Pang, and Cewu Lu. Efficient 3d video engine using frame redundancy. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3792–3802, 2021

  23. [23]

    On the spectral bias of neural networks

    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In ICML, 2019

  24. [24]

    SimCache: Similarity Caching for Efficient VLM-based Scene Understanding

    Surya Selvam, Ravi K. Rajendran, Murugan Sankaradas, Anand Raghunathan, and Srimat T. Chakradhar. SimCache: Similarity caching for efficient VLM-based scene understanding. In CVPRW, 2025

  25. [25]

    SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. SparseDrive: End-to-end autonomous driving via sparse scene representation. arXiv preprint arXiv:2405.19620, 2024

  26. [26]

    VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

    Shihao Wang, Guo Chen, De-An Huang, Zhiqi Li, Minghan Li, Guilin Liu, Jose M. Alvarez, Lei Zhang, and Zhiding Yu. VideoITG: Multimodal video understanding with instructed temporal grounding, 2025. URL https://arxiv.org/abs/2507.13353

  27. [27]

    Diversity-Driven View Subset Selection for Indoor Novel View Synthesis

    Zehao Wang, Han Zhou, Matthew B. Blaschko, Tinne Tuytelaars, and Minye Wu. Diversity-driven view subset selection for indoor novel view synthesis. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=F42CRfcp3D

  28. [28]

    WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-Tail Scenarios

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. WOD-E2E: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

  29. [29]

    Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks

    Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, 2020

  30. [30]

    SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving

    Diankun Zhang, Guoan Wang, Runwen Zhu, Jianbo Zhao, Xiwu Chen, Siyu Zhang, Jiahao Gong, Qibin Zhou, Wenyuan Zhang, Ningzi Wang, Feiyang Tan, Hangning Zhou, Ziyao Xu, Haotian Yao, Chi Zhang, Xiaojun Liu, Xiaoguang Di, and Bin Li. SparseAD: Sparse query-centric paradigm for efficient end-to-end autonomous driving. arXiv preprint arXiv:2404.06328, 2024

  31. [31]

    GenAD: Generative End-to-End Autonomous Driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. GenAD: Generative end-to-end autonomous driving. In European Conference on Computer Vision, pages 87–104. Springer, 2024

  32. [32]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Zewei Zhou, Tianhui Cai, Yun Zhao, Seth Z. Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025

  33. [33]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023