GSDrive: Reinforcing Driving Policies by Multi-mode Future Trajectory Probing with 3D Gaussian Splatting Environment

Chen Min; Dzmitry Tsetserukou; Shuo Wang; Sifa Zheng; Xuefeng Zhang; Yixiao Zhou; Ziang Guo; Zufeng Zhang

arxiv: 2604.28111 · v3 · pith:W4TMZISDnew · submitted 2026-04-30 · 💻 cs.RO


Ziang Guo, Chen Min, Xuefeng Zhang, Yixiao Zhou, Shuo Wang, Sifa Zheng, Dzmitry Tsetserukou, Zufeng Zhang

Pith reviewed 2026-05-19 16:42 UTC · model grok-4.3

classification 💻 cs.RO

keywords autonomous driving · 3D Gaussian Splatting · reinforcement learning · imitation learning · trajectory probing · reward shaping · end-to-end driving · closed-loop evaluation


The pith

A 3D Gaussian Splatting environment probes multiple candidate futures to supply dense rewards that refine end-to-end driving policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix the problem that standard reinforcement learning for autonomous driving only receives feedback after crashes or other rare failures, which causes policies to settle on mediocre behaviors. GSDrive instead builds a differentiable simulation from 3D Gaussian Splatting that lets the policy test several possible future trajectories at each step. The simulated outcomes of those futures are converted into continuous shaping rewards that guide the policy toward better choices before any real mistake occurs. Imitation learning supplies initial structured trajectory ideas while reinforcement learning uses the interactive simulation feedback to refine them in a repeating cycle. The result is reported to be stronger closed-loop performance than other simulation-based RL methods when tested on a reconstructed nuScenes dataset.

Core claim

GSDrive first trains a multi-mode trajectory probe through imitation learning, then deploys reinforcement learning to evaluate multiple candidate futures inside the 3D Gaussian Splatting environment; the returns from these simulated futures are turned into dense shaping rewards that optimize the driving policy. This produces a cyclic hybrid IL-RL loop in which imitation supplies future priors and reinforcement supplies interactive corrections, yielding measurable gains over prior simulation RL baselines in closed-loop nuScenes experiments.

What carries the argument

The multi-mode trajectory probe operating inside the differentiable 3D Gaussian Splatting environment, which evaluates several candidate futures and converts their simulated returns into dense shaping rewards for policy updates.
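
A minimal Python sketch of how probed futures could be folded into a per-step reward, assuming the form quoted elsewhere on this page, r_total_t = w_env · r_env_t + w_probe · max_i r_probe(τ_i). The function names, the default weights, and the choice of a discounted sum over the probe horizon as r_probe are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def probe_return(step_rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Discounted return of one probed candidate trajectory.

    The probe horizon H is simply len(step_rewards). Using a discounted
    sum here is an assumption; the page only says "simulated returns".
    """
    return sum((gamma ** k) * r for k, r in enumerate(step_rewards))


def shaped_reward(
    r_env: float,
    candidate_step_rewards: Sequence[Sequence[float]],
    w_env: float = 1.0,    # hypothetical weight on the environment reward
    w_probe: float = 0.5,  # hypothetical weight on the probe term
    gamma: float = 0.99,
) -> float:
    """r_total = w_env * r_env + w_probe * max_i r_probe(tau_i)."""
    if not candidate_step_rewards:
        # No probes available: fall back to the plain environment reward.
        return w_env * r_env
    best = max(probe_return(r, gamma) for r in candidate_step_rewards)
    return w_env * r_env + w_probe * best


if __name__ == "__main__":
    # Three candidate futures simulated for two steps each (toy numbers).
    candidates = [[1.0, 1.0], [0.0, 0.0], [-1.0, 0.5]]
    print(shaped_reward(0.0, candidates, gamma=0.5))
```

Taking the maximum over candidates rewards the policy for reaching states from which at least one good future exists, which is one way to read "future-aware" shaping; other aggregations (mean, soft-max) would be equally easy to swap in.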

If this is right

Policies receive feedback based on anticipated future states rather than waiting for actual collisions or rule violations.
The hybrid training alternates between imitation learning that provides structured trajectory priors and reinforcement learning that supplies interactive simulation feedback.
Dense rewards derived from multi-mode probing reduce the tendency of policies to converge prematurely to suboptimal driving behaviors.
Evaluation remains within a closed-loop setting on reconstructed real-world data, allowing direct comparison to other simulation RL approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same future-probing idea could be tested on longer prediction horizons or in scenarios with more dynamic agents such as pedestrians and cyclists.
Because the environment is differentiable, the framework might support direct gradient-based policy optimization in addition to the current RL updates.
One could check whether the performance advantage persists when the 3DGS reconstruction is built from fewer views or lower-quality sensor data.

Load-bearing premise

The 3D Gaussian Splatting reconstruction must be accurate enough at modeling vehicle dynamics and interactions that rewards computed inside the simulation actually improve the policy when it is deployed in the real world.

What would settle it

If a policy trained with GSDrive shows no improvement or performs worse than a standard delayed-reward RL baseline when both are evaluated in closed-loop driving on the same nuScenes validation scenes, the benefit of the future-probing and dense-reward mechanism would be falsified.

Figures

Figures reproduced from arXiv: 2604.28111 by Chen Min, Dzmitry Tsetserukou, Shuo Wang, Sifa Zheng, Xuefeng Zhang, Yixiao Zhou, Ziang Guo, Zufeng Zhang.

**Figure 1.** The IL stage pipeline. Observations from the 3DGS environment are processed through ResNet and BEV…

**Figure 2.** The RL stage pipeline. D. RL Stage: We formulate the RL training as a Markov Decision Process (MDP) defined as (S, A, T, R, γ). The state S_t combines camera images Î_t, agent detections A_t, and the camera intrinsic and extrinsic matrices K_t and E_t used for image projection, giving (Î, A, K, E). The action space A_t is defined by the grid space of trajectory points. The policy outputs logits for a categori…

**Figure 3.** The training performance comparison, where the…

**Figure 4.** Qualitative comparisons in the closed-loop test.
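
The Figure 2 caption describes an action space defined on a grid of trajectory points, with the policy emitting logits. The caption is cut off, so reading it as a categorical distribution over grid cells is an assumption. Under that assumption, sampling an action could look like the sketch below; the grid extents and the names `make_grid` and `sample_waypoint` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def make_grid(xs: Sequence[float], ys: Sequence[float]) -> List[Point]:
    """Flatten a 2D grid of candidate trajectory points into a list."""
    return [(x, y) for x in xs for y in ys]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_waypoint(
    logits: Sequence[float], grid: Sequence[Point], rng: random.Random
) -> Tuple[int, Point, float]:
    """Sample one grid cell and return (index, point, log-probability)."""
    if len(logits) != len(grid):
        raise ValueError("expected one logit per grid cell")
    probs = softmax(logits)
    idx = rng.choices(range(len(grid)), weights=probs, k=1)[0]
    return idx, grid[idx], math.log(probs[idx])


if __name__ == "__main__":
    # Hypothetical grid: 5 longitudinal x 3 lateral cells, in metres.
    grid = make_grid([0.0, 2.5, 5.0, 7.5, 10.0], [-2.0, 0.0, 2.0])
    logits = [0.0] * len(grid)
    logits[7] = 3.0  # the policy prefers the cell straight ahead at 5 m
    print(sample_waypoint(logits, grid, random.Random(0)))
```

The returned log-probability is the quantity a policy-gradient method would need for its update; which RL algorithm consumes it is left open here.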

Original abstract

End-to-end (E2E) autonomous driving aims to directly map sensory observations to driving actions, but its real-world deployment is hindered by evolving data distributions and the high cost of continual annotation. While combining imitation learning (IL) and reinforcement learning (RL) is a common strategy for policy improvement, conventional RL training relies on delayed, event-based rewards, where policies learn only from catastrophic outcomes such as collisions, leading to premature convergence to suboptimal behaviors. To address these limitations, we propose GSDrive, a framework that uses a differentiable 3D Gaussian Splatting (3DGS) environment for future-aware trajectory probing and reward shaping in E2E driving. GSDrive first learns a multi-mode trajectory probe via IL and then uses RL to evaluate multiple candidate futures in the 3DGS environment, converting their simulated returns into dense shaping rewards for policy optimization. This yields a cyclic hybrid IL-RL training loop, where IL supplies structured future priors and RL provides interactive feedback for iterative refinement. Evaluated on the reconstructed nuScenes dataset, our method outperforms other simulation-based RL approaches in closed-loop experiments. Code is available at https://github.com/ZionGo6/GSDrive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GSDrive presents a practical hybrid IL-RL method using 3DGS for dense reward shaping in driving, though the simulation's ability to model dynamics accurately remains the main open question.


The main point here is that GSDrive uses 3D Gaussian Splatting to simulate future vehicle trajectories in a differentiable way, turning those simulations into dense rewards that guide reinforcement learning for driving policies. This helps avoid the usual issue where RL only learns from rare collisions.

What the paper does is combine imitation learning to generate multi-mode future probes with RL that evaluates them in the 3DGS environment. The result is a training loop that refines the policy iteratively. They evaluate it on reconstructed nuScenes scenes and report better closed-loop performance than other sim-based RL approaches. The public code link is helpful. The approach is new in how it integrates the probing directly with the 3DGS for reward shaping in this domain.

A potential weak point is the fidelity of the simulation for dynamics. 3DGS is excellent for visual reconstruction and novel views, but to probe trajectories you need to model how the ego vehicle and others move over time, including physics like acceleration and collisions. The paper likely adds some models for this, but if they are basic, the rewards could be based on inaccurate futures that do not carry over to real driving. The stress-test concern about this seems on target. Without seeing detailed ablations or error analysis in the experiments, it's hard to tell how much the gains depend on the specific probing versus the environment setup itself.

This paper would interest people working on simulation-aided learning for autonomous vehicles and hybrid IL-RL methods. It shows clear thinking on the problem and engages with the literature on driving policies, so it is worth a serious referee's time even if revisions are needed on the validation side. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces GSDrive, a hybrid imitation learning (IL) and reinforcement learning (RL) framework for end-to-end autonomous driving. It first trains a multi-mode trajectory probe via IL, then uses RL to evaluate candidate futures inside a differentiable 3D Gaussian Splatting (3DGS) environment, converting the simulated returns into dense shaping rewards that refine the policy in a cyclic IL-RL loop. The central empirical claim is that the resulting policy outperforms other simulation-based RL methods in closed-loop evaluation on a reconstructed nuScenes dataset.

Significance. If the closed-loop gains are robust, the work would demonstrate a practical way to obtain dense, future-aware rewards for driving policies without relying solely on sparse collision events. The public code release and use of an existing reconstructed dataset are strengths that aid reproducibility. The approach extends 3DGS beyond static novel-view synthesis into interactive RL, which could influence future sim-to-real pipelines if the dynamics modeling proves reliable.

major comments (2)

[§4 (Experiments) and abstract] The claim of outperformance over other simulation-based RL approaches in closed-loop experiments is presented without quantitative metrics, specific baseline names, ablation tables, or error bars. Because this is the primary evidence for the central contribution, the results section must supply these details (e.g., success rate, collision rate, or route completion percentage) to allow verification.
[§3.2 (3DGS Environment)] The description of how the 3DGS scene supports ego-vehicle rollout and multi-agent interactions during trajectory probing is insufficient. Standard 3DGS pipelines render static geometry; without explicit kinematic models, tire dynamics, or validated re-rendering of dynamic objects, the simulated returns used for dense rewards may optimize against reconstruction artifacts rather than real driving constraints. This assumption is load-bearing for the reward-shaping loop and the transfer claim.
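
To make the first request concrete, the sketch below shows one way closed-loop metrics with error bars over seeds could be tabulated. It is an illustration only: the `Episode` record, the definition of success (no collision and full route completion), and the toy numbers are assumptions, not the paper's evaluation protocol or results.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


@dataclass
class Episode:
    collided: bool           # any collision during the closed-loop rollout
    route_completion: float  # fraction of the route driven, in [0, 1]


def seed_metrics(episodes: Sequence[Episode]) -> Dict[str, float]:
    """Metrics for all episodes run under a single random seed."""
    n = len(episodes)
    return {
        "collision_rate": sum(e.collided for e in episodes) / n,
        "route_completion": mean(e.route_completion for e in episodes),
        # Assumed definition: success = no collision and full completion.
        "success_rate": sum(
            (not e.collided) and e.route_completion >= 1.0 for e in episodes
        ) / n,
    }


def aggregate(per_seed: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of each metric across seeds."""
    out = {}
    for key in per_seed[0]:
        values = [m[key] for m in per_seed]
        spread = stdev(values) if len(values) > 1 else 0.0
        out[key] = (mean(values), spread)
    return out


if __name__ == "__main__":
    # Toy data for two seeds; real runs would use many more episodes.
    seed_a = [Episode(False, 1.0), Episode(True, 0.5)]
    seed_b = [Episode(False, 1.0), Episode(False, 1.0)]
    for name, (mu, sd) in aggregate([seed_metrics(seed_a), seed_metrics(seed_b)]).items():
        print(f"{name}: {mu:.3f} ± {sd:.3f}")
```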

minor comments (2)

[§3] The notation for the multi-mode probe and the conversion of simulated returns into shaping rewards could be clarified with an explicit equation or pseudocode block.
[Figures] Figure captions should explicitly state whether the visualized trajectories are generated inside the 3DGS environment or on the real nuScenes data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on strengthening the empirical evidence and clarifying the simulation assumptions. We address each major comment below and outline the planned revisions.


Referee: [§4 (Experiments) and abstract] The claim of outperformance over other simulation-based RL approaches in closed-loop experiments is presented without quantitative metrics, specific baseline names, ablation tables, or error bars. Because this is the primary evidence for the central contribution, the results section must supply these details (e.g., success rate, collision rate, or route completion percentage) to allow verification.

Authors: We agree that the current presentation of results in §4 and the abstract lacks sufficient quantitative detail to fully substantiate the outperformance claim. In the revised manuscript we will expand the experiments section to report concrete metrics including success rate, collision rate, and route completion percentage. We will explicitly name the simulation-based RL baselines, include ablation tables isolating the contributions of the multi-mode probe and dense reward shaping, and provide error bars computed over multiple random seeds. These additions will directly address the request for verifiable evidence of closed-loop gains on the reconstructed nuScenes dataset.
Revision: yes
Referee: [§3.2 (3DGS Environment)] The description of how the 3DGS scene supports ego-vehicle rollout and multi-agent interactions during trajectory probing is insufficient. Standard 3DGS pipelines render static geometry; without explicit kinematic models, tire dynamics, or validated re-rendering of dynamic objects, the simulated returns used for dense rewards may optimize against reconstruction artifacts rather than real driving constraints. This assumption is load-bearing for the reward-shaping loop and the transfer claim.

Authors: We acknowledge that §3.2 currently provides only a high-level description of the 3DGS environment and does not fully specify the mechanisms for dynamic rollouts. In the revision we will add explicit details on the kinematic bicycle model and tire dynamics used to simulate ego-vehicle and surrounding agent trajectories during probing. We will also describe the procedure for re-rendering dynamic objects at each timestep and any validation steps taken to ensure simulated returns reflect driving constraints rather than reconstruction artifacts. These clarifications will strengthen the justification for using the simulated returns as dense shaping rewards.
Revision: yes
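
For readers unfamiliar with the kinematic bicycle model named in the simulated rebuttal, here is a minimal textbook version (rear-axle reference point, forward-Euler integration). It is a generic sketch: the wheelbase, time step, and function names are assumptions, tire dynamics are not modeled, and nothing here is taken from the GSDrive code.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class State:
    x: float    # position in metres
    y: float
    yaw: float  # heading in radians
    v: float    # forward speed in m/s


def bicycle_step(
    s: State, accel: float, steer: float,
    wheelbase: float = 2.7,  # hypothetical passenger-car wheelbase (m)
    dt: float = 0.1,         # hypothetical integration step (s)
) -> State:
    """One forward-Euler step of the rear-axle kinematic bicycle model."""
    return State(
        x=s.x + s.v * math.cos(s.yaw) * dt,
        y=s.y + s.v * math.sin(s.yaw) * dt,
        yaw=s.yaw + (s.v / wheelbase) * math.tan(steer) * dt,
        v=s.v + accel * dt,
    )


def rollout(
    s0: State, controls: Sequence[Tuple[float, float]],
    wheelbase: float = 2.7, dt: float = 0.1,
) -> List[State]:
    """Roll a sequence of (accel, steer) controls into a trajectory."""
    traj = [s0]
    for accel, steer in controls:
        traj.append(bicycle_step(traj[-1], accel, steer, wheelbase, dt))
    return traj


if __name__ == "__main__":
    # Drive straight at 10 m/s for one second.
    final = rollout(State(0.0, 0.0, 0.0, 10.0), [(0.0, 0.0)] * 10)[-1]
    print(final)
```

A probe built on a model like this would generate each candidate future by rolling out a different control sequence and then scoring the resulting states in the rendered scene.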

Circularity Check

0 steps flagged

No circularity in the proposed empirical training framework


The paper presents GSDrive as a practical hybrid IL-RL procedure: IL learns multi-mode trajectory priors, RL evaluates candidate futures inside a 3DGS simulator to generate dense shaping rewards, and the loop iterates for policy refinement. This is described as an algorithmic training recipe evaluated empirically on reconstructed nuScenes data, with the central claim being closed-loop outperformance versus other simulation-based RL baselines. No equations, first-principles derivations, or predictions are offered that reduce by construction to fitted parameters or prior self-citations; the method does not invoke uniqueness theorems, smuggle ansatzes, or rename known results. The cyclic loop is an explicit design choice for iterative improvement rather than a self-referential tautology, leaving the reported gains as independent empirical outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that 3DGS reconstruction yields a sufficiently faithful simulator for reward computation; no explicit free parameters or invented physical entities are mentioned in the abstract.

axioms (1)

Domain assumption: the 3D Gaussian Splatting environment accurately models real-world vehicle dynamics and scene interactions for future trajectory evaluation.
This premise is required for the simulated returns to produce useful shaping rewards that improve real policies.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: GSDrive first learns a multi-mode trajectory probe via IL and then uses RL to evaluate multiple candidate futures in the 3DGS environment, converting their simulated returns into dense shaping rewards

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: The reward function r_total_t = w_env · r_env_t + w_probe · max_i (r_probe(τ_i)) with probe horizon H

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

End- to-end autonomous driving: Challenges and frontiers,

L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End- to-end autonomous driving: Challenges and frontiers,”IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 164–10 183, 2024

work page 2024

[2] [2]

The era of end-to-end autonomy: Transitioning from rule-based driving to large driving models,

E. Nebot and J. S. B. Perez, “The era of end-to-end autonomy: Transitioning from rule-based driving to large driving models,”arXiv preprint arXiv:2603.16050, 2026

work page arXiv 2026

[3] [3]

Iterative label refinement matters more than preference optimization under weak supervision,

Y . Ye, C. Laidlaw, and J. Steinhardt, “Iterative label refinement matters more than preference optimization under weak supervision,”arXiv preprint arXiv:2501.07886, 2025

work page arXiv 2025

[4] [4]

End-to-end driving with online trajectory evaluation via bev world model,

Y . Li, Y . Wang, Y . Liu, J. He, L. Fan, and Z. Zhang, “End-to-end driving with online trajectory evaluation via bev world model,” in Proceedings of the IEEE/CVF Int. Conf. on Computer Vision, 2025, pp. 27 137–27 146

work page 2025

[5] [5]

Centaur: Robust end-to-end autonomous driving with test-time training

C. Sima, K. Chitta, Z. Yu, S. Lan, P. Luo, A. Geiger, H. Li, and J. M. Alvarez, “Centaur: Robust end-to-end autonomous driving with test-time training,”arXiv preprint arXiv:2503.11650, 2025

work page arXiv 2025

[6] [6]

Data scaling laws for end-to-end autonomous driving,

A. Naumann, X. Gu, T. Dimlioglu, M. Bojarski, A. Degirmenci, A. Popov, D. Bisla, M. Pavone, U. Muller, and B. Ivanovic, “Data scaling laws for end-to-end autonomous driving,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2571–2582

work page 2025

[7] [7]

Synad: Enhancing real-world end-to-end autonomous driving models through synthetic data integration,

J. Kim, J. Lee, G. Han, D.-J. Lee, M. Jeong, and J. Kim, “Synad: Enhancing real-world end-to-end autonomous driving models through synthetic data integration,” inProceedings of the IEEE/CVF Int. Conf. on Computer Vision, 2025, pp. 25 197–25 206. Fig. 4: Qualitative comparisons in the closed-loop test

work page 2025

[8] [8]

Diffe2e: Rethinking end-to-end driving with a hybrid diffusion-regression-classification policy,

R. Zhao, Y . Fan, Z. Chen, F. Gao, and Z. Gao, “Diffe2e: Rethinking end-to-end driving with a hybrid diffusion-regression-classification policy,” inThe Thirty-ninth Annual Conf. on Neural Information Processing Systems, 2025

work page 2025

[9] [9]

Distilldrive: End-to- end multi-mode autonomous driving distillation by isomorphic hetero- source planning model,

R. Yu, X. Zhang, R. Zhao, H. Yan, and M. Wang, “Distilldrive: End-to- end multi-mode autonomous driving distillation by isomorphic hetero- source planning model,” inProceedings of the IEEE/CVF Int. Conf. on Computer Vision, 2025, pp. 26 188–26 197

work page 2025

[10] [10]

Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving,

R. Feng, N. Xi, D. Chu, R. Wang, Z. Deng, A. Wang, L. Lu, J. Wang, and Y . Huang, “Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving,”IEEE Robotics and Automation Letters, vol. 11, no. 1, pp. 226–233, 2025

work page 2025

[11] [11]

Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring,

Z. Li, W. Yao, Z. Wang, X. Sun, J. Chen, N. Chang, M. Shen, J. Song, Z. Wu, S. Lan,et al., “Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring,”arXiv preprint arXiv:2510.24108, 2025

work page arXiv 2025

[12] [12]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang,et al., “Recogdrive: A reinforced cog- nitive framework for end-to-end autonomous driving,”arXiv preprint arXiv:2506.08052, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

S. Shang, Y . Chen, Y . Wang, Y . Li, and Z. Zhang, “Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving,”arXiv preprint arXiv:2509.17940, 2025

work page arXiv 2025

[14] [14]

Takead: Preference-based post-optimization for end-to-end autonomous driving with expert takeover data,

D. Liu, Y . Gao, D. Qian, Q. Zhang, X. Ye, J. Han, Y . Zheng, X. Liu, Z. Xia, D. Ding,et al., “Takead: Preference-based post-optimization for end-to-end autonomous driving with expert takeover data,”IEEE Robotics and Automation Letters, vol. 11, no. 2, pp. 1738–1745, 2025

work page 2025

[15] H. Gao, S. Chen, B. Jiang, B. Liao, Y. Shi, X. Guo, Y. Pu, H. Yin, X. Li, X. Zhang, et al., “Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,” arXiv preprint arXiv:2502.13144, 2025

[16] C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, X. Chen, G. Jia, G. Huang, and W. Mei, “Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction,” arXiv preprint arXiv:2508.08170, 2025

[17] J. Wang, Z. Yang, Y. Bai, Y. Li, Y. Zou, B. Sun, A. Kundu, J. Lezama, L. Y. Huang, Z. Zhu, et al., “Drive&gen: Co-evaluating end-to-end driving and video generation models,” in 2025 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 8934–8941

[18] H. Lin, Y. Yang, Y. Zhang, C. Zheng, J. Feng, S. Wang, Z. Wang, S. Chen, B. Wang, Y. Zhang, et al., “Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model,” arXiv preprint arXiv:2512.11226, 2025

[19] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306

[20] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in European conference on computer vision. Springer, 2020, pp. 194–210

[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition, June 2016

[22] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022

[23] N. Kornilov, P. Mokrov, A. Gasnikov, and A. Korotin, “Optimal flow matching: Learning straight trajectories in just one step,” Advances in Neural Information Processing Systems, vol. 37, pp. 104 180–104 204, 2024

[24] K. Pham, K. Le, N. Ho, T. Pham, and H. Bui, “On unbalanced optimal transport: An analysis of sinkhorn algorithm,” in Int. Conf. on Machine Learning. PMLR, 2020, pp. 7673–7682

[25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

[26] C. Hao, C. Weaver, C. Tang, K. Kawamoto, M. Tomizuka, and W. Zhan, “Skill-critic: Refining learned skills for hierarchical reinforcement learning,” IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3625–3632, 2024

[27] Y. Ye, T. He, S. Yang, and J. Bian, “Reinforcement learning with inverse rewards for world model post-training,” arXiv preprint arXiv:2509.23958, 2025

[28] Q. Li, Z. Zhou, and S. Levine, “Reinforcement learning with action chunking,” arXiv preprint arXiv:2507.07969, 2025

[29] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

[30] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020

[31] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022