arXiv: 2604.18918 · v1 · submitted 2026-04-20 · cs.SE · cs.LG


From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing

Linfeng Liang , Xiao Cheng , Tsong Yueh Chen , Xi Zheng


Pith reviewed 2026-05-10 03:39 UTC · model grok-4.3


keywords: autonomous driving systems, scenario generation, Stein variational gradient descent, safety testing, simulation-based testing, hazardous scenarios, CARLA simulator


The pith

PtoP applies Stein Variational Gradient Descent to generate diverse failure-inducing seeds for autonomous driving system tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PtoP as a plug-and-play framework that pairs adaptive random seeding with Stein Variational Gradient Descent to create initial conditions for simulation-based testing of autonomous driving systems. Standard search methods such as genetic algorithms tend to collapse onto a few modes in high-dimensional traffic spaces and therefore miss many possible failures. SVGD moves particles toward high-risk regions while repelling them from one another so that the resulting seeds remain distributed across multiple failure modes. When these seeds feed existing testers such as reinforcement-learning agents, the combined system uncovers more safety violations, covers more of the map, and produces a wider variety of scenarios. A reader would care because thorough discovery of realistic edge cases is required before autonomous vehicles can be trusted in dense traffic.
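The attraction and repulsion behaviour described above can be sketched on a toy problem. This is a minimal illustration, not the PtoP implementation: the one-dimensional two-mode "risk" density, the fixed kernel bandwidth, the step size, and every function name are assumptions made for the example.

```python
import math
import random


def grad_log_risk(x, modes=(-2.0, 2.0), sigma=0.5):
    """Gradient of log p(x) for a toy equal-weight Gaussian mixture (hypothetical risk density)."""
    exps = [-((x - m) ** 2) / (2 * sigma**2) for m in modes]
    top = max(exps)  # subtract the largest exponent for numerical stability
    weights = [math.exp(e - top) for e in exps]
    total = sum(weights)
    return sum(w * (m - x) for w, m in zip(weights, modes)) / (sigma**2 * total)


def svgd_step(particles, step=0.05, bandwidth=0.5):
    """One SVGD update with the RBF kernel k(a, b) = exp(-(a - b)^2 / bandwidth)."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / bandwidth)
            attraction = k * grad_log_risk(xj)  # pulls xi toward high-density regions
            repulsion = 2.0 * (xi - xj) / bandwidth * k  # gradient of k w.r.t. xj; pushes xi away from xj
            phi += attraction + repulsion
        updated.append(xi + step * phi / n)
    return updated


def run(n_particles=20, n_steps=500, seed=0):
    rng = random.Random(seed)
    particles = [rng.uniform(-3.0, 3.0) for _ in range(n_particles)]
    for _ in range(n_steps):
        particles = svgd_step(particles)
    return particles


if __name__ == "__main__":
    final = run()
    left = sum(1 for x in final if x < 0)
    print(f"{left} particles on the left mode, {len(final) - left} on the right mode")
```

In this toy setting the particles settle around both modes rather than gathering at a single one, which is the property the summary attributes to SVGD. Whether the same holds in the high-dimensional traffic space is exactly what the paper's evaluation has to show.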

Core claim

PtoP combines adaptive random seed generation with Stein Variational Gradient Descent to produce diverse, failure-inducing initial conditions for autonomous driving system testing. SVGD balances attraction toward high-risk regions and repulsion among particles, yielding risk-seeking yet well-distributed seeds across multiple failure modes. Evaluation in CARLA on Apollo, Autoware, and a native end-to-end system shows that PtoP improves safety violation rate by up to 27.68 percent, scenario diversity by 9.6 percent, and map coverage by 16.78 percent over baselines.

What carries the argument

Stein Variational Gradient Descent applied to particle positions, performing gradient updates that attract particles to high-risk areas while repelling them to preserve diversity across failure modes.
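For reference, the standard SVGD update from Liu and Wang (2016) makes the two forces explicit. Here $p$ stands for a target density over initial conditions that places more mass on high-risk regions. How PtoP constructs that density is not stated in this summary, so $p$, the kernel $k$, and the step size $\epsilon$ are placeholders.

```latex
x_i \leftarrow x_i + \epsilon\, \hat{\phi}(x_i), \qquad
\hat{\phi}(x) = \frac{1}{n} \sum_{j=1}^{n} \Big[
  \underbrace{k(x_j, x)\, \nabla_{x_j} \log p(x_j)}_{\text{attraction}}
  + \underbrace{\nabla_{x_j} k(x_j, x)}_{\text{repulsion}}
\Big]
```

The first term moves each particle along a kernel-weighted average of the score at all particles. The second term is the kernel gradient, which pushes nearby particles apart.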

Load-bearing premise

SVGD can balance attraction to high-risk regions against repulsion among particles in high-dimensional spaces to produce realistic yet diverse failure scenarios without mode collapse or unrealistic artifacts.

What would settle it

Running identical testing budgets with and without PtoP seeds in repeated CARLA trials on Apollo or Autoware and counting whether the number of distinct safety violations or failure modes differs by a statistically significant margin.
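One way to check such a difference is a permutation test on per-run violation counts. The sketch below assumes each trial yields a single count of distinct safety violations. The numbers are made up for illustration and are not results from the paper, and the function name is hypothetical.

```python
import random
from statistics import mean


def permutation_test(with_seeds, without_seeds, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean violation counts."""
    rng = random.Random(seed)
    observed = abs(mean(with_seeds) - mean(without_seeds))
    pooled = list(with_seeds) + list(without_seeds)
    n_with = len(with_seeds)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of runs under the null hypothesis
        diff = abs(mean(pooled[:n_with]) - mean(pooled[n_with:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Made-up counts of distinct safety violations per trial, for illustration only.
ptop_runs = [14, 17, 15, 18, 16, 19, 15, 17]
baseline_runs = [11, 12, 13, 10, 12, 14, 11, 13]
p_value = permutation_test(ptop_runs, baseline_runs)
print(f"p = {p_value:.4f}")
```

A permutation test is used here because it makes no distributional assumption about the counts. A rank-based test such as Mann-Whitney U would be a reasonable alternative.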

Figures

Figures reproduced from arXiv: 2604.18918 by Linfeng Liang, Tsong Yueh Chen, Xiao Cheng, Xi Zheng.

**Figure 2.** Case study. … likely due to its significantly larger area, which reduces the frequency of interactions that lead to safety violations. We conducted a case study to manually analyze the causes of selected safety-violation cases identified by PtoP in Apollo. We selected cases that involve heterogeneous dynamic objects and exhibit diverse causes of safety violations. The vehicle annotated with a red dot in the …

**Figure 3.** Scatter plot of the initial relative position between ego vehicle and dynamic objects across different …

**Figure 4.** Scatter plot of the absolute initial position of dynamic objects across different maps generated by …

**Figure 5.** Boxplots for user ratings on sampled video with outliers.

**Figure 6.** Outlier analysis. … MOSAT and GARL. For PtoP without ART, the method still maintains high map coverage, but the parameter distance drops to the level of GA and Random. This aligns with our hypothesis: ART tends to identify additional failure modes, while SVGD explores within each discovered mode. Consequently, removing ART reduces PtoP’s ability to uncover more failure modes, yet its exploration capability r…

Original abstract

Simulation-based testing of autonomous driving systems (ADS) must uncover realistic and diverse failures in dense, heterogeneous traffic. However, existing search-based seeding methods (e.g., genetic algorithms) struggle in high-dimensional spaces, often collapsing to limited modes and missing many failure scenarios. We present PtoP, a framework that combines adaptive random seed generation with Stein Variational Gradient Descent (SVGD) to produce diverse, failure-inducing initial conditions. SVGD balances attraction toward high-risk regions and repulsion among particles, yielding risk-seeking yet well-distributed seeds across multiple failure modes. PtoP is plug-and-play and enhances existing online testing methods (e.g., reinforcement learning-based testers) by providing principled seeds. Evaluation in CARLA on two industry-grade ADS (Apollo, Autoware) and a native end-to-end system shows that PtoP improves safety violation rate (up to 27.68%), scenario diversity (9.6%), and map coverage (16.78%) over baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PtoP applies SVGD to ADS scenario seeding and reports clear gains on CARLA with Apollo and Autoware, but the diversity lift is not isolated from the adaptive seeding step.


The paper's core move is to treat initial-condition search as a particle optimization problem and bring in SVGD so the particles both chase high-risk regions and repel one another. That combination with adaptive random seeding is the new piece; prior genetic or random methods either collapse or fail to scale in the high-dimensional traffic state space. The authors then plug the resulting seeds into existing online testers and run on CARLA against Apollo, Autoware, and one end-to-end system. The headline numbers are improvements of up to 27.68% in safety violation rate, 9.6% in scenario diversity, and 16.78% in map coverage. Those are the kind of practical deltas that matter for testing pipelines that already exist in industry labs. The method is also described as plug-and-play, which lowers the barrier for adoption.

The main weakness is that the evaluation stays at aggregate metrics. The abstract and results attribute the spread across failure modes to SVGD's kernel repulsion, yet there is no ablation that removes or weakens the repulsion term while holding the risk gradient and adaptive seeding fixed. Without per-mode histograms or a controlled comparison, it is hard to know how much of the diversity gain is truly coming from the SVGD mechanism rather than the seeding heuristic or the risk function itself. The experimental description also gives limited detail on statistical significance or variance across runs.

This work is aimed at researchers and engineers who already run simulation-based ADS testing and want a drop-in way to improve seed quality. A reader who cares about measurable coverage improvements in CARLA will find usable numbers here. The idea is grounded enough and the empirical results are sharp enough that it should go to peer review rather than a desk reject; the referees will almost certainly ask for the missing ablations, but the current evidence is already worth their time.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PtoP, a framework combining adaptive random seed generation with Stein Variational Gradient Descent (SVGD) to produce diverse, failure-inducing initial conditions for simulation-based testing of autonomous driving systems. SVGD is used to balance attraction toward high-risk regions with repulsion among particles to avoid mode collapse in high-dimensional spaces. The approach is presented as plug-and-play for enhancing existing testers (e.g., RL-based). Evaluation in CARLA on Apollo, Autoware, and a native end-to-end ADS reports improvements over baselines: safety violation rate up to 27.68%, scenario diversity 9.6%, and map coverage 16.78%.

Significance. If the empirical claims are substantiated, PtoP would offer a useful mechanism for seeding ADS testers to uncover more realistic and diverse failures. The plug-and-play design and multi-system evaluation (industry-grade plus end-to-end) are strengths that could aid adoption in testing pipelines. The application of SVGD to scenario generation is a novel angle worth exploring if the mechanism is properly validated.

major comments (2)

Evaluation section: the headline diversity (9.6%) and coverage (16.78%) gains are reported only as aggregate metrics against baselines. No ablation is described that removes or varies the SVGD repulsion term (while holding the attraction-to-risk function and adaptive seeding fixed), nor are per-mode histograms or failure-type coverage tables provided. Without this isolation, the observed lifts cannot be confidently attributed to the SVGD repulsion mechanism rather than the risk function or random-seed component, undermining the central claim that SVGD successfully distributes particles across multiple high-risk modes.
Method section: the description of the SVGD kernel and bandwidth selection in high-dimensional initial-condition space lacks sufficient detail or sensitivity analysis. The claim that the repulsion term prevents mode collapse therefore rests on an untested assumption; a concrete test (e.g., bandwidth sweep or repulsion-ablated runs) is needed to support the balance asserted in the abstract.
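The repulsion-ablated run requested in these comments can be illustrated on a toy target. The sketch below is an assumption-laden stand-in, not the paper's setup: the target is a one-dimensional standard normal, the bandwidth is fixed, and `repulsion_weight` is a hypothetical knob that scales the kernel-gradient term while the attraction term and the starting particles stay fixed.

```python
import math
import random
from statistics import pstdev


def svgd_step(particles, repulsion_weight, step=0.1, bandwidth=1.0):
    """One SVGD step on a standard-normal toy target; repulsion_weight=0 ablates repulsion."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / bandwidth)
            attraction = k * (-xj)  # grad log N(x; 0, 1) evaluated at xj
            repulsion = 2.0 * (xi - xj) / bandwidth * k  # gradient of k w.r.t. xj
            phi += attraction + repulsion_weight * repulsion
        updated.append(xi + step * phi / n)
    return updated


def final_spread(repulsion_weight, n_particles=20, n_steps=500, seed=0):
    rng = random.Random(seed)  # same seed, so both arms start from identical particles
    particles = [rng.uniform(-3.0, 3.0) for _ in range(n_particles)]
    for _ in range(n_steps):
        particles = svgd_step(particles, repulsion_weight)
    return pstdev(particles)


full = final_spread(repulsion_weight=1.0)
ablated = final_spread(repulsion_weight=0.0)
print(f"spread with repulsion: {full:.3f}, without: {ablated:.3f}")
```

With the repulsion term switched off, the particles contract toward the mode and the spread shrinks. A comparison of this shape on the real risk function would isolate how much of the reported diversity comes from repulsion.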

minor comments (2)

The abstract would benefit from a one-sentence summary of the number of independent runs, statistical tests, or confidence intervals supporting the reported percentage improvements.
Notation for the risk function and kernel in the method could be made more explicit (e.g., explicit definition of the kernel bandwidth parameter) to aid reproducibility.
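A percentile bootstrap is one simple way to attach a confidence interval to such an improvement. The sketch below assumes one violation rate per independent run. The rates are invented for illustration, and `bootstrap_ci` is a hypothetical helper rather than anything from the paper's artifact.

```python
import random


def bootstrap_ci(treated, control, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in mean rate (treated minus control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        t = [rng.choice(treated) for _ in treated]  # resample runs with replacement
        c = [rng.choice(control) for _ in control]
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-run safety violation rates, for illustration only.
ptop_rates = [0.42, 0.47, 0.45, 0.50, 0.44, 0.48]
baseline_rates = [0.33, 0.36, 0.35, 0.31, 0.37, 0.34]
ci_low, ci_high = bootstrap_ci(ptop_rates, baseline_rates)
print(f"95% CI for the improvement: [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero would support the reported gain. With only a handful of runs the interval is coarse, which is itself useful to report.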

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional validation would strengthen the attribution of results to the SVGD repulsion mechanism. We address each major comment below and will incorporate the requested analyses and details in the revised version.


Referee: Evaluation section: the headline diversity (9.6%) and coverage (16.78%) gains are reported only as aggregate metrics against baselines. No ablation is described that removes or varies the SVGD repulsion term (while holding the attraction-to-risk function and adaptive seeding fixed), nor are per-mode histograms or failure-type coverage tables provided. Without this isolation, the observed lifts cannot be confidently attributed to the SVGD repulsion mechanism rather than the risk function or random-seed component, undermining the central claim that SVGD successfully distributes particles across multiple high-risk modes.

Authors: We acknowledge that the current evaluation reports aggregate metrics without an explicit ablation isolating the SVGD repulsion term. In the revised manuscript, we will add an ablation study comparing full PtoP against a variant with the repulsion term removed (while fixing the attraction-to-risk function and adaptive seeding). We will also include per-mode histograms of particle distributions and failure-type coverage tables to demonstrate spread across high-risk modes. These additions will enable clearer attribution of the reported gains to the repulsion mechanism. Revision: yes.

Referee: Method section: the description of the SVGD kernel and bandwidth selection in high-dimensional initial-condition space lacks sufficient detail or sensitivity analysis. The claim that the repulsion term prevents mode collapse therefore rests on an untested assumption; a concrete test (e.g., bandwidth sweep or repulsion-ablated runs) is needed to support the balance asserted in the abstract.

Authors: We agree that the method section would benefit from expanded detail and empirical validation on kernel and bandwidth choices. In the revision, we will provide additional specifics on the kernel function and bandwidth selection procedure. We will also report a bandwidth sensitivity sweep and include the repulsion-ablated runs (as part of the ablation study noted above) to directly test the claim that repulsion prevents mode collapse in the high-dimensional space. Revision: yes.
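A bandwidth sweep of the kind promised here could start from the median heuristic that is common in SVGD implementations. The sketch below does not reproduce the paper's kernel or bandwidth rule, which this page does not specify. The dimensionality, the particle count, the Gaussian test points, and the helper names are all assumptions.

```python
import math
import random
from statistics import median


def pairwise_sq_dists(points):
    """Squared Euclidean distance for every unordered pair of points."""
    return [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


def median_bandwidth(points):
    """Median heuristic: h = median squared distance / log(n + 1), for k = exp(-d^2 / h)."""
    return median(pairwise_sq_dists(points)) / math.log(len(points) + 1)


def mean_kernel(points, h):
    """Average off-diagonal kernel value; near 0 means particles barely interact."""
    d2 = pairwise_sq_dists(points)
    return sum(math.exp(-d / h) for d in d2) / len(d2)


rng = random.Random(0)
dim, n = 50, 30  # hypothetical seed dimensionality and particle count
points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
h_med = median_bandwidth(points)
for scale in (0.1, 0.5, 1.0, 2.0, 10.0):
    value = mean_kernel(points, scale * h_med)
    print(f"h = {scale:>4} x median heuristic: mean kernel = {value:.4f}")
```

When the mean kernel value is close to zero, both the smoothed attraction and the repulsion between distinct particles vanish, so each particle effectively follows its own gradient. Reporting this quantity alongside diversity metrics across the sweep would show where repulsion is actually active.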

Circularity Check

0 steps flagged

No significant circularity in derivation or claims


The paper's core contribution is an empirical framework (PtoP) that applies standard SVGD to generate initial conditions for ADS testing, with reported gains measured against external baselines in CARLA simulations on Apollo, Autoware, and an end-to-end system. No load-bearing step reduces a 'prediction' to a fitted parameter by construction, invokes a self-citation uniqueness theorem, or renames a known result as novel unification. The balance of attraction/repulsion is presented as a direct application of existing SVGD properties rather than a derived theorem internal to the paper. Evaluation metrics (violation rate, diversity, coverage) are computed from simulation outcomes independent of the method's internal definitions, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only, no specific free parameters, axioms, or invented entities are identifiable.



Received 2025-09-12; accepted 2026-03-24. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE146. Publication date: July 2026.