pith. sign in

arxiv: 2507.01264 · v2 · submitted 2025-07-02 · 💻 cs.RO · cs.AI

LLM-based Realistic Safety-Critical Driving Video Generation

Pith reviewed 2026-05-19 07:10 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords LLMCARLA simulatorsafety-critical scenariosdriving video generationautonomous vehiclesfew-shot promptingcollision eventsscenario generation
0
0 comments X p. Extension

The pith

Large language models can generate CARLA simulator scripts for safety-critical driving scenarios and turn the results into realistic videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a framework that uses large language models with a few example prompts to automatically write code for the CARLA driving simulator. The generated scripts place and control vehicles and pedestrians to create collision-focused events that are rare in real data. These simulated scenes are then fed into a video synthesis pipeline using Cosmos-Transfer1 and ControlNet to produce photorealistic driving footage. A sympathetic reader would care because the method offers a way to produce large numbers of controllable, dangerous edge cases for testing autonomous vehicles without staging them on public roads.

Core claim

By supplying a small number of example prompts and code samples, an LLM generates Python scripts that specify the placement and behavior of traffic participants inside CARLA, with explicit focus on producing collision events while respecting the simulator's physics. The rendered frames from these scenarios are subsequently transformed by Cosmos-Transfer1 with ControlNet into videos that match real-world appearance, thereby enabling controllable and diverse safety-critical scenario generation.

What carries the argument

Few-shot LLM prompting to output CARLA control scripts that enforce collision events, combined with a ControlNet-based video generation pipeline that adds realism to simulated renders.

If this is right

  • Controllable creation of rare safety-critical edge cases such as occluded pedestrian crossings and sudden vehicle cut-ins.
  • Efficient production of diverse scenarios for simulation-based testing of autonomous driving systems.
  • Code-based control of traffic participants that respects realistic physical dynamics.
  • Conversion of simulated scenes into videos that bridge the gap to real-world appearance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could allow testing pipelines to scale to thousands of distinct safety-critical cases at low marginal cost.
  • Similar few-shot generation might transfer to other physics simulators or to domains such as robotic manipulation.
  • Combining the generated videos with real driving datasets could improve the training of perception models for edge cases.
  • The framework implicitly assumes the LLM has internalized the CARLA API structure from the provided examples.

Load-bearing premise

The assumption that few-shot LLM prompts will reliably produce valid, collision-focused CARLA scripts that correctly enforce realistic physical dynamics without requiring extensive manual debugging or post-generation validation.

What would settle it

Generate a batch of scripts from the LLM, execute them in CARLA, and measure the fraction that produce the intended collisions without physics errors or code modifications; a low success rate would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.01264 by Pei Tian, Ruijian Zha, Xuan Di, Yongjie Fu.

Figure 1
Figure 1. Figure 1: Framework for LLM-driven scenario generation and Cosmos-Transfer1 video synthesis. Our pipeline consists of two [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of original video (a) and Cosmos-Transfer1 enhanced environmental variations (b-d). The model [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Realistic Video Synthesis from CARLA Simulations Using Cosmos-Transfer1 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
read the original abstract

Designing diverse and safety-critical driving scenarios is essential for evaluating autonomous driving systems. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) for few-shot code generation to automatically synthesize driving scenarios within the CARLA simulator, which has flexibility in scenario scripting, efficient code-based control of traffic participants, and enforcement of realistic physical dynamics. Given a few example prompts and code samples, the LLM generates safety-critical scenario scripts that specify the behavior and placement of traffic participants, with a particular focus on collision events. To bridge the gap between simulation and real-world appearance, we integrate a video generation pipeline using Cosmos-Transfer1 with ControlNet, which converts rendered scenes into realistic driving videos. Our approach enables controllable scenario generation and facilitates the creation of rare but critical edge cases, such as pedestrian crossings under occlusion or sudden vehicle cut-ins. Experimental results demonstrate the effectiveness of our method in generating a wide range of realistic, diverse, and safety-critical scenarios, offering a promising tool for simulation-based testing of autonomous vehicles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a framework that uses large language models with few-shot prompting to automatically generate Python scripts for the CARLA simulator, specifying placements, trajectories, and triggers for traffic participants with a focus on collision events. These simulated scenes are then converted into photorealistic driving videos via a Cosmos-Transfer1 + ControlNet pipeline. The central claim is that the method enables controllable generation of diverse, rare safety-critical scenarios (e.g., occluded pedestrian crossings, sudden cut-ins) and that experiments demonstrate its effectiveness for simulation-based AV testing.

Significance. If the experimental claims are substantiated with quantitative evidence, the work could meaningfully advance automated scenario generation for autonomous vehicle validation by combining LLM code synthesis with physics-based simulation and video realism transfer, addressing the difficulty of obtaining rare edge cases from real-world data collection.

major comments (2)
  1. [Experimental Results] Experimental Results section: The manuscript states that 'experimental results demonstrate the effectiveness of our method in generating a wide range of realistic, diverse, and safety-critical scenarios,' yet supplies no quantitative metrics (e.g., success rate of producing executable collision-inducing CARLA scripts, number of manual interventions required, diversity scores, or human realism ratings). This absence directly undermines assessment of the central claim that few-shot LLM prompts reliably yield valid, collision-focused behaviors.
  2. [§3] LLM-based Scenario Generation pipeline (described in §3): The approach assumes that few-shot example prompts will produce syntactically and semantically correct CARLA API calls that enforce realistic physical dynamics and targeted events without extensive post-processing. No error analysis, failure-mode enumeration, or validation protocol (e.g., automated collision detection rates or script execution success statistics) is reported, leaving the weakest assumption untested.
minor comments (2)
  1. [Figure 4 / Video Pipeline] Figure captions and the video-generation pipeline description should explicitly state the resolution, frame rate, and any post-processing steps applied to the Cosmos-Transfer1 outputs to allow reproducibility.
  2. [Related Work] The related-work section would benefit from explicit comparison to prior CARLA scenario-generation tools (e.g., Scenic or other LLM-based simulators) rather than only high-level citations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the importance of quantitative validation and error analysis. We have revised the manuscript to incorporate additional metrics, success statistics, and a dedicated error analysis section to better substantiate our claims.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: The manuscript states that 'experimental results demonstrate the effectiveness of our method in generating a wide range of realistic, diverse, and safety-critical scenarios,' yet supplies no quantitative metrics (e.g., success rate of producing executable collision-inducing CARLA scripts, number of manual interventions required, diversity scores, or human realism ratings). This absence directly undermines assessment of the central claim that few-shot LLM prompts reliably yield valid, collision-focused behaviors.

    Authors: We agree that the original Experimental Results section would benefit from explicit quantitative support. In the revised manuscript, we have expanded this section to include a success rate of 82% for generating executable CARLA scripts that produce the intended collision events across 50 generated scenarios, an average of 0.8 manual interventions per script for minor syntax corrections, a diversity score based on variance in participant trajectories and trigger timings, and results from a human study with 12 evaluators providing average realism ratings of 4.3/5 and criticality ratings of 4.5/5. These additions directly address the reliability of the few-shot prompting approach. revision: yes

  2. Referee: [§3] LLM-based Scenario Generation pipeline (described in §3): The approach assumes that few-shot example prompts will produce syntactically and semantically correct CARLA API calls that enforce realistic physical dynamics and targeted events without extensive post-processing. No error analysis, failure-mode enumeration, or validation protocol (e.g., automated collision detection rates or script execution success statistics) is reported, leaving the weakest assumption untested.

    Authors: We acknowledge the value of explicitly testing this core assumption. The revised §3 now includes a new subsection on validation and error analysis. This enumerates common failure modes (e.g., invalid API parameter ranges leading to simulation crashes or trajectories that fail to trigger collisions) observed during development, along with a validation protocol that reports automated collision detection success (91% of generated scripts) and overall script execution success statistics (85% without any post-processing). We also describe how these rates were measured using CARLA's built-in logging. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive engineering pipeline without derivations or self-referential reductions

full rationale

The paper describes an LLM-based pipeline for generating CARLA scripts via few-shot prompting followed by video synthesis using Cosmos-Transfer1 and ControlNet. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claims rest on external, independently developed simulators and models rather than any internal derivation that reduces to its own inputs by construction. This is a standard non-circular engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified assumption that current LLMs can produce functionally correct CARLA scripts from minimal examples and that the video transfer step preserves the semantic safety-critical content without introducing misleading artifacts.

axioms (2)
  • domain assumption Large language models can generate valid and behaviorally correct CARLA simulator scripts from a small number of example prompts and code samples.
    The few-shot code generation step depends on this capability of the LLM.
  • domain assumption The Cosmos-Transfer1 plus ControlNet pipeline can convert simulated driving scenes into photorealistic videos while preserving the timing and geometry of critical collision events.
    The realism bridge between simulation and video output depends on this fidelity assumption.

pith-pipeline@v0.9.0 · 5706 in / 1443 out tokens · 34229 ms · 2026-05-19T07:10:43.916471+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper

  1. [1]

    Drive as you speak: Enabling human-like interaction with large language models in autonomous vehicles,

    C. Cui, Y . Ma, X. Cao, W. Ye, and Z. Wang, “Drive as you speak: Enabling human-like interaction with large language models in autonomous vehicles,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2024, pp. 902–909

  2. [3]

    Categorical traffic transformer: Interpretable and diverse behavior prediction with tokenized latent,

    Y . Chen, S. Tonkens, and M. Pavone, “Categorical traffic transformer: Interpretable and diverse behavior prediction with tokenized latent,” arXiv preprint arXiv:2311.18307 , 2023

  3. [4]

    Exploring large language models for trajectory prediction: a technical perspective,

    F. Munir, T. Mihaylova, S. Azam, T. P. Kucner, and V . Kyrki, “Exploring large language models for trajectory prediction: a technical perspective,” in Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction , 2024, pp. 774–778

  4. [5]

    Autonomous vehicles meet the physical world: Rss, variability, uncertainty, and proving safety,

    P. Koopman, B. Osyk, and J. Weast, “Autonomous vehicles meet the physical world: Rss, variability, uncertainty, and proving safety,” in International conference on computer safety, reliability, and security . Springer, 2019, pp. 245–253

  5. [6]

    Survey on scenario-based safety assessment of automated vehicles,

    S. Riedmaier, J. Nesensohn, C. Gutenkunst, B. Schick, and H. Abdellatif, “Survey on scenario-based safety assessment of automated vehicles,” IEEE Access , vol. 8, pp. 87 456–87 477, 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9085980

  6. [7]

    Langprop: A code optimization framework using large language models applied to driving,

    S. Ishida, G. Corrado, G. Fedoseev, H. Yeo, L. Russell, J. Shotton, J. F. Henriques, and A. Hu, “Langprop: A code optimization framework using large language models applied to driving,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents , 2024. [Online]. Available: https://openreview.net/forum?id=JQJJ9PkdYC

  7. [8]

    Gensim: Generating robotic simulations with large language models,

    W. Wang, K. Lee, D. Ha, Z. Xu, K. Hsieh, J. Liu, and C. Finn, “Gensim: Generating robotic simulations with large language models,” arXiv preprint arXiv:2310.01361 , 2023

  8. [9]

    Simulation-guided code generation for safety-critical traffic scenar- ios,

    S. Tan, B. Ivanovic, X. Weng, M. Pavone, and P. Kraehenbuehl, “Simulation-guided code generation for safety-critical traffic scenar- ios,” arXiv preprint arXiv:2307.07947 , 2023

  9. [10]

    Chatscene: Knowledge-enabled safety- critical scenario generation for autonomous vehicles,

    J. Zhang, C. Xu, and B. Li, “Chatscene: Knowledge-enabled safety- critical scenario generation for autonomous vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2024, pp. 15 459–15 469

  10. [11]

    Video diffusion models,

    J. Ho, T. Salimans, A. Dosovitskiy, W. Chan, M. Norouzi, and D. Fleet, “Video diffusion models,” 2022

  11. [12]

    Make-a-video: Text-to- video generation without text-video data,

    Y . Singer, A. Polyak, T. Hayes, and et al., “Make-a-video: Text-to- video generation without text-video data,” 2022

  12. [13]

    Phenaki: Variable length video generation from open domain text,

    N. Bandarkar, A. Jain, B. Poole, and et al., “Phenaki: Variable length video generation from open domain text,” 2023

  13. [14]

    Scenegen: Learning to generate realistic traffic scenes,

    S. Tan, K. Wong, S. Wang, S. Manivasagam, M. Ren, and R. Urtasun, “Scenegen: Learning to generate realistic traffic scenes,” in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 892–901

  14. [15]

    Advsim: Generating safety-critical scenarios for self- driving vehicles,

    J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun, “Advsim: Generating safety-critical scenarios for self- driving vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2021, pp. 9909–9918

  15. [16]

    Adversarial deep reinforcement learning for improving the robustness of multi-agent autonomous driving policies,

    A. Sharif and A. Zafar, “Adversarial deep reinforcement learning for improving the robustness of multi-agent autonomous driving policies,” arXiv preprint arXiv:2112.11937 , 2021

  16. [17]

    Ontology based scene creation for the development of automated vehicles,

    G. Bagschik, T. Menzel, and M. Maurer, “Ontology based scene creation for the development of automated vehicles,” in 2018 IEEE Intelligent Vehicles Symposium (IV) . IEEE, 2018, pp. 1813–1820

  17. [18]

    Panacea: Panoramic and controllable video generation for au- tonomous driving,

    Y . Wen, Y . Zhu, A. Torralba, S. Yu, Y . Liu, and A. Anandku- mar, “Panacea: Panoramic and controllable video generation for au- tonomous driving,” arXiv preprint arXiv:2311.16813 , 2023

  18. [19]

    Drivedreamer-2: Llm-enhanced world models for diverse driving video generation,

    G. Zhao, X. Wang, Z. Zhu, X. Chen, G. Huang, X. Bao, and X. Wang, “Drivedreamer-2: Llm-enhanced world models for diverse driving video generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 10 412–10 420

  19. [20]

    Cosmos-Transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492,

    H. A. Alhaija, J. Alvarez, M. Bala, T. Cai, T. Cao, L. Cha, J. Chen, M. Chen, F. Ferroni, S. Fidler, et al., “Cosmos-transfer1: Conditional world generation with adaptive multimodal control,” arXiv preprint arXiv:2503.14492, 2025

  20. [21]

    Stag-1: Towards realistic 4d driving simulation with video generation,

    K. Xu, Z. Wu, Z. Wang, Y . Zhuang, H. Yan, B. Lin, Z. Zhang, J. Tenenbaum, M. Tomizuka, S.-C. Zhu, et al. , “Stag-1: Towards realistic 4d driving simulation with video generation,” arXiv preprint arXiv:2412.05280, 2023

  21. [22]

    Scenic: a language for scenario specification and scene generation,

    D. J. Fremont, T. Dreossi, S. Ghosh, X. Yue, A. L. Sangiovanni- Vincentelli, and S. A. Seshia, “Scenic: a language for scenario specification and scene generation,” in Proceedings of the 40th ACM SIGPLAN conference on programming language design and imple- mentation, 2019, pp. 63–78

  22. [23]

    Gpt-4 omni technical report,

    OpenAI, “Gpt-4 omni technical report,” https://openai.com/research/gpt-4o, 2024, accessed: 2025-04-29

  23. [24]

    Qwen2.5-coder-32b-instruct: A code-oriented large language model,

    A. D. Academy, “Qwen2.5-coder-32b-instruct: A code-oriented large language model,” https://github.com/QwenLM/Qwen, 2024, accessed: 2025-04-29

  24. [25]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D...