arxiv: 2604.09609 · v2 · pith:B52XMRC2new · submitted 2026-03-11 · 💻 cs.AI · cs.RO

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

Pith reviewed 2026-05-21 11:48 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords large language models · human driver behavior · merging scenario · autonomous vehicle safety · prompt ablation · intermittent control · behavior simulation

The pith

General-purpose LLMs reproduce human-like intermittent control and spatial cue responses in a simplified merge but fail to match dynamic velocity reactions consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper places two general-purpose LLMs into a one-dimensional merging task as closed-loop driver agents and measures how closely their actions match recorded human drivers. It finds that both models produce intermittent operational control and base tactical decisions on the positions of nearby vehicles in ways that align with humans. Neither model, however, consistently reproduces the way people respond to changing speeds, and the two models produce markedly different levels of safety. The work also shows that individual parts of the prompt act as strong but non-transferable biases for each LLM. If these patterns hold, LLMs could be dropped into vehicle safety simulations without parameter fitting, though each model would still need its own prompt design.

Core claim

Embedding OpenAI o3 and Google Gemini 2.5 Pro as standalone driver agents in a simplified one-dimensional merging scenario shows that both reproduce human-like intermittent operational control and tactical dependencies on spatial cues, yet neither consistently captures the human response to dynamic velocity cues, while safety performance diverges sharply between the two models. A prompt ablation study further indicates that prompt components function as model-specific inductive biases that do not transfer across LLMs.
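
To make the ablation idea concrete, here is a minimal Python sketch of a leave-one-out scheme over prompt components. The component names and texts are hypothetical placeholders, and leave-one-out is itself an assumption; the paper's actual prompt and ablation design are not reproduced here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical prompt components (illustrative only, not the paper's prompt).
COMPONENTS: Dict[str, str] = {
    "role": "You are a human driver approaching a merge.",
    "task": "Choose an acceleration for the next time step.",
    "safety": "Avoid collisions with the other vehicle.",
    "format": "Answer with a single number in m/s^2.",
}


def build_prompt(components: Dict[str, str], drop: Optional[str] = None) -> str:
    """Join all components in order, optionally leaving one out."""
    return "\n".join(text for name, text in components.items() if name != drop)


def leave_one_out(components: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the baseline prompt plus one variant per removed component."""
    variants = [("baseline", build_prompt(components))]
    variants += [
        (f"no_{name}", build_prompt(components, drop=name)) for name in components
    ]
    return variants


if __name__ == "__main__":
    for label, prompt in leave_one_out(COMPONENTS):
        print(f"--- {label} ---\n{prompt}\n")
```

Running every variant against each LLM separately, rather than tuning on one model and reusing the result, is what would expose the model-specific effects described above.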

What carries the argument

Embedding general-purpose LLMs as closed-loop driver agents in a one-dimensional merging scenario guided by structured natural-language prompts
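
A closed-loop setup of this kind can be sketched as follows. This is a minimal illustration, not the authors' simulator: the state fields, prompt wording, action vocabulary, kinematics, and `stub_agent` (standing in for a call to an LLM) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    ego_pos: float    # m, position along the shared 1-D axis
    ego_vel: float    # m/s
    other_pos: float  # m
    other_vel: float  # m/s


def render_prompt(s: State) -> str:
    """Turn the simulator state into a structured natural-language observation."""
    gap = s.other_pos - s.ego_pos
    return (
        "You are driving toward a merge point.\n"
        f"Your speed: {s.ego_vel:.1f} m/s. Other vehicle speed: {s.other_vel:.1f} m/s.\n"
        f"Gap to the other vehicle: {gap:.1f} m (positive means it is ahead).\n"
        "Reply with one word: ACCELERATE, HOLD, or BRAKE."
    )


# Illustrative mapping from reply words to accelerations in m/s^2.
ACTIONS = {"ACCELERATE": 1.0, "HOLD": 0.0, "BRAKE": -1.0}


def parse_action(reply: str) -> float:
    """Map the agent's reply to an acceleration; unknown replies mean no change."""
    return ACTIONS.get(reply.strip().upper(), 0.0)


def stub_agent(prompt: str) -> str:
    """Placeholder for an LLM call; always holds speed."""
    return "HOLD"


def run_episode(agent: Callable[[str], str], state: State,
                steps: int = 50, dt: float = 0.1) -> List[State]:
    """Observe, query the agent, apply its action, advance the simulator, repeat."""
    trajectory = [state]
    for _ in range(steps):
        accel = parse_action(agent(render_prompt(state)))
        ego_vel = max(0.0, state.ego_vel + accel * dt)  # no reversing
        state = State(
            ego_pos=state.ego_pos + ego_vel * dt,
            ego_vel=ego_vel,
            other_pos=state.other_pos + state.other_vel * dt,  # constant speed
            other_vel=state.other_vel,
        )
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    final = run_episode(stub_agent, State(0.0, 10.0, 20.0, 10.0))[-1]
    print(final)
```

The key property is that the agent's output feeds back into the next observation, so errors can compound over an episode instead of being scored one step at a time.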

If this is right

  • General-purpose LLMs could serve as standalone, ready-to-use human behavior models inside automated-vehicle evaluation pipelines without parameter fitting.
  • Prompt components function as model-specific inductive biases that do not transfer from one LLM to another.
  • Neither model reliably reproduces the human response to dynamic velocity cues, limiting their immediate use for certain safety-critical behaviors.
  • Safety performance differs sharply between the tested models, implying that model selection itself affects simulation outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the velocity-response gap persists across scenarios, targeted prompting or hybrid architectures that add explicit velocity modules might be needed before LLMs can be trusted in full AV safety pipelines.
  • The finding that prompt elements create non-transferable biases suggests that each new LLM will require its own prompt-engineering study rather than a single universal template.
  • Extending the same closed-loop comparison to multi-lane or multi-agent merges could expose whether the spatial-cue success generalizes or breaks when more dimensions are added.

Load-bearing premise

The simplified one-dimensional merging scenario together with the chosen prompt structure is sufficient to expose the general capabilities and limits of LLMs as human driver models that would carry over to richer driving environments.

What would settle it

Placing the same two LLMs into a merging scenario that adds explicit velocity changes or a second spatial dimension and measuring whether their velocity-response patterns and safety divergence remain the same as in the original one-dimensional case.

Figures

Figures reproduced from arXiv: 2604.09609 by Arkady Zgonnikov, Samir H.A. Mohammad, Wouter Mooi.

Figure 1: Closed-loop zero-shot LLM-based driver-agent framework. The simulator provided the current interaction state (top; 1-D merging scene), which [...]
Figure 2: Baseline behavioral comparison between humans and LLM agents (collisions excluded).

Original abstract

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper embeds two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as closed-loop agents in a simplified one-dimensional merging scenario and compares their operational and tactical behavior against human data via quantitative and qualitative analyses plus a prompt-ablation study. It reports that both models reproduce intermittent control and spatial-cue dependence but fail to capture dynamic velocity-cue responses consistently, with sharply divergent safety outcomes; prompt components act as model-specific inductive biases that do not transfer across LLMs.

Significance. If the empirical patterns hold, the work supplies concrete evidence that off-the-shelf LLMs can serve as ready-to-use, parameter-free human-behavior references for AV virtual safety assessment, directly addressing the interpretability-flexibility trade-off noted in the introduction. The prompt-ablation results further identify a practical mechanism (model-specific inductive biases) that future users must control. The 1-D simplification, however, leaves open whether the observed successes and failures generalize beyond the narrow setting.

major comments (2)
  1. [§4] §4 (Results): the quantitative comparisons are described only at the level of 'reproduce human-like intermittent operational control' and 'diverges sharply' without reporting concrete metrics, sample sizes, statistical tests, or controls for prompt sensitivity; this absence prevents assessment of whether the reported differences are robust or merely qualitative impressions.
  2. [§5] §5 (Discussion) and §3 (Scenario definition): the central claim that LLMs capture selected human-like behaviors while exposing specific failure modes rests on the assumption that the chosen one-dimensional merging task is representative; no variation in dimensionality, lateral dynamics, or multi-agent interactions is performed, so it remains possible that the velocity-cue failures and safety gaps are artifacts of the extreme simplification rather than stable properties of the LLMs.
minor comments (2)
  1. [Abstract] Abstract and §2: the phrase 'parameter-free' is used without clarifying that the LLMs still require prompt engineering; a brief sentence acknowledging this would avoid reader confusion.
  2. [§4] Figure captions and §4: several qualitative trajectory plots lack axis labels or legend entries for the human reference data; adding these would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments have prompted us to strengthen the quantitative presentation of results and to more explicitly address the scope and limitations of the one-dimensional scenario. We respond to each major comment below.

  1. Referee: [§4] §4 (Results): the quantitative comparisons are described only at the level of 'reproduce human-like intermittent operational control' and 'diverges sharply' without reporting concrete metrics, sample sizes, statistical tests, or controls for prompt sensitivity; this absence prevents assessment of whether the reported differences are robust or merely qualitative impressions.

    Authors: We agree that the Results section would benefit from more explicit quantitative detail. Although the manuscript already performs quantitative comparisons against human data, we have revised §4 to report concrete metrics, including mean control-update intervals and their standard deviations for intermittency, Pearson correlations between model actions and spatial cues, and safety indicators such as minimum time-to-collision distributions and collision rates. Human data sample sizes (number of trials and participants) are now stated explicitly, and we have added a prompt-sensitivity analysis that varies key prompt components while holding the scenario fixed. These additions allow readers to evaluate the robustness of the reported reproduction of intermittent control, spatial-cue dependence, and the divergence in velocity responses and safety outcomes.

    Revision: yes

  2. Referee: [§5] §5 (Discussion) and §3 (Scenario definition): the central claim that LLMs capture selected human-like behaviors while exposing specific failure modes rests on the assumption that the chosen one-dimensional merging task is representative; no variation in dimensionality, lateral dynamics, or multi-agent interactions is performed, so it remains possible that the velocity-cue failures and safety gaps are artifacts of the extreme simplification rather than stable properties of the LLMs.

    Authors: We acknowledge that the one-dimensional simplification is a deliberate design choice that restricts direct generalization. The study was framed as a controlled case study precisely to isolate operational intermittency and cue dependencies without confounding lateral or multi-agent effects. In the revised manuscript we have expanded both §3 and §5 to discuss this limitation explicitly, noting that velocity-cue inconsistencies and safety divergences could be modulated by added degrees of freedom or interactive agents. At the same time, the prompt-ablation findings on model-specific inductive biases are less likely to be artifacts of dimensionality and remain a practical takeaway. We have also added a forward-looking paragraph outlining extensions to higher-fidelity settings.

    Revision: yes
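
The kinds of metrics named in the first response could be computed along these lines. This is a minimal sketch, assuming discretely sampled control inputs, gaps, and closing speeds; the function names and exact definitions are illustrative and may differ from those used in the paper.

```python
import math
from statistics import mean
from typing import List, Sequence


def update_intervals(actions: Sequence[float], dt: float) -> List[float]:
    """Durations (s) between successive changes of the control input."""
    change_steps = [i for i in range(1, len(actions)) if actions[i] != actions[i - 1]]
    return [(b - a) * dt for a, b in zip(change_steps, change_steps[1:])]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-zero variance in both."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def min_ttc(gaps: Sequence[float], closing_speeds: Sequence[float]) -> float:
    """Smallest gap / closing speed over samples where the gap is shrinking.

    Returns infinity if the vehicles never approach each other.
    """
    ttcs = [g / v for g, v in zip(gaps, closing_speeds) if v > 0]
    return min(ttcs, default=math.inf)


def collision_rate(collided: Sequence[bool]) -> float:
    """Fraction of trials that ended in a collision."""
    return sum(collided) / len(collided)


if __name__ == "__main__":
    actions = [0, 0, 1, 1, 1, 0, 0, -1]
    print(update_intervals(actions, dt=0.1))
    print(min_ttc([10.0, 5.0], [2.0, -1.0]))
    print(collision_rate([True, False, False, False]))
```

Reporting the full distribution of update intervals and minimum time-to-collision values for humans and for each model, rather than only the means, would let readers judge the claimed similarity and divergence directly.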

Circularity Check

0 steps flagged

No circularity: empirical comparisons to external human data

The paper conducts an empirical study embedding LLMs as closed-loop agents in a 1D merging simulation and compares outputs quantitatively and qualitatively to independent human driver data. No mathematical derivations, parameter fits, or predictions are defined within the paper that reduce to its own inputs by construction. Claims rest on external benchmarks and prompt ablations rather than self-definitional loops, fitted-input renamings, or load-bearing self-citations. The work is self-contained against external human data and simulation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper assumes that the chosen 1D merging task and prompt engineering choices are representative enough to draw conclusions about LLMs as general human behavior models. No new physical entities or mathematical axioms are introduced.

axioms (1)
  • Domain assumption: The simplified one-dimensional merging scenario captures the key operational and tactical aspects of human driving behavior relevant to AV safety assessment.
    Invoked when the authors treat results from this scenario as informative for broader use of LLMs as human driver models.

pith-pipeline@v0.9.0 · 5754 in / 1186 out tokens · 22530 ms · 2026-05-21T11:48:04.397896+00:00 · methodology
