General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging
Pith reviewed 2026-05-21 11:48 UTC · model grok-4.3
The pith
General-purpose LLMs reproduce human-like intermittent control and spatial cue responses in a simplified merge but fail to match dynamic velocity reactions consistently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embedding OpenAI o3 and Google Gemini 2.5 Pro as standalone driver agents in a simplified one-dimensional merging scenario shows that both reproduce human-like intermittent operational control and tactical dependencies on spatial cues, yet neither consistently captures the human response to dynamic velocity cues, while safety performance diverges sharply between the two models. A prompt ablation study further indicates that prompt components function as model-specific inductive biases that do not transfer across LLMs.
What carries the argument
Embedding general-purpose LLMs as closed-loop driver agents in a one-dimensional merging scenario guided by structured natural-language prompts
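As a rough illustration of what "closed-loop driver agent" means here, the sketch below steps a one-dimensional two-vehicle simulation in which a policy receives a natural-language observation and returns an acceleration. Everything in it is an assumption made for illustration: the `State` fields, the observation wording in `describe`, the time step, and especially `stub_policy`, which is a hand-written rule standing in for a call to an LLM. It is not the paper's simulator or prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    ego_pos: float    # ego position along the 1-D track (m)
    ego_vel: float    # ego velocity (m/s)
    other_pos: float  # other vehicle's position (m)
    other_vel: float  # other vehicle's velocity (m/s)


def describe(state: State) -> str:
    """Render the state as a text observation (hypothetical wording)."""
    gap = state.other_pos - state.ego_pos
    rel_vel = state.other_vel - state.ego_vel
    return (f"Gap to other vehicle: {gap:.1f} m. "
            f"Relative velocity: {rel_vel:.1f} m/s. "
            f"Your speed: {state.ego_vel:.1f} m/s.")


def stub_policy(observation: str) -> float:
    """Stand-in for an LLM call; returns an acceleration in m/s^2.

    A real agent would send a prompt plus the observation to a model
    and parse the reply. Here we brake gently when the gap is small.
    """
    gap = float(observation.split("Gap to other vehicle: ")[1].split(" m")[0])
    return -1.0 if abs(gap) < 10.0 else 0.0


def run_closed_loop(policy: Callable[[str], float], state: State,
                    steps: int, dt: float = 0.1) -> List[State]:
    """Alternate between querying the policy and integrating the dynamics."""
    trajectory = [state]
    for _ in range(steps):
        accel = policy(describe(state))
        ego_vel = max(0.0, state.ego_vel + accel * dt)  # no reversing
        state = State(
            ego_pos=state.ego_pos + ego_vel * dt,
            ego_vel=ego_vel,
            other_pos=state.other_pos + state.other_vel * dt,  # constant speed
            other_vel=state.other_vel,
        )
        trajectory.append(state)
    return trajectory
```

The loop is "closed" because each action changes the state that produces the next observation, so errors in the policy's behavior can compound over the run rather than being scored one step at a time.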
If this is right
- General-purpose LLMs could serve as standalone, ready-to-use human behavior models inside automated-vehicle evaluation pipelines without parameter fitting.
- Prompt components function as model-specific inductive biases that do not transfer from one LLM to another.
- Neither model reliably reproduces the human response to dynamic velocity cues, limiting their immediate use for certain safety-critical behaviors.
- Safety performance differs sharply between the tested models, implying that model selection itself affects simulation outcomes.
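One common way to run a prompt ablation is leave-one-out: build the full prompt, then one variant per component with that component removed, and evaluate each variant with each model separately. The sketch below shows only the prompt-construction part. The component names and texts in `COMPONENTS` are invented placeholders, since the paper's actual prompt components are not reproduced on this page, and leave-one-out is an assumed design rather than the confirmed procedure.

```python
from typing import Dict, Optional

# Hypothetical prompt components, for illustration only.
COMPONENTS: Dict[str, str] = {
    "role": "You are a human driver approaching a merge point.",
    "task": "Choose an acceleration for the next time step.",
    "format": "Answer with a single number in m/s^2.",
}


def build_prompt(components: Dict[str, str],
                 exclude: Optional[str] = None) -> str:
    """Join all components in order, skipping the excluded one if given."""
    return "\n".join(text for name, text in components.items()
                     if name != exclude)


def leave_one_out(components: Dict[str, str]) -> Dict[str, str]:
    """Return the full prompt plus one variant per removed component."""
    variants = {"full": build_prompt(components)}
    for name in components:
        variants[f"without_{name}"] = build_prompt(components, exclude=name)
    return variants
```

If prompt components act as model-specific inductive biases, the same `without_*` variant would be expected to shift behavior differently for each model, which is why the variants would need to be evaluated per model rather than once.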
Where Pith is reading between the lines
- If the velocity-response gap persists across scenarios, targeted prompting or hybrid architectures that add explicit velocity modules might be needed before LLMs can be trusted in full AV safety pipelines.
- The finding that prompt elements create non-transferable biases suggests that each new LLM will require its own prompt-engineering study rather than a single universal template.
- Extending the same closed-loop comparison to multi-lane or multi-agent merges could expose whether the spatial-cue success generalizes or breaks when more dimensions are added.
Load-bearing premise
The simplified one-dimensional merging scenario together with the chosen prompt structure is sufficient to expose the general capabilities and limits of LLMs as human driver models that would carry over to richer driving environments.
What would settle it
Placing the same two LLMs into a merging scenario that adds explicit velocity changes or a second spatial dimension and measuring whether their velocity-response patterns and safety divergence remain the same as in the original one-dimensional case.
Original abstract
Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper embeds two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as closed-loop agents in a simplified one-dimensional merging scenario and compares their operational and tactical behavior against human data via quantitative and qualitative analyses plus a prompt-ablation study. It reports that both models reproduce intermittent control and spatial-cue dependence but fail to capture dynamic velocity-cue responses consistently, with sharply divergent safety outcomes; prompt components act as model-specific inductive biases that do not transfer across LLMs.
Significance. If the empirical patterns hold, the work supplies concrete evidence that off-the-shelf LLMs can serve as ready-to-use, parameter-free human-behavior references for AV virtual safety assessment, directly addressing the interpretability-flexibility trade-off noted in the introduction. The prompt-ablation results further identify a practical mechanism (model-specific inductive biases) that future users must control. The 1-D simplification, however, leaves open whether the observed successes and failures generalize beyond the narrow setting.
major comments (2)
- [§4] §4 (Results): the quantitative comparisons are described only at the level of 'reproduce human-like intermittent operational control' and 'diverges sharply' without reporting concrete metrics, sample sizes, statistical tests, or controls for prompt sensitivity; this absence prevents assessment of whether the reported differences are robust or merely qualitative impressions.
- [§5] §5 (Discussion) and §3 (Scenario definition): the central claim that LLMs capture selected human-like behaviors while exposing specific failure modes rests on the assumption that the chosen one-dimensional merging task is representative; no variation in dimensionality, lateral dynamics, or multi-agent interactions is performed, so it remains possible that the velocity-cue failures and safety gaps are artifacts of the extreme simplification rather than stable properties of the LLMs.
minor comments (2)
- [Abstract] Abstract and §2: the phrase 'parameter-free' is used without clarifying that the LLMs still require prompt engineering; a brief sentence acknowledging this would avoid reader confusion.
- [§4] Figure captions and §4: several qualitative trajectory plots lack axis labels or legend entries for the human reference data; adding these would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments have prompted us to strengthen the quantitative presentation of results and to more explicitly address the scope and limitations of the one-dimensional scenario. We respond to each major comment below.
- Referee: [§4] §4 (Results): the quantitative comparisons are described only at the level of 'reproduce human-like intermittent operational control' and 'diverges sharply' without reporting concrete metrics, sample sizes, statistical tests, or controls for prompt sensitivity; this absence prevents assessment of whether the reported differences are robust or merely qualitative impressions.
  Authors: We agree that the Results section would benefit from more explicit quantitative detail. Although the manuscript already performs quantitative comparisons against human data, we have revised §4 to report concrete metrics, including mean control-update intervals and their standard deviations for intermittency, Pearson correlations between model actions and spatial cues, and safety indicators such as minimum time-to-collision distributions and collision rates. Human data sample sizes (number of trials and participants) are now stated explicitly, and we have added a prompt-sensitivity analysis that varies key prompt components while holding the scenario fixed. These additions allow readers to evaluate the robustness of the reported reproduction of intermittent control, spatial-cue dependence, and the divergence in velocity responses and safety outcomes.
  Revision: yes
- Referee: [§5] §5 (Discussion) and §3 (Scenario definition): the central claim that LLMs capture selected human-like behaviors while exposing specific failure modes rests on the assumption that the chosen one-dimensional merging task is representative; no variation in dimensionality, lateral dynamics, or multi-agent interactions is performed, so it remains possible that the velocity-cue failures and safety gaps are artifacts of the extreme simplification rather than stable properties of the LLMs.
  Authors: We acknowledge that the one-dimensional simplification is a deliberate design choice that restricts direct generalization. The study was framed as a controlled case study precisely to isolate operational intermittency and cue dependencies without confounding lateral or multi-agent effects. In the revised manuscript we have expanded both §3 and §5 to discuss this limitation explicitly, noting that velocity-cue inconsistencies and safety divergences could be modulated by added degrees of freedom or interactive agents. At the same time, the prompt-ablation findings on model-specific inductive biases are less likely to be artifacts of dimensionality and remain a practical takeaway. We have also added a forward-looking paragraph outlining extensions to higher-fidelity settings.
  Revision: yes
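The metrics named in the response to the §4 comment (control-update intervals, Pearson correlation, minimum time-to-collision) can be sketched in a few lines. The definitions below are assumptions chosen for illustration: a "control update" is taken to be any change in the commanded acceleration between consecutive samples, and time-to-collision is taken to be gap divided by closing speed, defined only while the vehicles are closing. The paper may define these quantities differently.

```python
import math
from statistics import mean
from typing import List, Sequence


def update_intervals(accels: Sequence[float], dt: float) -> List[float]:
    """Durations between successive changes in the commanded acceleration."""
    change_times = [i * dt for i in range(1, len(accels))
                    if accels[i] != accels[i - 1]]
    return [later - earlier
            for earlier, later in zip(change_times, change_times[1:])]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def min_ttc(gaps: Sequence[float], closing_speeds: Sequence[float]) -> float:
    """Minimum time-to-collision over samples where the gap is shrinking."""
    ttcs = [g / v for g, v in zip(gaps, closing_speeds) if v > 0]
    return min(ttcs) if ttcs else math.inf
```

For example, `update_intervals([0, 0, 1, 1, 1, 0, 0, 2], dt=0.5)` finds changes at 1.0 s, 2.5 s, and 3.5 s and returns the two intervals between them; the mean and standard deviation of such intervals are what the response describes as the intermittency measure.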
Circularity Check
No circularity: empirical comparisons to external human data
The paper conducts an empirical study embedding LLMs as closed-loop agents in a 1D merging simulation and compares outputs quantitatively and qualitatively to independent human driver data. No mathematical derivations, parameter fits, or predictions are defined within the paper that reduce to its own inputs by construction. Claims rest on external benchmarks and prompt ablations rather than self-definitional loops, fitted-input renamings, or load-bearing self-citations. The work is self-contained against external human data and simulation results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The simplified one-dimensional merging scenario captures the key operational and tactical aspects of human driving behavior relevant to AV safety assessment.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.