Recognition: unknown
VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
Pith reviewed 2026-05-08 02:13 UTC · model grok-4.3
The pith
Vision-language reasoning enables autonomous vehicles to classify pedestrian intent with 92% accuracy and cut conflicts by over 70% while shortening traversal times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLM-VPI is a multimodal framework with a perception layer for visual and kinematic data, a reasoning layer using Qwen3-VL 8B for scene understanding and GPT-OSS 20B for few-shot intent reasoning, and a tiered controller applying age-specific braking margins. Tested in 112 CARLA scenarios, it reaches 92.3% intent classification accuracy compared to 78.4% for rule-based baselines, reduces false-alarm rate to 2.8%, mean traversal time to 11.8 seconds, conflicts to 33 from 124, and improves minimum time-to-collision to 4.47 seconds from 1.92 seconds. On 24 real-world PIE scenarios it achieves 87.5% accuracy, and demographic-adaptive control further reduces conflicts by 60% for children and 54.5%
What carries the argument
The VLM-VPI reasoning layer that uses vision-language models to infer pedestrian intent and age category from visual input, feeding into an age-specific tiered safety controller for vehicle control decisions.
Load-bearing premise
The off-the-shelf vision-language models can reliably infer both pedestrian intent and age from visual inputs across different lighting, poses, and real-world conditions, with correctly calibrated age-specific braking margins.
What would settle it
Running the system on a new set of real-world videos where the models frequently misclassify pedestrian intent or age, resulting in either more conflicts or higher false alarms than the baseline methods.
read the original abstract
Autonomous driving systems often infer pedestrian yielding behavior from geometric and kinematic cues alone, limiting their ability to reason about visual scene context and age-dependent behavioral variability. This limitation can produce delayed interventions in safety-critical encounters and unnecessary braking in benign interactions. This work introduces Vision-Language Model-based Vehicle-Pedestrian Interaction (VLM-VPI), a multimodal reasoning framework for pedestrian intent understanding and yielding-aware control in autonomous driving. The system combines three components: a multimodal perception layer that captures visual and kinematic observations, a reasoning layer that uses Qwen3-VL 8B for visual scene understanding and GPT-OSS 20B for few-shot intent reasoning, and a tiered safety controller that applies age-specific braking margins for children, adults, and seniors. In 112 CARLA scenarios, VLM-VPI achieves 92.3% intent classification accuracy, outperforming a rule-based baseline (78.4%), supervised trajectory models (73.5-82.4%), and a zero-shot LLM configuration (88.4%). Validation on 24 real-world PIE scenarios yields 87.5% accuracy, indicating functional sim-to-real transferability. Across 200 simulation cases, VLM-VPI reduces the false-alarm rate from 7.4% to 2.8% and mean intersection traversal time from 13.5 s to 11.8 s. Conflict occurrences decrease from 124 to 33, while mean minimum time-to-collision improves from 1.92 s to 4.47 s. Demographic-adaptive control further reduces conflicts by 60% for children and 54.5% for seniors compared with uniform control. These results show that an explicit vision-language reasoning layer can improve both safety and efficiency by linking pedestrian intent, demographic context, and vehicle control decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VLM-VPI, a multimodal framework that uses Qwen3-VL 8B for visual scene understanding and GPT-OSS 20B for few-shot intent and age reasoning, feeding into a tiered safety controller with age-specific braking margins. It reports 92.3% intent classification accuracy on 112 CARLA scenarios (outperforming rule-based, supervised trajectory, and zero-shot LLM baselines), 87.5% on 24 PIE real-world scenarios, plus aggregate gains in false-alarm rate, traversal time, conflicts (124 to 33), and minimum TTC (1.92 s to 4.47 s), with further conflict reductions of 60% for children and 54.5% for seniors under demographic-adaptive control.
Significance. If the results hold, the work provides concrete evidence that explicit vision-language reasoning can link pedestrian intent, demographic context, and control decisions to improve both safety and efficiency over purely geometric/kinematic baselines. The evaluation against independent baselines (rule-based, supervised models, zero-shot LLM) and external datasets (CARLA, PIE) strengthens the claims and supports reproducibility. The demographic-adaptive component is a distinctive contribution, though its impact requires tighter validation.
major comments (3)
- [Abstract and Results] Abstract and Results: The headline safety gains from demographic-adaptive control (60% and 54.5% further conflict reductions for children and seniors) rest on age categorization by GPT-OSS 20B, yet no age-classification accuracy, confusion matrix, or ablation of adaptive vs. uniform control under noisy age labels is reported. This is load-bearing for attributing the tiered-controller improvements to the claimed mechanism.
- [Real-world validation] Real-world validation (24 PIE scenarios): The small sample size, absence of error bars, statistical significance tests, and details on ground-truth label provenance for intent and age categories limit the strength of the sim-to-real transfer claim and the robustness of the numeric improvements.
- [Methods] Methods (tiered safety controller): The age-specific braking margins applied for children, adults, and seniors are not explicitly defined, calibrated, or sensitivity-tested against plausible age-inference error rates, leaving open whether the reported TTC and conflict gains would persist under realistic VLM misclassifications.
minor comments (2)
- The manuscript would benefit from a dedicated reproducibility section listing exact prompts, few-shot examples, and controller threshold values.
- Figure captions and axis labels in the results plots could be expanded to include baseline names and metric definitions for standalone readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate to strengthen the claims on age-adaptive control, real-world validation, and controller details.
read point-by-point responses
-
Referee: [Abstract and Results] The headline safety gains from demographic-adaptive control (60% and 54.5% further conflict reductions for children and seniors) rest on age categorization by GPT-OSS 20B, yet no age-classification accuracy, confusion matrix, or ablation of adaptive vs. uniform control under noisy age labels is reported. This is load-bearing for attributing the tiered-controller improvements to the claimed mechanism.
Authors: We agree this information is necessary to substantiate the demographic-adaptive gains. In the revised manuscript we will add the age classification accuracy achieved by GPT-OSS 20B (reported separately from intent accuracy), a three-class confusion matrix on both CARLA and PIE data, and an ablation comparing the tiered controller under VLM-derived age labels versus oracle age labels. This will directly quantify how inference noise affects the reported conflict reductions and TTC improvements. revision: yes
-
Referee: [Real-world validation] Real-world validation (24 PIE scenarios): The small sample size, absence of error bars, statistical significance tests, and details on ground-truth label provenance for intent and age categories limit the strength of the sim-to-real transfer claim and the robustness of the numeric improvements.
Authors: The 24 scenarios are the subset of PIE with complete intent and age annotations matching our evaluation protocol; ground-truth labels are taken directly from the original PIE dataset annotations, which we will state explicitly. We will add bootstrap-derived error bars and, where sample sizes permit, report p-values for the accuracy and safety metric differences. The small sample size will be acknowledged as a limitation in the revised text. revision: partial
-
Referee: [Methods] Methods (tiered safety controller): The age-specific braking margins applied for children, adults, and seniors are not explicitly defined, calibrated, or sensitivity-tested against plausible age-inference error rates, leaving open whether the reported TTC and conflict gains would persist under realistic VLM misclassifications.
Authors: The braking margins (2.5 m/s² for children, 1.8 m/s² for adults, 2.2 m/s² for seniors) are defined in Section 3.3 and were selected from age-dependent reaction-time literature. We will expand this section with the exact numerical values, the calibration rationale, and a new sensitivity analysis that injects age misclassification noise at the observed VLM error rate and recomputes TTC and conflict statistics. This will demonstrate robustness under realistic inference errors. revision: yes
Circularity Check
No circularity; empirical framework uses external VLMs and reports metrics against independent baselines
full rationale
The paper describes a modular system that applies pre-trained Qwen3-VL 8B and GPT-OSS 20B models (zero-shot or few-shot) for scene understanding and intent/age inference, then feeds outputs into a tiered rule-based controller with fixed age-specific margins. All quantitative claims (92.3% accuracy, conflict reduction 124→33, TTC improvement 1.92→4.47 s, etc.) are obtained by direct comparison to external baselines (rule-based, supervised trajectory models, zero-shot LLM) on CARLA and PIE datasets. No equations, parameter fits, or self-citations are load-bearing; the derivation chain consists of independent perception-reasoning-control stages whose outputs are measured against held-out or external references rather than being redefined in terms of themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pre-trained Qwen3-VL 8B and GPT-OSS 20B models can accurately classify pedestrian intent and age category from visual scene input in both simulation and real-world conditions
Reference graph
Works this paper leans on
-
[1]
gpt-oss-120b & gpt-oss-20b Model Card
Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei -Fei, L., Savarese, S., Year. Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961-971. Almeida, R., Pereira, F....
work page internal anchor Pith review arXiv
-
[2]
2025 IEEE Intelligent Vehicles Symposium (IV), 1899-1904
Pedestrian intention prediction via vision- language foundation models. 2025 IEEE Intelligent Vehicles Symposium (IV), 1899-1904. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C.,
2025
-
[3]
Qwen3-vl technical report. arXiv preprint arXiv:2511.21631. Barbierato, E., Gatti, A.,
work page internal anchor Pith review arXiv
-
[4]
The nvidia pilotnet experiments. ArXiv abs/2010.08776. Bouhsain, S.A., Saadatnejad, S., Alahi, A.,
-
[5]
Pedestrian intention prediction: A multi-task perspective. ArXiv abs/2010.10270. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O., Year. Nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the Proceedings of the IEEE/CVF conference on computer visio n and pattern recog...
-
[6]
Traffic Safety Facts
Pedestrians: 2023 data. Traffic Safety Facts . National Highway Traffic Safety Administration, Washington, DC. Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., Li, H.,
2023
-
[7]
arXiv preprint arXiv:2507.20850
Free energy- inspired cognitive risk integration for av navigation in pedestrian-rich environments. arXiv preprint arXiv:2507.20850. 37 37 Duim, E., Lebrão, M.L., Antunes, J.L.F.,
-
[8]
IEEE Transactions on Intelligent Vehicles, 1-33
Summary and reflections on pedestrian trajectory prediction in the field of autonomous driving. IEEE Transactions on Intelligent Vehicles, 1-33. Gadzicki, K., Khamsehashari, R., Zetzsche, C., Year. Early vs late fusion in multimodal convolutional neural networks. In: Proceedings of the 2020 IEEE 23rd international conference on information fusion (FUSION)...
2020
-
[9]
IEEE Access 12, 150395-150418
A comprehensive review of parallel autonomy systems within vehicles: Applications, architectures, safety considerations, and standards. IEEE Access 12, 150395-150418. Giuliari, F., Hasan, I., Cristani, M., Galasso, F., Year. Transformer networks for trajectory forecasting. In: Proceedings of the 2020 25th international conference on pattern recognition (I...
2020
-
[10]
arXiv preprint arXiv:2506.17590
Drama -x: A fine -grained intent prediction and risk reasoning benchmark for driving. arXiv preprint arXiv:2506.17590. Haghzare, S., Stasiulis, E., Delfi, G., Mohamud, H., Rapoport, M.J., Naglie, G., Mihailidis, A., Campos, J.L.,
-
[11]
Research on characteristics of hesitant driving behavior in urban expressway diversion areas
Li, X., Zhou, R., Zhang, L., Zou, J., Li, J., Year. Research on characteristics of hesitant driving behavior in urban expressway diversion areas. In: Proceedings of the 2024 6th International Conference on Internet of Things, Automation and Artificial Intelligence (IoTAAI), pp. 168-175. Lindsey, J.,
2024
-
[12]
Empowering safer socially sensitive autonomous vehicles using human- plausible cognitive encoding
Lu, H., Zhu, M., Lu, C., Feng, S., Wang, X., Wang, Y., Yang, H., 2025a. Empowering safer socially sensitive autonomous vehicles using human- plausible cognitive encoding. Proceedings of the National Academy of Sciences 122 (21), e2401626122. Lu, S., He, L., Li, S.E., Luo, Y., Wang, J., Li, K., Year. Hierarchical end- to-end autonomous driving: Integrating...
2025
-
[13]
Not all explicit cues help communicate: Pedestrians' perceptions, fixations, and decisions toward automated vehicles with varied appearance. ArXiv abs/2407.06505. Mirzabagheri, A., Ahmadi, M., Zhang, N., Alirezaee, R., Mozaffari, S., Alirezaee, S.,
-
[14]
Fairness in autonomous driving: Towards understanding confounding factors in object detection under challenging weather. ArXiv abs/2406.00219. Pitcairn, T., Edlmann, T.,
-
[15]
IEEE Transactions on Intelligent Vehicles 8 (1), 438-457
Pedestrian behavior in shared spaces with autonomous vehicles: An integrated framework and review. IEEE Transactions on Intelligent Vehicles 8 (1), 438-457. Pu, Q., Xie, K., Guo, H., Zhu, Y., 2025a. Modeling crash avoidance behaviors in vehicle -pedestrian near- miss scenarios: Curvilinear time- to-collision and mamba -driven deep reinforcement learning. ...
-
[16]
arXiv preprint arXiv:1802.02522
Joint attention in driver -pedestrian interaction: From theory to practice. arXiv preprint arXiv:1802.02522. Rasouli, A., Tsotsos, J.K.,
-
[17]
arXiv preprint arXiv:2511.07155
Dynamics -decoupled trajectory alignment for sim- to-real transfer in reinforcement learning for autonomous driving. arXiv preprint arXiv:2511.07155. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Year. Scalability in perception for autonomous driving: Waymo open dataset. In: Proceeding...
-
[18]
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
Drivevlm: The convergence of autonomous driving and large vision-language models. ArXiv abs/2402.12289. Tzigieras, A.,
work page internal anchor Pith review arXiv
-
[19]
University of Leeds
Cognitive mechanisms of behaviour estimation: Modelling pedestrian interpretation of approaching vehicle behaviour. University of Leeds. Vemula, A., Muelling, K., Oh, J., Year. Social attention: Modeling attention in human crowds. In: Proceedings of the 2018 IEEE international Conference on Robotics and Automation (ICRA), pp. 4601-4607. Wang, Y., Srinivas...
2018
-
[20]
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Argoverse 2: Next generation datasets for self -driving perception and forecasting. arXiv preprint arXiv:2301.00493. Wu, J., Ruenz, J., Berkemeyer, H., Dixon, L., Althoff, M.,
work page internal anchor Pith review arXiv
-
[21]
Proceedings of the National Academy of Sciences 120 (49), e2303869120
Childhood unpredictability and the development of exploration. Proceedings of the National Academy of Sciences 120 (49), e2303869120. 40 40 Yang, Y., Li, J., Yang, Y., Year. The research of the fast svm classifier method. In: Proceedings of the 2015 12th international computer conference on wavelet active media technology and information processing (ICCWA...
2015
-
[22]
arXiv preprint arXiv:2601.03225
Wait or cross? Understanding the influence of behavioral tendency, trust, and risk perception on pedestrian gap- acceptance of automated truck platoons. arXiv preprint arXiv:2601.03225. Yi, D., Jang, S., Yim, J., Year. Relationship between associated neuropsychological factors and fall risk factors in community-dwelling elderly. In: Proceedings of the Hea...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.