arxiv: 2604.23934 · v1 · submitted 2026-04-27 · 📡 eess.SY · cs.SY

Recognition: unknown

VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions

Qingwen Pu , Kun Xie , Yuxiang Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-08 02:13 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords vision-language modelsautonomous drivingpedestrian intentvehicle-pedestrian interactiondemographic adaptationsafety controllerCARLA simulationmultimodal reasoning

0 comments

The pith

Vision-language reasoning enables autonomous vehicles to classify pedestrian intent with 92% accuracy and cut conflicts by over 70% while shortening traversal times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLM-VPI to address limitations in current autonomous driving systems that rely only on geometric and kinematic cues for pedestrian interactions. It establishes that combining vision-language models for intent and age inference with a tiered safety controller leads to higher classification accuracy, fewer false alarms, and better safety and efficiency metrics in simulations and real-world tests. A sympathetic reader would care because this approach reduces the risk of delayed interventions in safety-critical situations and avoids unnecessary braking in normal interactions. The results demonstrate functional transfer from simulation to real scenarios with demographic adaptations benefiting vulnerable groups like children and seniors.

Core claim

VLM-VPI is a multimodal framework with a perception layer for visual and kinematic data, a reasoning layer using Qwen3-VL 8B for scene understanding and GPT-OSS 20B for few-shot intent reasoning, and a tiered controller applying age-specific braking margins. Tested in 112 CARLA scenarios, it reaches 92.3% intent classification accuracy compared to 78.4% for rule-based baselines, reduces false-alarm rate to 2.8%, mean traversal time to 11.8 seconds, conflicts to 33 from 124, and improves minimum time-to-collision to 4.47 seconds from 1.92 seconds. On 24 real-world PIE scenarios it achieves 87.5% accuracy, and demographic-adaptive control further reduces conflicts by 60% for children and 54.5%

What carries the argument

The VLM-VPI reasoning layer that uses vision-language models to infer pedestrian intent and age category from visual input, feeding into an age-specific tiered safety controller for vehicle control decisions.

Load-bearing premise

The off-the-shelf vision-language models can reliably infer both pedestrian intent and age from visual inputs across different lighting, poses, and real-world conditions, with correctly calibrated age-specific braking margins.

What would settle it

Running the system on a new set of real-world videos where the models frequently misclassify pedestrian intent or age, resulting in either more conflicts or higher false alarms than the baseline methods.

read the original abstract

Autonomous driving systems often infer pedestrian yielding behavior from geometric and kinematic cues alone, limiting their ability to reason about visual scene context and age-dependent behavioral variability. This limitation can produce delayed interventions in safety-critical encounters and unnecessary braking in benign interactions. This work introduces Vision-Language Model-based Vehicle-Pedestrian Interaction (VLM-VPI), a multimodal reasoning framework for pedestrian intent understanding and yielding-aware control in autonomous driving. The system combines three components: a multimodal perception layer that captures visual and kinematic observations, a reasoning layer that uses Qwen3-VL 8B for visual scene understanding and GPT-OSS 20B for few-shot intent reasoning, and a tiered safety controller that applies age-specific braking margins for children, adults, and seniors. In 112 CARLA scenarios, VLM-VPI achieves 92.3% intent classification accuracy, outperforming a rule-based baseline (78.4%), supervised trajectory models (73.5-82.4%), and a zero-shot LLM configuration (88.4%). Validation on 24 real-world PIE scenarios yields 87.5% accuracy, indicating functional sim-to-real transferability. Across 200 simulation cases, VLM-VPI reduces the false-alarm rate from 7.4% to 2.8% and mean intersection traversal time from 13.5 s to 11.8 s. Conflict occurrences decrease from 124 to 33, while mean minimum time-to-collision improves from 1.92 s to 4.47 s. Demographic-adaptive control further reduces conflicts by 60% for children and 54.5% for seniors compared with uniform control. These results show that an explicit vision-language reasoning layer can improve both safety and efficiency by linking pedestrian intent, demographic context, and vehicle control decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VLM-VPI shows a workable integration of off-the-shelf models for joint intent and age reasoning in AV-pedestrian control with clear sim gains, but the age-adaptive claims rest on unvalidated classification accuracy.

read the letter

The main point is that this paper wires Qwen3-VL and GPT-OSS into a pipeline that infers both pedestrian intent and age category from images, then hands the result to a tiered braking controller. In 112 CARLA scenarios it beats the listed baselines on intent accuracy, false-alarm rate, traversal time, conflict count, and minimum time-to-collision. The 24-scenario PIE check gives a basic indication that the accuracy holds up outside simulation. That combination of multimodal reasoning plus explicit demographic margins is the concrete new piece here, and the reported deltas are specific enough to be usable for comparison. The architecture itself is straightforward and addresses a real limitation in most current AV systems that treat every pedestrian the same. The end-to-end flow from perception to control decision is easy to follow and the numbers on reduced conflicts and improved TTC are the strongest part of the evidence. The soft spots sit mainly around the age component. The abstract reports only aggregate intent accuracy and gives no separate age-classification accuracy, confusion matrix, or ablation that isolates the effect of the adaptive margins versus uniform control. If the few-shot age predictions are off by more than a modest amount under varied lighting, poses, or actual ages, the extra 60 % and 54.5 % conflict reductions claimed for children and seniors cannot be attributed to the mechanism. The real-world set is small, no error bars or statistical tests appear, and the exact controller thresholds are not detailed. These gaps are fixable but keep the demographic claims provisional. This is for researchers working on multimodal safety layers for autonomous vehicles. Someone who needs a working example of plugging VLMs into a control loop will find the structure and the simulation experiments useful. It is not yet ready for strong deployment conclusions. I would send it to peer review. The idea is worth testing properly and referees can require the missing ablations and larger real-world checks.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces VLM-VPI, a multimodal framework that uses Qwen3-VL 8B for visual scene understanding and GPT-OSS 20B for few-shot intent and age reasoning, feeding into a tiered safety controller with age-specific braking margins. It reports 92.3% intent classification accuracy on 112 CARLA scenarios (outperforming rule-based, supervised trajectory, and zero-shot LLM baselines), 87.5% on 24 PIE real-world scenarios, plus aggregate gains in false-alarm rate, traversal time, conflicts (124 to 33), and minimum TTC (1.92 s to 4.47 s), with further conflict reductions of 60% for children and 54.5% for seniors under demographic-adaptive control.

Significance. If the results hold, the work provides concrete evidence that explicit vision-language reasoning can link pedestrian intent, demographic context, and control decisions to improve both safety and efficiency over purely geometric/kinematic baselines. The evaluation against independent baselines (rule-based, supervised models, zero-shot LLM) and external datasets (CARLA, PIE) strengthens the claims and supports reproducibility. The demographic-adaptive component is a distinctive contribution, though its impact requires tighter validation.

major comments (3)

[Abstract and Results] Abstract and Results: The headline safety gains from demographic-adaptive control (60% and 54.5% further conflict reductions for children and seniors) rest on age categorization by GPT-OSS 20B, yet no age-classification accuracy, confusion matrix, or ablation of adaptive vs. uniform control under noisy age labels is reported. This is load-bearing for attributing the tiered-controller improvements to the claimed mechanism.
[Real-world validation] Real-world validation (24 PIE scenarios): The small sample size, absence of error bars, statistical significance tests, and details on ground-truth label provenance for intent and age categories limit the strength of the sim-to-real transfer claim and the robustness of the numeric improvements.
[Methods] Methods (tiered safety controller): The age-specific braking margins applied for children, adults, and seniors are not explicitly defined, calibrated, or sensitivity-tested against plausible age-inference error rates, leaving open whether the reported TTC and conflict gains would persist under realistic VLM misclassifications.

minor comments (2)

The manuscript would benefit from a dedicated reproducibility section listing exact prompts, few-shot examples, and controller threshold values.
Figure captions and axis labels in the results plots could be expanded to include baseline names and metric definitions for standalone readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate to strengthen the claims on age-adaptive control, real-world validation, and controller details.

read point-by-point responses

Referee: [Abstract and Results] The headline safety gains from demographic-adaptive control (60% and 54.5% further conflict reductions for children and seniors) rest on age categorization by GPT-OSS 20B, yet no age-classification accuracy, confusion matrix, or ablation of adaptive vs. uniform control under noisy age labels is reported. This is load-bearing for attributing the tiered-controller improvements to the claimed mechanism.

Authors: We agree this information is necessary to substantiate the demographic-adaptive gains. In the revised manuscript we will add the age classification accuracy achieved by GPT-OSS 20B (reported separately from intent accuracy), a three-class confusion matrix on both CARLA and PIE data, and an ablation comparing the tiered controller under VLM-derived age labels versus oracle age labels. This will directly quantify how inference noise affects the reported conflict reductions and TTC improvements. revision: yes
Referee: [Real-world validation] Real-world validation (24 PIE scenarios): The small sample size, absence of error bars, statistical significance tests, and details on ground-truth label provenance for intent and age categories limit the strength of the sim-to-real transfer claim and the robustness of the numeric improvements.

Authors: The 24 scenarios are the subset of PIE with complete intent and age annotations matching our evaluation protocol; ground-truth labels are taken directly from the original PIE dataset annotations, which we will state explicitly. We will add bootstrap-derived error bars and, where sample sizes permit, report p-values for the accuracy and safety metric differences. The small sample size will be acknowledged as a limitation in the revised text. revision: partial
Referee: [Methods] Methods (tiered safety controller): The age-specific braking margins applied for children, adults, and seniors are not explicitly defined, calibrated, or sensitivity-tested against plausible age-inference error rates, leaving open whether the reported TTC and conflict gains would persist under realistic VLM misclassifications.

Authors: The braking margins (2.5 m/s² for children, 1.8 m/s² for adults, 2.2 m/s² for seniors) are defined in Section 3.3 and were selected from age-dependent reaction-time literature. We will expand this section with the exact numerical values, the calibration rationale, and a new sensitivity analysis that injects age misclassification noise at the observed VLM error rate and recomputes TTC and conflict statistics. This will demonstrate robustness under realistic inference errors. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework uses external VLMs and reports metrics against independent baselines

full rationale

The paper describes a modular system that applies pre-trained Qwen3-VL 8B and GPT-OSS 20B models (zero-shot or few-shot) for scene understanding and intent/age inference, then feeds outputs into a tiered rule-based controller with fixed age-specific margins. All quantitative claims (92.3% accuracy, conflict reduction 124→33, TTC improvement 1.92→4.47 s, etc.) are obtained by direct comparison to external baselines (rule-based, supervised trajectory models, zero-shot LLM) on CARLA and PIE datasets. No equations, parameter fits, or self-citations are load-bearing; the derivation chain consists of independent perception-reasoning-control stages whose outputs are measured against held-out or external references rather than being redefined in terms of themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current large vision-language models can perform reliable zero/few-shot pedestrian intent and age inference from images without domain-specific fine-tuning or additional training data.

axioms (1)

domain assumption Pre-trained Qwen3-VL 8B and GPT-OSS 20B models can accurately classify pedestrian intent and age category from visual scene input in both simulation and real-world conditions
The framework description and performance numbers depend directly on this capability; no fine-tuning or calibration details are provided.

pith-pipeline@v0.9.0 · 5640 in / 1509 out tokens · 55143 ms · 2026-05-08T02:13:49.865613+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 14 canonical work pages · 4 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei -Fei, L., Savarese, S., Year. Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 961-971. Almeida, R., Pereira, F....

work page internal anchor Pith review arXiv
[2]

2025 IEEE Intelligent Vehicles Symposium (IV), 1899-1904

Pedestrian intention prediction via vision- language foundation models. 2025 IEEE Intelligent Vehicles Symposium (IV), 1899-1904. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C.,

2025
[3]

Qwen3-VL Technical Report

Qwen3-vl technical report. arXiv preprint arXiv:2511.21631. Barbierato, E., Gatti, A.,

work page internal anchor Pith review arXiv
[4]

ArXiv abs/2010.08776

The nvidia pilotnet experiments. ArXiv abs/2010.08776. Bouhsain, S.A., Saadatnejad, S., Alahi, A.,

work page arXiv 2010
[5]

ArXiv abs/2010.10270

Pedestrian intention prediction: A multi-task perspective. ArXiv abs/2010.10270. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O., Year. Nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the Proceedings of the IEEE/CVF conference on computer visio n and pattern recog...

work page arXiv 2010
[6]

Traffic Safety Facts

Pedestrians: 2023 data. Traffic Safety Facts . National Highway Traffic Safety Administration, Washington, DC. Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., Li, H.,

2023
[7]

arXiv preprint arXiv:2507.20850

Free energy- inspired cognitive risk integration for av navigation in pedestrian-rich environments. arXiv preprint arXiv:2507.20850. 37 37 Duim, E., Lebrão, M.L., Antunes, J.L.F.,

work page arXiv
[8]

IEEE Transactions on Intelligent Vehicles, 1-33

Summary and reflections on pedestrian trajectory prediction in the field of autonomous driving. IEEE Transactions on Intelligent Vehicles, 1-33. Gadzicki, K., Khamsehashari, R., Zetzsche, C., Year. Early vs late fusion in multimodal convolutional neural networks. In: Proceedings of the 2020 IEEE 23rd international conference on information fusion (FUSION)...

2020
[9]

IEEE Access 12, 150395-150418

A comprehensive review of parallel autonomy systems within vehicles: Applications, architectures, safety considerations, and standards. IEEE Access 12, 150395-150418. Giuliari, F., Hasan, I., Cristani, M., Galasso, F., Year. Transformer networks for trajectory forecasting. In: Proceedings of the 2020 25th international conference on pattern recognition (I...

2020
[10]

arXiv preprint arXiv:2506.17590

Drama -x: A fine -grained intent prediction and risk reasoning benchmark for driving. arXiv preprint arXiv:2506.17590. Haghzare, S., Stasiulis, E., Delfi, G., Mohamud, H., Rapoport, M.J., Naglie, G., Mihailidis, A., Campos, J.L.,

work page arXiv
[11]

Research on characteristics of hesitant driving behavior in urban expressway diversion areas

Li, X., Zhou, R., Zhang, L., Zou, J., Li, J., Year. Research on characteristics of hesitant driving behavior in urban expressway diversion areas. In: Proceedings of the 2024 6th International Conference on Internet of Things, Automation and Artificial Intelligence (IoTAAI), pp. 168-175. Lindsey, J.,

2024
[12]

Empowering safer socially sensitive autonomous vehicles using human- plausible cognitive encoding

Lu, H., Zhu, M., Lu, C., Feng, S., Wang, X., Wang, Y., Yang, H., 2025a. Empowering safer socially sensitive autonomous vehicles using human- plausible cognitive encoding. Proceedings of the National Academy of Sciences 122 (21), e2401626122. Lu, S., He, L., Li, S.E., Luo, Y., Wang, J., Li, K., Year. Hierarchical end- to-end autonomous driving: Integrating...

2025
[13]

ArXiv abs/2407.06505

Not all explicit cues help communicate: Pedestrians' perceptions, fixations, and decisions toward automated vehicles with varied appearance. ArXiv abs/2407.06505. Mirzabagheri, A., Ahmadi, M., Zhang, N., Alirezaee, R., Mozaffari, S., Alirezaee, S.,

work page arXiv
[14]

ArXiv abs/2406.00219

Fairness in autonomous driving: Towards understanding confounding factors in object detection under challenging weather. ArXiv abs/2406.00219. Pitcairn, T., Edlmann, T.,

work page arXiv
[15]

IEEE Transactions on Intelligent Vehicles 8 (1), 438-457

Pedestrian behavior in shared spaces with autonomous vehicles: An integrated framework and review. IEEE Transactions on Intelligent Vehicles 8 (1), 438-457. Pu, Q., Xie, K., Guo, H., Zhu, Y., 2025a. Modeling crash avoidance behaviors in vehicle -pedestrian near- miss scenarios: Curvilinear time- to-collision and mamba -driven deep reinforcement learning. ...

work page arXiv
[16]

arXiv preprint arXiv:1802.02522

Joint attention in driver -pedestrian interaction: From theory to practice. arXiv preprint arXiv:1802.02522. Rasouli, A., Tsotsos, J.K.,

work page arXiv
[17]

arXiv preprint arXiv:2511.07155

Dynamics -decoupled trajectory alignment for sim- to-real transfer in reinforcement learning for autonomous driving. arXiv preprint arXiv:2511.07155. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Year. Scalability in perception for autonomous driving: Waymo open dataset. In: Proceeding...

work page arXiv
[18]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Drivevlm: The convergence of autonomous driving and large vision-language models. ArXiv abs/2402.12289. Tzigieras, A.,

work page internal anchor Pith review arXiv
[19]

University of Leeds

Cognitive mechanisms of behaviour estimation: Modelling pedestrian interpretation of approaching vehicle behaviour. University of Leeds. Vemula, A., Muelling, K., Oh, J., Year. Social attention: Modeling attention in human crowds. In: Proceedings of the 2018 IEEE international Conference on Robotics and Automation (ICRA), pp. 4601-4607. Wang, Y., Srinivas...

2018
[20]

Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

Argoverse 2: Next generation datasets for self -driving perception and forecasting. arXiv preprint arXiv:2301.00493. Wu, J., Ruenz, J., Berkemeyer, H., Dixon, L., Althoff, M.,

work page internal anchor Pith review arXiv
[21]

Proceedings of the National Academy of Sciences 120 (49), e2303869120

Childhood unpredictability and the development of exploration. Proceedings of the National Academy of Sciences 120 (49), e2303869120. 40 40 Yang, Y., Li, J., Yang, Y., Year. The research of the fast svm classifier method. In: Proceedings of the 2015 12th international computer conference on wavelet active media technology and information processing (ICCWA...

2015
[22]

arXiv preprint arXiv:2601.03225

Wait or cross? Understanding the influence of behavioral tendency, trust, and risk perception on pedestrian gap- acceptance of automated truck platoons. arXiv preprint arXiv:2601.03225. Yi, D., Jang, S., Yim, J., Year. Relationship between associated neuropsychological factors and fall risk factors in community-dwelling elderly. In: Proceedings of the Hea...

work page arXiv