pith. sign in

arxiv: 2605.24562 · v1 · pith:GCBFQW7Fnew · submitted 2026-05-23 · 💻 cs.CV · cs.AI

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

Pith reviewed 2026-06-30 13:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords pedestrian intention predictiontrajectory forecastingvision-language modelsbenchmark datasetautonomous drivingquestion answeringexplainable prediction
0
0 comments X

The pith

Finetuning vision-language models on PedestrianQA improves intention classification, trajectory forecasting accuracy, and rationale quality without task-specific architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PedestrianQA as a large-scale video dataset that reformulates pedestrian intention and trajectory prediction as question-answering tasks with structured natural-language rationales. It demonstrates that finetuning state-of-the-art VLMs on this data yields measurable gains in classification accuracy, forecasting precision, and explanation quality when tested on the PIE, JAAD, TITAN, and IDD-PeD datasets. A sympathetic reader would care because the approach suggests VLMs can serve as a single, explainable model for safety-critical pedestrian behavior instead of requiring separate specialized predictors for each sub-task.

Core claim

The central claim is that richly annotated pedestrian sequences expressed in natural language from existing video datasets supply sufficient visual dynamics and contextual cues, so that finetuning VLMs on the PedestrianQA QA formulation significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, establishing VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

What carries the argument

PedestrianQA, a video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales drawn from annotated sequences in PIE, JAAD, TITAN, and IDD-PeD.

If this is right

  • VLMs can handle both intention classification and trajectory forecasting within one model instead of separate specialized architectures.
  • Generated rationales become more reliable and informative, supporting interpretability in autonomous driving decisions.
  • Performance improvements transfer across the four evaluated datasets without additional per-dataset engineering.
  • The QA framing enables direct use of existing VLM training pipelines for pedestrian behavior tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar QA reformulations could be applied to other multi-agent prediction problems such as vehicle trajectory forecasting.
  • High-quality rationales might support downstream uses like regulatory auditing or human-in-the-loop verification of autonomous systems.
  • If the dataset annotations contain systematic biases from the source videos, the learned rationales could propagate those biases into deployed models.

Load-bearing premise

Richly annotated pedestrian sequences from existing video datasets expressed in natural language supply sufficient unbiased visual dynamics and contextual cues for VLMs to learn accurate predictions and rationales without specialized per-task architectures.

What would settle it

A controlled experiment showing no statistically significant gains in intention classification accuracy or reduction in trajectory forecasting error after finetuning multiple VLMs on PedestrianQA versus strong baselines without the dataset would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.24562 by C. V. Jawahar, Naman Mishra, Shankar Gangisetty.

Figure 1
Figure 1. Figure 1: An illustration of an unstructured-traffic scenario where a pedestrian stands in the middle of the road in front of the ego￾vehicle, attempting to cross the road. Unlike prior approaches that provide only predictions, our method predicts the intention and trajectory and generates supporting rationales. which predicts whether a pedestrian intends to cross, and Pedestrian Trajectory Prediction (PTP) [6], [7]… view at source ↗
Figure 2
Figure 2. Figure 2: PedestrianQA Dataset PIP Sample. The top row shows the observation frames. The “Question” and “Answer” are followed by 5 types of rationales and a conclusion. spatial–temporal data, behavioral categories, environmental context, and ego-vehicle signals that can be reformulated in natural language. Integrating these textual representations with visual inputs can enable VLMs to model pedestrian behavior compr… view at source ↗
Figure 3
Figure 3. Figure 3: Data generation pipeline: We first aggregate all ground-truth, human-annotated annotations from the constituent datasets into a unified metadata schema. We use generated VLM captions to enrich motion semantics using carefully designed pedestrian-motion prompts that target fine-grained cues. These captions are validated for format and appended to the metadata. We then construct a single instruction package … view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative analysis between Qwen2.5-VL-7b and Ours (All Datasets) on the compositional datasets (clockwise: IDD-PeD [1], JAAD [2], TITAN [6], PIE [3]). We identify samples having correct predicate predictions for a fair comparison among rationales. F. Effect of Training on Unstructured Traffic Data Interestingly, our model trained only on IDD-PeD [1] maintains a comparable performance with our model train… view at source ↗
read the original abstract

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences, in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions without needing specialized architectures tailored for each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces PedestrianQA, a large-scale video-based dataset reformulating pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales, derived from PIE, JAAD, TITAN, and IDD-PeD. It claims that fine-tuning state-of-the-art vision-language models on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, positioning VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling without specialized per-task architectures.

Significance. If the empirical claims hold with rigorous quantitative support, this could represent a meaningful contribution by providing a new benchmark and showing that VLMs can deliver a flexible, unified approach to pedestrian prediction tasks with built-in explainability, potentially benefiting autonomous driving safety research.

major comments (1)
  1. [Abstract] Abstract: The central claim that finetuning VLMs on PedestrianQA 'significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales' is asserted without any quantitative results, baselines, metrics, or experimental details. This prevents assessment of the improvements and is load-bearing for the paper's main contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the comment on the abstract. We agree the central claims require quantitative support in the abstract itself to allow proper assessment.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that finetuning VLMs on PedestrianQA 'significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales' is asserted without any quantitative results, baselines, metrics, or experimental details. This prevents assessment of the improvements and is load-bearing for the paper's main contribution.

    Authors: We agree this is a valid observation. The abstract currently states the improvements without numbers or metrics. In the revised version we will add concise quantitative results (e.g., intention classification accuracy gains on PIE/JAAD, ADE/FDE reductions for trajectory prediction, and rationale quality scores) drawn from the experimental section so readers can evaluate the claims directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark on public datasets

full rationale

The paper introduces PedestrianQA as a new QA-formulated dataset derived from existing public video datasets (PIE, JAAD, TITAN, IDD-PeD) and reports empirical gains from finetuning VLMs. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations are present in the provided text. Claims rest on standard train/eval splits and external benchmarks rather than self-referential constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central contribution rests on creation of a new dataset and empirical claims about VLM fine-tuning; no free parameters are described, and the work depends on domain assumptions about annotation quality and VLM capabilities.

axioms (2)
  • domain assumption Existing video datasets provide sufficient visual dynamics and interactions when augmented with natural language annotations
    Invoked when constructing PedestrianQA from PIE, JAAD, TITAN, and IDD-PeD.
  • domain assumption VLMs can learn pedestrian intention and trajectory from QA format without task-specific architectures
    Central to the unified framework claim in the abstract.
invented entities (1)
  • PedestrianQA dataset no independent evidence
    purpose: Large-scale video benchmark expressing pedestrian sequences as QA tasks with rationales
    Newly introduced contribution; no independent evidence outside this work.

pith-pipeline@v0.9.1-grok · 5719 in / 1431 out tokens · 38525 ms · 2026-06-30T13:22:06.440224+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 10 canonical work pages · 6 internal anchors

  1. [1]

    Pedestrian intention and trajectory prediction in unstructured traffic using idd-ped,

    R. Bokkasam, S. Gangisetty, A. H. A. Hafez, and C. V . Jawahar, “Pedestrian intention and trajectory prediction in unstructured traffic using idd-ped,” inICRA, 2025

  2. [2]

    Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior,

    A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior,” inICCVW, 2017

  3. [3]

    Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,

    A. Rasouli, I. Kotseruba, T. Kunic, and J. K. Tsotsos, “Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,” inICCV, 2019

  4. [4]

    Can reasons help improve pedestrian intent estimation? a cross-modal approach,

    V . Khindkar, V . Balasubramanian, C. Arora, A. Subramanian, and C. Jawahar, “Can reasons help improve pedestrian intent estimation? a cross-modal approach,” inIROS, 2024

  5. [5]

    Pedestrian vision language model for intentions prediction,

    F. Munir, S. Azam, T. Mihaylova, V . Kyrki, and T. P. Kucner, “Pedestrian vision language model for intentions prediction,”IEEE Open Journal of Intelligent Transportation Systems, 2025

  6. [6]

    Titan: Future forecast using action priors,

    S. Malla, B. Dariush, and C. Choi, “Titan: Future forecast using action priors,” inCVPR, 2020

  7. [7]

    Lg-traj: Llm guided pedestrian trajectory prediction,

    P. S. Chib and P. Singh, “Lg-traj: Llm guided pedestrian trajectory prediction,”arXiv preprint arXiv:2403.08032, 2024

  8. [8]

    Instructblip: Towards general-purpose vision-language models with instruction tuning,

    W. Daiet al., “Instructblip: Towards general-purpose vision-language models with instruction tuning,” inNeurIPS, 2023

  9. [9]

    Qwen2.5-VL Technical Report

    S. Baiet al., “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  10. [10]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    J. Zhuet al., “Internvl3: Exploring advanced training and test- time recipes for open-source multimodal models,”arXiv preprint arXiv:2504.10479, 2025

  11. [11]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” 2023

  12. [12]

    Gemma 3 Technical Report

    G. Team, “Gemma 3 technical report,”arXiv preprint arXiv:2503.19786, 2025

  13. [13]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale,

    T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer, “GPT3.int8(): 8-bit matrix multiplication for transformers at scale,” inNeurIPS, 2022

  14. [14]

    Flashattention: Fast and memory-efficient exact attention with io-awareness,

    T. Dao, D. Y . Fu, S. Ermon, A. Rudra, and C. R´e, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” inNeurIPS, 2022

  15. [15]

    Vision-language models for edge networks: A comprehensive survey,

    A. Sharshar, L. U. Khan, W. Ullah, and M. Guizani, “Vision-language models for edge networks: A comprehensive survey,”IEEE Internet of Things Journal, 2025

  16. [16]

    Vision language models in autonomous driving: A survey and outlook,

    X. Zhouet al., “Vision language models in autonomous driving: A survey and outlook,”IEEE Transactions on Intelligent Vehicles, 2024

  17. [17]

    AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

    B. Jiang, S. Chen, Q. Zhang, W. Liu, and X. Wang, “Alphadrive: Un- leashing the power of vlms in autonomous driving via reinforcement learning and reasoning,”arXiv preprint arXiv:2503.07608, 2025

  18. [18]

    Pip-net: Pedestrian intention prediction in the wild,

    M. Azarmi, M. Rezaei, and H. Wang, “Pip-net: Pedestrian intention prediction in the wild,”IEEE Transactions on Intelligent Transporta- tion Systems, 2025

  19. [19]

    Spatiotemporal relationship reasoning for pedestrian intent prediction,

    B. Liu, E. Adeli, Z. Cao, K.-H. Lee, A. Shenoi, A. Gaidon, and J. C. Niebles, “Spatiotemporal relationship reasoning for pedestrian intent prediction,”IEEE RA-L, 2020

  20. [20]

    Pedx: Benchmark dataset for metric 3-d pose esti- mation of pedestrians in complex urban intersections,

    W. Kimet al., “Pedx: Benchmark dataset for metric 3-d pose esti- mation of pedestrians in complex urban intersections,”IEEE RA-L, 2019

  21. [21]

    Pedestrian stop and go forecasting with hybrid feature fusion,

    D. Guo, T. Mordan, and A. Alahi, “Pedestrian stop and go forecasting with hybrid feature fusion,” inICRA, 2022

  22. [22]

    Drivelm: Driving with graph visual question answer- ing,

    C. Simaet al., “Drivelm: Driving with graph visual question answer- ing,” inECCV, 2023

  23. [23]

    DriveVLM: The convergence of autonomous driving and large vision-language models,

    X. Tianet al., “DriveVLM: The convergence of autonomous driving and large vision-language models,” inCoRL, 2024

  24. [24]

    Abductive ego-view accident video understanding for safe driving perception,

    J. Fanget al., “Abductive ego-view accident video understanding for safe driving perception,” inCVPR, 2024

  25. [25]

    Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives,

    C. Parikh, D. Rawat, R. R. T., T. Ghosh, and R. K. Sarvadevabhatla, “Roadsocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives,” inCVPR, 2025

  26. [26]

    Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning,

    E. Sachdevaet al., “Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning,” inWACV, 2024

  27. [27]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,

    T. Qian, J. Chen, L. Zhuo, Y . Jiao, and Y .-G. Jiang, “Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario,”AAAI, 2024

  28. [28]

    Lingoqa: Visual question answering for autonomous driving,

    A. Marcuet al., “Lingoqa: Visual question answering for autonomous driving,” inECCV, 2024

  29. [29]

    Drama: Joint risk localization and captioning in driving,

    S. Malla, C. Choi, I. Dwivedi, J. H. Choi, and J. Li, “Drama: Joint risk localization and captioning in driving,” inWACV, 2023

  30. [30]

    Language prompt for autonomous driving,

    D. Wu, W. Han, Y . Liu, T. Wang, C. zhong Xu, X. Zhang, and J. Shen, “Language prompt for autonomous driving,” inAAAI, 2025

  31. [31]

    System card: Claude opus 4 & claude sonnet 4,

    Anthropic, “System card: Claude opus 4 & claude sonnet 4,” 2025. [Online]. Available: www.anthropic.com/claude/sonnet

  32. [32]

    Benchmark for evaluating pedestrian action prediction,

    I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Benchmark for evaluating pedestrian action prediction,” inWACV, 2021

  33. [33]

    Pre- dicting pedestrian crossing intention with feature fusion and spatio- temporal attention,

    D. Yang, H. Zhang, E. Yurtsever, K. Redmill, and U. Ozguner, “Pre- dicting pedestrian crossing intention with feature fusion and spatio- temporal attention,”IEEE Transactions on Intelligent Vehicles, 2022

  34. [34]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radfordet al., “Learning transferable visual models from natural language supervision,”arXiv preprint arXiv:2103.00020, 2021

  35. [35]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffelet al., “Exploring the limits of transfer learning with a unified text-to-text transformer,”JMLR, 2020

  36. [36]

    Predicting pedestrian inten- tions with multimodal intentformer: A co-learning approach,

    N. Sharma, C. Dhiman, and S. Indu, “Predicting pedestrian inten- tions with multimodal intentformer: A co-learning approach,”Pattern Recogn., 2025

  37. [37]

    Optimizing vision-language model for road crossing intention estimation,

    R. Uziel and O. Bialer, “Optimizing vision-language model for road crossing intention estimation,” inWACV, 2025

  38. [38]

    Gpt-4v takes the wheel: Promises and challenges for pedestrian behavior prediction,

    J. Huang, P. Jiang, A. Gautam, and S. Saripalli, “Gpt-4v takes the wheel: Promises and challenges for pedestrian behavior prediction,” AAAI, 2024

  39. [39]

    Pedestrian intention pre- diction via vision-language foundation models,

    M. Azarmi, M. Rezaei, and H. Wang, “Pedestrian intention pre- diction via vision-language foundation models,”arXiv preprint arXiv:2507.04141, 2025

  40. [40]

    BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” inICML, 2023

  41. [41]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,” inICLR, 2021

  42. [42]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W. Chianget al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/

  43. [43]

    Llava-next: A strong zero-shot video understanding model,

    Y . Zhanget al., “Llava-next: A strong zero-shot video understanding model,” 2024. [Online]. Available: https://llava-vl.github.io/blog/ 2024-04-30-llava-next-video/

  44. [44]

    Kwai keye-vl technical report,

    K. K. Team, “Kwai keye-vl technical report,”arXiv preprint arXiv:2507.01949, 2025

  45. [45]

    Roformer: Enhanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “Roformer: Enhanced transformer with rotary position embedding,”Neurocomput., 2024

  46. [46]

    Perception, reason, think, and plan: A survey on large multimodal reasoning models,

    Y . Liet al., “Perception, reason, think, and plan: A survey on large multimodal reasoning models,”arXiv preprint arXiv:2505.04921, 2025

  47. [47]

    Dolphins: Multimodal language model for driving,

    Y . Ma, Y . Cao, J. Sun, M. Pavone, and C. Xiao, “Dolphins: Multimodal language model for driving,” inECCV, 2024

  48. [48]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    A. Awadallaet al., “Openflamingo: An open-source framework for training large autoregressive vision-language models,”arXiv preprint arXiv:2308.01390, 2023

  49. [49]

    OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning,

    S. Wanget al., “OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning,” inCVPR, 2025

  50. [50]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  51. [51]

    CLAIR: Evaluating image captions with large language models,

    D. Chan, S. Petryk, J. Gonzalez, T. Darrell, and J. Canny, “CLAIR: Evaluating image captions with large language models,” in”EMNLP”, 2023