pith. machine review for the scientific record. sign in

arxiv: 2604.23728 · v1 · submitted 2026-04-26 · 💻 cs.CV · cs.AI

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

Pith reviewed 2026-05-08 06:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords pedestrian intention predictionconditional random fieldenergy-based modelspatiotemporal graphsimulated annealingstructured predictionautonomous driving
0
0 comments X

The pith

ESIA casts pedestrian intention prediction as energy minimization over a spatiotemporal graph to enforce scene-level consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes modeling the scene as a graph with pedestrians and environmental elements as nodes. Unary potentials score individual crossing intentions while pairwise potentials capture social and spatial interactions between nodes. These are combined into one global energy function whose minimum gives the most consistent set of predictions across all agents at once. Structural consistency terms further penalize illogical combinations such as contradictory crossing decisions within a group. The resulting framework is solved by a seeded annealing procedure that starts from strong unary estimates.

Core claim

By treating intention prediction as structured prediction in a CRF over a unified graph of spatiotemporal nodes, assigning unary potentials to capture individual intentions and pairwise potentials to encode social and environmental interactions, and augmenting the energy function with structural consistency terms, the approach produces predictions that maintain global logical coherence and can be optimized efficiently by the Unary-Seeded Simulated Annealing algorithm.

What carries the argument

The unified global energy function combining unary node potentials for individual intentions, pairwise edge potentials for interactions, and structural consistency penalties, minimized via Unary-Seeded Simulated Annealing that seeds the search with high-confidence unary values.

If this is right

  • Behavioral predictions remain logically consistent across an entire scene rather than being generated independently for each pedestrian.
  • The explicit energy terms make the contribution of individual intentions versus group interactions directly inspectable.
  • The seeded annealing procedure reaches high-quality solutions faster than standard optimization because it begins from reliable unary estimates.
  • The same graph and energy formulation can incorporate new environmental factors by simply adding or reweighting the corresponding potentials.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same energy-based graph structure could be reused for other multi-agent forecasting problems such as vehicle trajectory prediction by redefining the node and edge potentials accordingly.
  • If structural consistency penalties successfully suppress contradictory outputs, the method may reduce safety-critical errors in dense crowd scenarios where pedestrians move in coordinated groups.
  • Because the model separates unary priors from interaction terms, it could support incremental updates when new sensor data arrives without recomputing the entire scene energy.

Load-bearing premise

That the combination of unary potentials, pairwise interaction terms, and structural consistency penalties, when minimized by U-SSA, will produce both higher accuracy and clearer reasoning than earlier methods on real traffic data.

What would settle it

If, on the standard pedestrian intention benchmarks using identical data splits and evaluation protocols, ESIA fails to exceed the accuracy of prior methods or its energy terms do not yield human-readable explanations for the chosen predictions, the central performance and interpretability claims would be falsified.

Figures

Figures reproduced from arXiv: 2604.23728 by Chongfeng Wei, Edmond S. L. Ho, Lin Wu, Meiting Dang, Yanping Wu, Zhenghua Chen.

Figure 1
Figure 1. Figure 1: Illustration of paradigms for pedestrian intention prediction. (a) view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our ESIA. Unlike previous methods ( view at source ↗
Figure 3
Figure 3. Figure 3: Architecture of the Feature Extraction Modules. (a) The view at source ↗
Figure 4
Figure 4. Figure 4: Correct qualitative results on JAAD and PIE. Scenarios span: (a) single pedestrian (PIE); (b,d) two pedestrians on the same side (JAAD/PIE); (c) five view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative failure cases on JAAD and PIE. Typical error sources include (a) heavy occlusion (JAAD), (b) adverse lighting conditions (JAAD), and view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of the U-SSA optimization process across scenarios with varying crowd densities. The top panels display the original scenes with GT. view at source ↗
Figure 7
Figure 7. Figure 7: Parameter sensitivity analysis of ESIA with respect to (a) node coefficient view at source ↗
read the original abstract

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ESIA, a CRF-based energy minimization framework for pedestrian intention prediction. Pedestrians and environmental elements are modeled as nodes in a spatiotemporal graph; unary potentials capture individual intentions, pairwise potentials encode social/environmental interactions, and structural consistency terms penalize logical contradictions in the global energy function. Inference is performed via the proposed Unary-Seeded Simulated Annealing (U-SSA) algorithm, which seeds from unary priors. The central claim is that this yields state-of-the-art accuracy together with improved interpretability on standard benchmarks.

Significance. If the empirical claims hold and the optimization reliably enforces scene-level consistency, the work would offer a principled, interpretable alternative to opaque neural predictors in autonomous driving. Explicit energy-based modeling with structural penalties could improve robustness in multi-agent settings and provide diagnostic value through the learned potentials.

major comments (3)
  1. [§3.3] §3.3 (U-SSA description): The claim that seeding with unary priors plus annealing produces high-quality, globally consistent solutions lacks any convergence analysis, schedule details, or empirical verification that the procedure escapes local minima on graphs with dense, conflicting pairwise terms. This directly bears on the central assertion that structural consistency is enforced and that the method outperforms prior approaches.
  2. [Experiments / Abstract] Experiments section and abstract: The SOTA performance claim rests on an unelaborated statement of 'extensive experiments' with no reported metrics, baseline tables, ablation results on the contribution of pairwise or consistency terms, or failure-case analysis. Without these data the performance and interpretability advantages cannot be assessed.
  3. [§3.2] §3.2 (energy function and structural terms): The structural consistency penalties are introduced as independently motivated constraints that require no per-interaction ground truth, yet no explicit equations show how they are added to E or how their weights are chosen; this leaves open whether the terms are load-bearing or merely decorative.
minor comments (2)
  1. [§3] Notation for the global energy E and its components should be introduced once with a single equation block rather than piecemeal across subsections.
  2. [Abstract] The abstract would be strengthened by including one or two concrete performance numbers (e.g., accuracy or F1 deltas versus the strongest baseline).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We will revise the paper to strengthen the description of the U-SSA inference procedure, expand the experimental reporting, and clarify the formulation of the structural consistency terms.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (U-SSA description): The claim that seeding with unary priors plus annealing produces high-quality, globally consistent solutions lacks any convergence analysis, schedule details, or empirical verification that the procedure escapes local minima on graphs with dense, conflicting pairwise terms. This directly bears on the central assertion that structural consistency is enforced and that the method outperforms prior approaches.

    Authors: We appreciate the referee pointing out the need for stronger justification of U-SSA. The algorithm uses unary potentials to seed the initial state and then applies simulated annealing to minimize the global energy; this design is intended to bias the search toward high-quality regions before exploring for consistency. We acknowledge that the current §3.3 provides only a high-level description without a formal convergence argument or explicit schedule. In the revision we will add the precise annealing schedule (initial temperature, cooling rate, and iteration counts), a short discussion of how unary seeding reduces the effective search space for our graph densities, and new ablation tables comparing final energy values and prediction accuracy of U-SSA against standard SA and greedy baselines on the same graphs. revision: yes

  2. Referee: [Experiments / Abstract] Experiments section and abstract: The SOTA performance claim rests on an unelaborated statement of 'extensive experiments' with no reported metrics, baseline tables, ablation results on the contribution of pairwise or consistency terms, or failure-case analysis. Without these data the performance and interpretability advantages cannot be assessed.

    Authors: We apologize that the experimental presentation was insufficiently detailed. The manuscript does contain quantitative results on JAAD and PIE, but we agree they are not presented with the clarity required to evaluate the claims. In the revised version we will (i) insert full comparison tables with accuracy, F1, and AUC for ESIA and all cited baselines, (ii) add ablation studies that isolate the contribution of the pairwise interaction potentials and the structural consistency penalties, and (iii) include a short failure-case analysis. The abstract will be updated with a concise statement of the observed gains. revision: yes

  3. Referee: [§3.2] §3.2 (energy function and structural terms): The structural consistency penalties are introduced as independently motivated constraints that require no per-interaction ground truth, yet no explicit equations show how they are added to E or how their weights are chosen; this leaves open whether the terms are load-bearing or merely decorative.

    Authors: We agree that the structural consistency terms require a more explicit treatment. These penalties are defined on logical contradictions (e.g., inconsistent crossing intentions among spatially interacting pedestrians) and are added directly to the global energy E as additional weighted summands. In the revision we will supply the exact mathematical expressions for each penalty, show how they are summed into E, and describe the weight-selection procedure (grid search on a held-out validation split that balances consistency enforcement against unary and pairwise fidelity). This will make clear that the terms measurably affect both energy minimization and final accuracy. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the ESIA derivation chain

full rationale

The paper constructs ESIA as a CRF over a spatiotemporal graph with unary potentials for individual pedestrian intentions, pairwise potentials for social/environmental interactions, a global energy function for scene consistency, and added structural consistency penalties, all optimized by the proposed U-SSA algorithm. These components are presented as independently motivated modeling choices that extend standard CRF structured prediction to the pedestrian intention task; no equation or claim reduces a derived quantity (such as a prediction or consistency guarantee) to a fitted parameter or self-citation by construction. The central performance claims rest on empirical benchmark results rather than tautological equivalence to the input design decisions.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

Only the abstract is available; full model equations, parameter counts, and training procedures are inaccessible. The ledger therefore records only the high-level modeling assumptions stated in the abstract.

free parameters (1)
  • Unary and pairwise potential parameters
    Parameters that define individual intention scores and interaction strengths; standard in CRF models and must be learned or tuned from data.
axioms (2)
  • domain assumption Pedestrian and environmental elements can be represented as nodes whose behaviors are captured by unary potentials and whose interactions are captured by pairwise potentials.
    Core modeling choice of the graph-based CRF formulation.
  • ad hoc to paper Structural consistency terms can penalize logical contradictions in behavioral predictions without requiring ground-truth supervision for every interaction.
    Introduced specifically to constrain inference in the absence of full labels.
invented entities (1)
  • Unary-Seeded Simulated Annealing (U-SSA) algorithm no independent evidence
    purpose: Rapidly converge to high-quality solutions by seeding the annealing process with high-confidence unary priors.
    Novel optimization procedure proposed to solve the global energy minimization efficiently.

pith-pipeline@v0.9.0 · 5548 in / 1591 out tokens · 33832 ms · 2026-05-08T06:31:22.924156+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

66 extracted references · 1 canonical work pages

  1. [1]

    Predicting pedestrian crossing intention with feature fusion and spatio- temporal attention,

    D. Yang, H. Zhang, E. Yurtsever, K. A. Redmill, and U. Ozguner, “Predicting pedestrian crossing intention with feature fusion and spatio- temporal attention,”IEEE Transactions on Intelligent Vehicles, vol. 7, no. 2, pp. 221–230, 2022. 1, 6, 7

  2. [2]

    Visual reasoning using graph con- volutional networks for predicting pedestrian crossing intention,

    T. Chen, R. Tian, and Z. Ding, “Visual reasoning using graph con- volutional networks for predicting pedestrian crossing intention,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3103–3109. 1

  3. [3]

    Early warning pedestrian crossing intention from its head gesture using head pose estimation,

    M. I. Perdana, W. Anggraeni, H. A. Sidharta, E. M. Yuniarno, and M. H. Purnomo, “Early warning pedestrian crossing intention from its head gesture using head pose estimation,” in2021 International seminar on intelligent technology and its applications (ISITIA). IEEE, 2021, pp. 402–407. 1

  4. [4]

    nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631. 1

  5. [5]

    Waymo open dataset: Panoramic video panoptic segmenta- tion,

    J. Mei, A. Z. Zhu, X. Yan, H. Yan, S. Qiao, L.-C. Chen, and H. Kret- zschmar, “Waymo open dataset: Panoramic video panoptic segmenta- tion,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 53–72. 1

  6. [6]

    Trafficpredict: Trajectory prediction for heterogeneous traffic-agents,

    Y . Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha, “Trafficpredict: Trajectory prediction for heterogeneous traffic-agents,” inProceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 6120–6127. 1

  7. [7]

    Social force model for pedestrian dynamics,

    D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical review E, vol. 51, no. 5, p. 4282, 1995. 1, 2

  8. [8]

    Social lstm: Human trajectory prediction in crowded spaces,

    A. Alahi, K. Goel, V . Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971. 1

  9. [9]

    Autonomous driving system: A comprehensive survey,

    J. Zhao, W. Zhao, B. Deng, Z. Wang, F. Zhang, W. Zheng, W. Cao, J. Nan, Y . Lian, and A. F. Burke, “Autonomous driving system: A comprehensive survey,”Expert Systems with Applications, vol. 242, p. 122836, 2024. 1

  10. [10]

    Planning-oriented autonomous driving,

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862. 1

  11. [11]

    Coupling intention and actions of vehicle–pedestrian interaction: A virtual reality experiment study,

    M. Dang, Y . Jin, P. Hang, L. Crosato, Y . Sun, and C. Wei, “Coupling intention and actions of vehicle–pedestrian interaction: A virtual reality experiment study,”Accident Analysis & Prevention, vol. 203, p. 107639,

  12. [12]

    Pit: Progressive interaction transformer for pedestrian crossing intention prediction,

    Y . Zhou, G. Tan, R. Zhong, Y . Li, and C. Gou, “Pit: Progressive interaction transformer for pedestrian crossing intention prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 12, pp. 14 213–14 225, 2023. 1, 6, 7

  13. [13]

    Benchmark for evaluating pedestrian action prediction,

    I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Benchmark for evaluating pedestrian action prediction,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 1258–1268. 1, 6, 7

  14. [14]

    Trouspi- net: Spatio-temporal attention on parallel atrous convolutions and u- grus for skeletal pedestrian crossing prediction,

    J. Gesnouin, S. Pechberti, B. Stanciulcscu, and F. Moutarde, “Trouspi- net: Spatio-temporal attention on parallel atrous convolutions and u- grus for skeletal pedestrian crossing prediction,” in2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). IEEE, 2021, pp. 01–07. 1, 6, 7

  15. [15]

    Pedestrian crossing intention prediction based on cross-modal transformer and uncertainty-aware multi-task learning for autonomous driving,

    X. Chen, S. Zhang, J. Li, and J. Yang, “Pedestrian crossing intention prediction based on cross-modal transformer and uncertainty-aware multi-task learning for autonomous driving,”IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 9, pp. 12 538–12 549,

  16. [16]

    Context- based pedestrian path prediction,

    J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila, “Context- based pedestrian path prediction,” inEuropean Conference on Computer Vision. Springer, 2014, pp. 618–633. 1, 2, 3

  17. [17]

    Pedestrian path prediction with recursive bayesian filters: A comparative study,

    N. Schneider and D. M. Gavrila, “Pedestrian path prediction with recursive bayesian filters: A comparative study,” inGerman conference on pattern recognition. Springer, 2013, pp. 174–183. 1

  18. [18]

    Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction,

    A. Mohamed, K. Qian, M. Elhoseiny, and C. Claudel, “Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 14 424–14 432. 1, 2

  19. [19]

    Pedestrian graph+: A fast pedestrian crossing prediction model based on graph convolutional networks,

    P. R. G. Cadena, Y . Qian, C. Wang, and M. Yang, “Pedestrian graph+: A fast pedestrian crossing prediction model based on graph convolutional networks,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 21 050–21 061, 2022. 1, 6, 7, 8

  20. [20]

    Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting,

    Y . Yuan, X. Weng, Y . Ou, and K. M. Kitani, “Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9813–9823. 1, 2

  21. [21]

    Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior,

    A. Rasouli, I. Kotseruba, and J. K. Tsotsos, “Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 206–213. 1, 6

  22. [22]

    Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,

    A. Rasouli, I. Kotseruba, T. Kunic, and J. K. Tsotsos, “Pie: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6262–6271. 1, 6

  23. [23]

    Pedestrian trajectory prediction based on social interactions learning with random weights,

    J. Xie, S. Zhang, B. Xia, Z. Xiao, H. Jiang, S. Zhou, Z. Qin, and H. Chen, “Pedestrian trajectory prediction based on social interactions learning with random weights,”IEEE Transactions on Multimedia, vol. 26, pp. 7503–7515, 2024. 1

  24. [24]

    Space-time- separable graph convolutional network for pose forecasting,

    T. Sofianos, A. Sampieri, L. Franco, and F. Galasso, “Space-time- separable graph convolutional network for pose forecasting,” inPro- ceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 11 209–11 218. 1

  25. [25]

    Gtranspdm: a graph-embedded transformer with positional decoupling for pedestrian crossing intention prediction,

    C. Xie, C. Lin, X. Zheng, B. Gong, and A. M. L ´opez, “Gtranspdm: a graph-embedded transformer with positional decoupling for pedestrian crossing intention prediction,”IEEE Signal Processing Letters, 2025. 1

  26. [26]

    Trep: Transformer-based evidential prediction for pedestrian intention with uncertainty,

    Z. Zhang, R. Tian, and Z. Ding, “Trep: Transformer-based evidential prediction for pedestrian intention with uncertainty,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3534–3542. 1

  27. [27]

    Pedestrian intention prediction via vision-language foundation models,

    M. Azarmi, M. Rezaei, and H. Wang, “Pedestrian intention prediction via vision-language foundation models,” in2025 IEEE Intelligent Vehi- cles Symposium (IV). IEEE, 2025, pp. 1899–1904. 2

  28. [28]

    Context-based detection of pedes- trian crossing intention for autonomous driving in urban environments,

    F. Schneemann and P. Heinemann, “Context-based detection of pedes- trian crossing intention for autonomous driving in urban environments,” in2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 2243–2248. 2

  29. [29]

    A probabilistic model for the estimation of pedestrian crossing behavior at signalized intersections,

    Y . Hashimoto, G. Yanlei, L.-T. Hsu, and K. Shunsuke, “A probabilistic model for the estimation of pedestrian crossing behavior at signalized intersections,” in2015 IEEE 18th International Conference on Intelligent Transportation Systems, 2015, pp. 1520–1526. 2

  30. [30]

    What the constant velocity model can teach us about pedestrian motion prediction,

    C. Sch ¨oller, V . Aravantinos, F. Lay, and A. Knoll, “What the constant velocity model can teach us about pedestrian motion prediction,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1696–1703, 2020. 2

  31. [31]

    Graph-sim: A graph-based spatiotemporal interaction modelling for pedestrian action prediction,

    T. Yau, S. Malekmohammadi, A. Rasouli, P. Lakner, M. Rohani, and J. Luo, “Graph-sim: A graph-based spatiotemporal interaction modelling for pedestrian action prediction,” in2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 8580–8586. 2

  32. [32]

    Reliable few-shot learning under dual noises,

    J. Zhang, J. Song, L. Gao, N. Sebe, and H. T. Shen, “Reliable few-shot learning under dual noises,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

  33. [33]

    Sparse pedestrian character learning for trajectory prediction,

    Y . Dong, L. Wang, S. Zhou, G. Hua, and C. Sun, “Sparse pedestrian character learning for trajectory prediction,”IEEE Transactions on Multimedia, 2024. 2

  34. [34]

    A social force based pedestrian motion model considering multi-pedestrian interaction with a vehicle,

    D. Yang, ¨U. ¨Ozg¨uner, and K. Redmill, “A social force based pedestrian motion model considering multi-pedestrian interaction with a vehicle,” ACM Transactions on Spatial Algorithms and Systems (TSAS), vol. 6, no. 2, pp. 1–27, 2020. 2

  35. [35]

    From channel bias to feature redundancy: Uncovering the

    J. Zhang, X. Luo, L. Gao, D. Zou, H. Shen, and J. Song, “From channel bias to feature redundancy: Uncovering the ”less is more” principle in 12 few-shot learning,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 2

  36. [36]

    Completed interaction networks for pedestrian trajectory prediction,

    Z. Zhang, J. Zhou, S. Liu, and B. Xiao, “Completed interaction networks for pedestrian trajectory prediction,”IEEE Transactions on Multimedia, vol. 27, pp. 5119–5129, 2025. 2

  37. [37]

    Sgcn: Sparse graph convolution network for pedestrian trajectory prediction,

    L. Shi, L. Wang, C. Long, S. Zhou, M. Zhou, Z. Niu, and G. Hua, “Sgcn: Sparse graph convolution network for pedestrian trajectory prediction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8994–9003. 2

  38. [38]

    Spatio-temporal graph transformer networks for pedestrian trajectory prediction,

    C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi, “Spatio-temporal graph transformer networks for pedestrian trajectory prediction,” inEuropean conference on computer vision. Springer, 2020, pp. 507–523. 2

  39. [39]

    Query-centric trajectory prediction,

    Z. Zhou, J. Wang, Y .-H. Li, and Y .-K. Huang, “Query-centric trajectory prediction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 863–17 873. 2

  40. [40]

    Conditional random fields as recurrent neural networks,

    S. Zheng, S. Jayasumana, B. Romera-Paredes, V . Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr, “Conditional random fields as recurrent neural networks,” inProceedings of the IEEE international conference on computer vision, 2015, pp. 1529–1537. 2

  41. [41]

    Detecting phishing scams on ethereum using graph convolutional networks with conditional random field,

    W. Hou, B. Cui, and R. Li, “Detecting phishing scams on ethereum using graph convolutional networks with conditional random field,” in2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (H...

  42. [42]

    Msd-crfs: Multi-scale dual aggregation conditional random fields for monocular depth estimation,

    X. Zhang, J. Wei, A. Moteki, Y . Kobayashi, G. Suzuki, and Z. Tan, “Msd-crfs: Multi-scale dual aggregation conditional random fields for monocular depth estimation,” in2024 IEEE International Conference on Image Processing (ICIP). IEEE, 2024, pp. 2001–2007. 2

  43. [43]

    A novel constructive unceasement conditional random field and dynamic bayesian network model for attack prediction on internet of vehicle,

    R. K. Mahendran, S. Rajendran, P. Pandian, R. S. Rathore, F. Benedetto, and R. H. Jhaveri, “A novel constructive unceasement conditional random field and dynamic bayesian network model for attack prediction on internet of vehicle,”IEEE Access, vol. 12, pp. 24 644–24 658, 2024. 2

  44. [44]

    Multimodal parallel attention network for medical image segmentation,

    Z. Wang, W. Wang, N. Li, S. Zhang, Q. Chen, and Z. Jiang, “Multimodal parallel attention network for medical image segmentation,”Image and Vision Computing, vol. 147, p. 105069, 2024. 2

  45. [45]

    Tyche: Stochastic in-context learning for medical image segmentation,

    M. Rakic, H. E. Wong, J. J. G. Ortiz, B. A. Cimini, J. V . Guttag, and A. V . Dalca, “Tyche: Stochastic in-context learning for medical image segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 159–11 173. 2

  46. [46]

    Efficient aerial images algorithms over multi- objects labeling and semantic segmentation,

    A. Naseer and A. Jalal, “Efficient aerial images algorithms over multi- objects labeling and semantic segmentation,” in2024 5th International Conference on Advancements in Computational Sciences (ICACS). IEEE, 2024, pp. 1–9. 2

  47. [47]

    Adaptive attention and feature embedding for enhanced entity extraction using an improved bert model,

    L. Wu, J. Gao, X. Liao, H. Zheng, J. Hu, and R. Bao, “Adaptive attention and feature embedding for enhanced entity extraction using an improved bert model,” in2024 4th International Conference on Communication Technology and Information Technology (ICCTIT). IEEE, 2024, pp. 702–705. 2

  48. [48]

    Application of bilstm-crf model with different embeddings for product name extraction in unstructured turkish text,

    S. Arslan, “Application of bilstm-crf model with different embeddings for product name extraction in unstructured turkish text,”Neural Com- puting and Applications, vol. 36, no. 15, pp. 8371–8382, 2024. 2

  49. [49]

    A comprehensive review of markov random field and conditional random field approaches in pathology image analysis,

    Y . Li, C. Li, X. Li, K. Wang, M. M. Rahaman, C. Sun, H. Chen, X. Wu, H. Zhang, and Q. Wang, “A comprehensive review of markov random field and conditional random field approaches in pathology image analysis,”Archives of Computational Methods in Engineering, vol. 29, no. 1, pp. 609–639, 2022. 3

  50. [50]

    Vanmarcke,Random fields: analysis and synthesis

    E. Vanmarcke,Random fields: analysis and synthesis. World Scientific,

  51. [51]

    Pedestrian intention recognition using latent-dynamic conditional random fields,

    A. T. Schulz and R. Stiefelhagen, “Pedestrian intention recognition using latent-dynamic conditional random fields,” in2015 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2015, pp. 622–627. 3

  52. [52]

    Prospects of structural similarity index for medical image analysis,

    V . Mudeng, M. Kim, and S.-w. Choe, “Prospects of structural similarity index for medical image analysis,”Applied Sciences, vol. 12, no. 8, p. 3754, 2022. 4

  53. [53]

    Mobilevit: light-weight, general- purpose, and mobile-friendly vision transformer,

    S. Mehta and M. Rastegari, “Mobilevit: light-weight, general- purpose, and mobile-friendly vision transformer,”arXiv preprint arXiv:2110.02178, 2021. 4

  54. [54]

    Attention is all you need,

    A. Vaswani, “Attention is all you need,”Advances in Neural Information Processing Systems, 2017. 5

  55. [55]

    Simulated annealing: A review and a new scheme,

    T. Guilmeau, E. Chouzenoux, and V . Elvira, “Simulated annealing: A review and a new scheme,” in2021 IEEE statistical signal processing workshop (SSP). IEEE, 2021, pp. 101–105. 6

  56. [56]

    Do they want to cross? understanding pedestrian intention for behavior prediction,

    I. Kotseruba, A. Rasouli, and J. K. Tsotsos, “Do they want to cross? understanding pedestrian intention for behavior prediction,” in2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1688–1693. 6, 7

  57. [57]

    Pedestrian safety by intent prediction: A lightweight lstm-attention architecture and experimental evaluations with real-world datasets,

    A. Alofi, R. Greer, A. Gopalkrishnan, and M. Trivedi, “Pedestrian safety by intent prediction: A lightweight lstm-attention architecture and experimental evaluations with real-world datasets,” in2024 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2024, pp. 77–84. 6, 7

  58. [58]

    Predicting pedestrian crossing intentions in adverse weather with self-attention models,

    A. Elgazwy, K. Elgazzar, and A. Khamis, “Predicting pedestrian crossing intentions in adverse weather with self-attention models,”IEEE Trans- actions on Intelligent Transportation Systems, vol. 26, no. 3, pp. 3250– 3261, 2025. 6, 7

  59. [59]

    Long–short observation- driven prediction network for pedestrian crossing intention prediction with momentary observation,

    H. Liu, C. Liu, F. Chang, Y . Lu, and M. Liu, “Long–short observation- driven prediction network for pedestrian crossing intention prediction with momentary observation,”Neurocomputing, vol. 614, p. 128824,

  60. [60]

    Recurrent neural networks,

    L. R. Medsker, L. Jainet al., “Recurrent neural networks,”Design and applications, vol. 5, no. 64-67, p. 2, 2001. 7

  61. [61]

    Multi-scale hierarchical recurrent neural net- works for hyperspectral image classification,

    C. Shi and C.-M. Pun, “Multi-scale hierarchical recurrent neural net- works for hyperspectral image classification,”Neurocomputing, vol. 294, pp. 82–93, 2018. 7

  62. [62]

    Learning spatiotemporal features with 3d convolutional networks,

    D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” inProceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497. 7, 8

  63. [63]

    Quo vadis, action recognition? a new model and the kinetics dataset,

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308. 7, 8

  64. [64]

    Multi- modality sensing and data fusion for multi-vehicle detection,

    D. Roy, Y . Li, T. Jian, P. Tian, K. Chowdhury, and S. Ioannidis, “Multi- modality sensing and data fusion for multi-vehicle detection,”IEEE Transactions on Multimedia, vol. 25, pp. 2280–2295, 2022. 6 Yanping Wureceived her B.S. degree from South- west Jiaotong University, Chengdu, China, in 2017, and her M.S. degree in 2020. From 2020 to 2023, she worked...

  65. [65]

    degree with the James Watt School of Engineering, University of Glasgow, U.K

    He is currently pursuing the Ph.D. degree with the James Watt School of Engineering, University of Glasgow, U.K. His research interests include multi- modal understanding and generation, and interactive intelligent systems. 13 Edmond S. L. Hois currently an Associate Pro- fessor in the School of Computing Science at the University of Glasgow, UK. He was a...

  66. [66]

    He serves as an Associate Editor-in-Chief of Neurocomputing and an Associate Editor for IEEE TII, IEEE TIM,IEEE T-ICPS

    Currently, he is an Associate Professor at Uni- versity of Glasgow, UK. He serves as an Associate Editor-in-Chief of Neurocomputing and an Associate Editor for IEEE TII, IEEE TIM,IEEE T-ICPS. His research interests include efficient AI and smart city. Chongfeng Wei (Senior Member, IEEE)received his Ph.D. degree in mechanical engineering from the Universit...