Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations

Baozhi Jia; Bingyi Xia; Han Bao; Hanjing Ye; Hao Cheng; Jiankun Wang; Wenjun Xu; Yu Zhan

arxiv: 2606.26047 · v1 · pith:4Z3YPQUZnew · submitted 2026-06-24 · 💻 cs.RO

Han Bao, Bingyi Xia, Hanjing Ye, Yu Zhan, Hao Cheng, Baozhi Jia, Wenjun Xu, Jiankun Wang

Pith reviewed 2026-06-25 19:04 UTC · model grok-4.3

keywords: crowd navigation, visual navigation, intention inference, deep reinforcement learning, scene representation, attention mechanism, robot navigation, egocentric vision

The pith

Robot crowd navigation improves when visual observations are used to infer pedestrian intentions via attention encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a robot can learn better crowd-navigation policies by building intention-aware scene representations from egocentric visual observations. A spatio-temporal encoder extracts occupancy features, and an attention-based module uses human poses to infer motion intentions. The two are combined into a state for training a DRL policy. A reader would care if this leads to robots that can navigate dense crowds using only cameras, without the limitation of treating people as simple points. The work matters because it connects visual perception with reinforcement learning for more practical robot behavior in human spaces.

Core claim

The paper claims that encoding behavioral and structural context from egocentric visual observations yields a compact state embedding that supports better DRL policy training for crowd navigation than baselines using limited representations. The encoding relies on a spatio-temporal encoder for occupancy features and on the Intent-Interact Former for inferring motion intentions from human poses.

What carries the argument

The Intent-Interact Former (I² Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions, together with a spatio-temporal encoder for scene occupancy features.
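To make the mechanism concrete, here is a minimal sketch of attention pooling over per-pedestrian pose embeddings, concatenated with an occupancy feature vector. It is not the paper's I² Former. The function names (`attention_pool`, `build_state`), the toy vectors, and the single robot-side query are all assumptions chosen for illustration, and plain scaled dot-product attention stands in for whatever the authors actually implement.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, keys, values):
    """Scaled dot-product attention: one query attends over N key/value pairs."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


def build_state(occupancy_feat, robot_query, pose_embeddings):
    """Hypothetical fusion: concatenate occupancy features with pooled pose features."""
    intent_feat, weights = attention_pool(robot_query, pose_embeddings, pose_embeddings)
    return occupancy_feat + intent_feat, weights


# Toy inputs (made up): 2-D occupancy feature, 3-D query, three pedestrians.
occupancy_feat = [0.2, 0.7]
robot_query = [1.0, 0.0, 0.0]
pose_embeddings = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

state, weights = build_state(occupancy_feat, robot_query, pose_embeddings)
print(len(state), weights)
```

In this toy setup the first pedestrian's embedding aligns most with the query, so it receives the largest weight. The resulting `state` is the kind of compact vector a DRL policy could consume.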

Load-bearing premise

Egocentric visual observations provide rich enough behavioral and structural context to reliably infer motion intentions and support effective DRL policy learning.

What would settle it

A scenario or test case where the visual input leads to incorrect intention inference, such as deceptive human poses, resulting in navigation failure despite the method's training.

Figures

Figures reproduced from arXiv: 2606.26047 by Baozhi Jia, Bingyi Xia, Han Bao, Hanjing Ye, Hao Cheng, Jiankun Wang, Wenjun Xu, Yu Zhan.

**Figure 2.** Our method consists of three primary components: a feature extraction module, a feature fusion module, and a DRL network. It takes multi-timestep …

**Figure 3.** Our method includes two key components: a spatio-temporal encoder …

**Figure 4.** Policy training environment in SocNav-Gym, featuring common social …

**Figure 5.** Experimental environments for navigation policy testing. (a) Office lobby with a width of 7.0 m. (b) Hospital corridor with a width of 4.0 m. (c) …

**Figure 6.** Example trajectories for the compared policies with nine static obstacles and four SFM agents. The robot trajectory is color-coded with the viridis …

**Figure 7.** Two long-distance navigation scenarios. Each scenario uses a generalized Voronoi graph to generate a topological map. In each map, …

Original abstract

Robot crowd navigation requires the ability to infer human intentions while accounting for the structural constraints of the environment. Currently, deep reinforcement learning (DRL) provides a promising method for learning navigation policies that understand human intentions. However, most of them rely on limited scene representations, treating pedestrians as simple 2D points and ignoring rich visual cues from both humans and the environment. To address this issue, we introduce iCrowdNav, a novel visual crowd navigation method with intention-aware scene representations, to encode behavioral and structural context from egocentric visual observations. Our method employs two key components: a spatio-temporal encoder for extracting occupancy features of the scene, and Intent-Interact Former (I$^2$ Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions. These features are integrated into a compact state embedding that supports effective DRL policy training. Extensive experiments show that our method achieves superior performance over baselines, and real-world deployment demonstrates vision-based crowd navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a pose-based attention module for intention inference in visual DRL crowd navigation, but the gains are not isolated from the rest of the visual encoder.

This paper's main new element is the Intent-Interact Former, an attention module that takes human poses from egocentric views to infer motion intentions, combined with a spatio-temporal occupancy encoder. The result feeds a compact state into DRL for robot navigation that avoids treating people as simple points.

The setup is reasonable. Using visual observations for both behavioral context and scene structure is a direct response to a known limit in prior point-based methods, and the real-world robot test is a practical check.

The soft spot is exactly the one in the stress-test note. There is no separate metric or ablation showing that the I² Former outputs track actual pedestrian trajectories or intentions better than a generic encoder would. Performance edges could come from richer occupancy features or model size alone. The abstract claims clear wins over baselines, but without the numbers, protocol details, or error bars in view, the size and reliability of the improvement stay hard to judge.
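One way to put error bars on a reported success rate is a Wilson score interval over the test episodes. The sketch below assumes each episode is an independent success or failure. The function name and the episode counts are made up for illustration and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: two policies evaluated on 100 episodes each.
ours_lo, ours_hi = wilson_interval(90, 100)
base_lo, base_hi = wilson_interval(85, 100)

# True when the two intervals share any part of the [0, 1] range.
intervals_overlap = ours_lo <= base_hi and base_lo <= ours_hi
print((ours_lo, ours_hi), (base_lo, base_hi), intervals_overlap)
```

When two policies' intervals overlap, as they do for these made-up counts, the success rates alone do not separate them, which is why episode counts and error bars matter for judging the claimed improvement.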

The work is coherent on its own terms with no obvious circularity or unfalsifiable claims. It is aimed at people building service robots that move among people. A reader working on visual navigation or intention modeling would find the components and integration useful to examine. The paper shows clear thinking about the problem and standard engagement with the literature, so it deserves a serious referee even if the experiments need tighter controls on the intention component.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes iCrowdNav, a visual crowd navigation method that encodes behavioral and structural context from egocentric visual observations via a spatio-temporal encoder for occupancy features and the attention-based Intent-Interact Former (I² Former) module to infer pedestrians' motion intentions from human poses. These are fused into a compact state embedding for DRL policy training. The abstract claims that extensive experiments demonstrate superior performance over baselines and that real-world deployment validates vision-based crowd navigation.

Significance. If the empirical results hold with proper validation, the work could meaningfully advance DRL-based visual navigation by showing the value of intention-aware representations over simpler point-based or occupancy encodings. The focus on egocentric visuals for both humans and environment is a reasonable direction. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described.

major comments (2)

[Abstract] Abstract: the central claim that 'extensive experiments show that our method achieves superior performance over baselines' supplies no quantitative metrics, baseline descriptions, experiment protocols, success rates, or error analysis, rendering the claim unverifiable from the provided text.
[Abstract] Abstract / I² Former description: no isolated metric (e.g., intention prediction error vs. future pedestrian positions/velocities or ablation against a generic spatio-temporal encoder) is reported to show that the attention mechanism extracts motion-intention signals rather than merely increasing feature dimensionality; this is load-bearing for attributing gains to the intention-aware component.

minor comments (1)

[Abstract] Abstract: the I² Former acronym is used before its expansion is given.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the claims would be strengthened by including quantitative details and will revise the abstract accordingly. We address each major comment below.

Referee: [Abstract] Abstract: the central claim that 'extensive experiments show that our method achieves superior performance over baselines' supplies no quantitative metrics, baseline descriptions, experiment protocols, success rates, or error analysis, rendering the claim unverifiable from the provided text.

Authors: We agree that the abstract as written does not include specific quantitative metrics or protocol details, which limits verifiability. In the revised manuscript we will update the abstract to incorporate key results from the experiments section, including representative success rates, collision rates, and a concise description of the main baselines and evaluation protocol. revision: yes

Referee: [Abstract] Abstract / I² Former description: no isolated metric (e.g., intention prediction error vs. future pedestrian positions/velocities or ablation against a generic spatio-temporal encoder) is reported to show that the attention mechanism extracts motion-intention signals rather than merely increasing feature dimensionality; this is load-bearing for attributing gains to the intention-aware component.

Authors: The full manuscript contains ablation studies that isolate the contribution of the I² Former module to overall navigation performance. However, we acknowledge that a direct metric of intention prediction accuracy (e.g., error against future positions or velocities) is not reported. To strengthen attribution of gains to the intention-aware component, we will add such an isolated evaluation in the revised version. revision: yes
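A direct intention metric of the kind discussed here could be as simple as the average and final displacement error (ADE and FDE) between predicted and observed pedestrian positions. The sketch below is an assumption about how such an evaluation might look, not the authors' protocol. The function name and the toy trajectories are hypothetical.

```python
import math


def displacement_errors(predicted, actual):
    """Return (ADE, FDE) for two equal-length lists of (x, y) positions in metres."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [
        math.hypot(px - ax, py - ay)
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    ade = sum(dists) / len(dists)  # mean error over all future steps
    fde = dists[-1]                # error at the final step only
    return ade, fde


# Toy trajectories (made up): the prediction goes straight, the pedestrian veers off.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
actual = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

ade, fde = displacement_errors(predicted, actual)
print(ade, fde)
```

Reporting such errors for the intention module and for a generic encoder on the same held-out trajectories would show whether the module tracks pedestrian motion rather than only adding capacity.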

Circularity Check

0 steps flagged

No circularity; method description contains no derivations or fitted predictions

The paper describes an empirical DRL-based navigation method using a spatio-temporal encoder and I² Former attention module on visual observations, with performance evaluated via experiments against baselines. No equations, parameter-fitting steps, or first-principles derivations appear in the provided text that could reduce by construction to inputs, self-citations, or renamed patterns. Claims rest on external experimental outcomes rather than any internal derivation chain, making the work self-contained against the listed circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no details on parameters, axioms, or new entities are provided in the text.

Reference graph

Works this paper leans on

36 extracted references · 1 linked inside Pith

[1] M. Lu et al., “Fapp: Fast and adaptive perception and planning for uavs in dynamic cluttered environments,” IEEE Transactions on Robotics, vol. 41, pp. 871–886, 2025.
[2] H. Ye et al., “Rpf-search: Field-based search for robot person following in unknown dynamic environments,” IEEE/ASME Transactions on Mechatronics, pp. 1–12, 2025.
[3] Z. Sun et al., “Namr-rrt: Neural adaptive motion planning for mobile robots in dynamic environments,” IEEE Transactions on Automation Science and Engineering, vol. 22, pp. 13 087–13 100, 2025.
[4] J. Peng et al., “A dual closed-loop control strategy for human-following robots respecting social space,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 11 252–11 258.
[5] C. Chen et al., “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6015–6022.
[6] M. Everett, Y. F. Chen, and J. P. How, “Motion planning among dynamic, decision-making agents with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3052–3059.
[7] S. Liu et al., “Intention aware robot crowd navigation with attention-based interaction graph,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 12 015–12 021.
[8] S. S. Samsani and M. S. Muhammad, “Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5223–5230, 2021.
[9] L. Liu et al., “Robot navigation in crowded environments using deep reinforcement learning,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5671–5677.
[10] Z. Xie and P. Dames, “Drl-vo: Learning to navigate through crowded dynamic scenes using velocity obstacles,” IEEE Transactions on Robotics, vol. 39, no. 4, pp. 2700–2719, 2023.
[11] D. Dugas et al., “Navdreams: Towards camera-only rl navigation among humans,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 2504–2511.
[12] Y. Ma et al., “Vision-centric bev perception: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 978–10 997, 2024.
[13] T. Salzmann et al., “Robots that can see: Leveraging human pose for trajectory prediction,” IEEE Robotics and Automation Letters, vol. 8, no. 11, pp. 7090–7097, 2023.
[14] Y. Gao, S. Saadatnejad, and A. Alahi, “Social-pose: Enhancing trajectory prediction with human body pose,” IEEE Transactions on Intelligent Transportation Systems, 2025.
[15] P. T. Singamaneni et al., “A survey on socially aware robot navigation: Taxonomy and future challenges,” The International Journal of Robotics Research, vol. 43, no. 10, pp. 1533–1572, 2024.
[16] D. Song et al., “Vlm-social-nav: Socially aware robot navigation through scoring using vision-language models,” IEEE Robotics and Automation Letters, 2024.
[17] A. Payandeh et al., “Social-llava: Enhancing robot navigation through human-language reasoning in social spaces,” arXiv preprint arXiv:2501.09024, 2024.
[18] S. Luo et al., “Gson: A group-based social navigation framework with large multimodal model,” IEEE Robotics and Automation Letters, 2025.
[19] D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical Review E, vol. 51, no. 5, p. 4282, 1995.
[20] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 2002.
[21] C. Zhou et al., “Human-behaviour-based social locomotion model improves the humanization of social robots,” Nature Machine Intelligence, vol. 4, no. 11, pp. 1040–1052, 2022.
[22] S. Yao et al., “Crowd-aware robot navigation for pedestrians with multiple collision avoidance strategies via map-based deep reinforcement learning,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 8144–8150.
[23] Y. Cui et al., “Learning world transition model for socially aware robot navigation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 9262–9268.
[24] H. Yang et al., “Rmrl: Robot navigation in crowd environments with risk map-based deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 8, no. 12, pp. 7930–7937, 2023.
[25] K. Zhu et al., “Collision avoidance among dense heterogeneous agents using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 8, no. 1, pp. 57–64, 2023.
[26] R. Guldenring et al., “Learning local planners for human-aware navigation in indoor environments,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 6053–6060.
[27] J. Jiang et al., “Bevnav: Robot autonomous navigation via spatial-temporal contrastive learning in bird’s-eye view,” IEEE Robotics and Automation Letters, 2024.
[28] A. Hu et al., “Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 273–15 282.
[29] K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[30] H. Caesar et al., “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 621–11 631.
[31] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics yolov8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[32] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[33] Z. Xu et al., “Navrl: Learning safe flight in dynamic environments,” IEEE Robotics and Automation Letters, 2025.
[34] A. Stratton, K. Hauser, and C. Mavrogiannis, “Characterizing the complexity of social robot navigation scenarios,” IEEE Robotics and Automation Letters, 2024.
[35] D. Shah et al., “ViNT: A foundation model for visual navigation,” in 7th Annual Conference on Robot Learning, 2023. [Online]. Available: https://arxiv.org/abs/2306.14846
[36] J. Wang and M. Q.-H. Meng, “Optimal path planning using generalized voronoi graph and multiple potential functions,” IEEE Transactions on Industrial Electronics, vol. 67, no. 12, pp. 10 621–10 630, 2020.
