pith. machine review for the scientific record.

arXiv: 2604.02829 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.RO

STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3

classification: 💻 cs.CV · cs.RO
keywords: visual navigation · spatio-temporal fusion · graph reasoning · temporal shift · goal-conditioned control · feature encoding · robotic vision

The pith

A new spatio-temporal fusion module uses graph reasoning per frame and hybrid temporal shifts to better preserve visual details for robot goal navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard feature encoders and temporal pooling in visual navigation agents discard fine-grained spatial and temporal structure from first-person image sequences, which limits accurate action prediction and progress estimation toward a specified goal image. It introduces a unified framework that extracts features from observation sequences and goal views, then fuses them through a module performing spatial graph reasoning inside each frame along with hybrid temporal modeling via shift operations and multi-resolution difference-aware convolutions. This richer representation supports goal-conditioned control policies. A sympathetic reader would care because retaining that structure could yield navigation agents that succeed more reliably without depending on elaborate policy heads or extra data. The reported experiments show consistent performance gains over baselines, framing the encoder as a reusable backbone.
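As a reading aid, here is a minimal PyTorch-style sketch of how such a framework could be wired, under assumed module and tensor names (the authors' released code is the authoritative reference): per-frame features are extracted from the observation sequence and the goal image, fused by the spatio-temporal module, and decoded into an action and a progress estimate.

import torch
import torch.nn as nn

class GoalConditionedNavigator(nn.Module):
    # Hypothetical wiring of the described pipeline; all names are illustrative.
    def __init__(self, encoder: nn.Module, fusion: nn.Module, action_dim: int = 2):
        super().__init__()
        self.encoder = encoder            # per-frame visual feature extractor
        self.fusion = fusion              # spatio-temporal fusion module (sketched further below)
        self.action_head = nn.Linear(fusion.out_dim, action_dim)
        self.progress_head = nn.Linear(fusion.out_dim, 1)

    def forward(self, obs_seq: torch.Tensor, goal_img: torch.Tensor):
        # obs_seq: (B, T, C, H, W) first-person frames; goal_img: (B, C, H, W)
        B, T = obs_seq.shape[:2]
        obs_feat = self.encoder(obs_seq.flatten(0, 1)).unflatten(0, (B, T))
        goal_feat = self.encoder(goal_img)
        fused = self.fusion(obs_feat, goal_feat)              # (B, fusion.out_dim)
        action = self.action_head(fused)                      # action prediction
        progress = torch.sigmoid(self.progress_head(fused))   # progress toward the goal
        return action, progress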

Core claim

The authors establish that their spatio-temporal fusion module, which performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution, extracts a richer representation from visual sequences and goal observations, leading to improved navigation performance and serving as a generalizable visual backbone for goal-conditioned control.

What carries the argument

Spatio-temporal fusion module that integrates spatial graph reasoning for intra-frame relations with hybrid temporal shift operations and multi-resolution difference-aware convolutions to capture dynamics across frames while fusing sequence and goal features.
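A hedged sketch of what such a block could look like, in the same PyTorch style as above; the k-nearest-neighbour graph construction, shift ratio, difference scales, and goal fusion are editorial assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    # Editorial sketch of the described block: intra-frame graph reasoning, a partial
    # temporal channel shift, and dilated convolutions over frame-to-frame differences.
    def __init__(self, dim: int, k: int = 8, shift_ratio: float = 0.25, diff_scales=(1, 2)):
        super().__init__()
        self.k = k
        self.shift_frac = int(dim * shift_ratio)
        self.node_proj = nn.Linear(dim, dim)
        self.diff_convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=s, dilation=s) for s in diff_scales])
        self.out_dim = dim

    def spatial_graph(self, x):
        # x: (B, N, D) patch tokens of one frame; connect every token to its k
        # nearest neighbours in feature space and aggregate their messages.
        idx = torch.cdist(x, x).topk(self.k, largest=False).indices           # (B, N, k)
        nbrs = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)))                 # (B, N, k, D)
        return x + self.node_proj(nbrs.mean(dim=2))                           # residual aggregation

    def forward(self, obs_feat, goal_feat):
        # obs_feat: (B, T, D, h, w) per-frame feature maps; goal_feat: (B, D, h, w)
        B, T, D = obs_feat.shape[:3]
        tokens = obs_feat.flatten(3).transpose(-1, -2)                        # (B, T, N, D)
        x = self.spatial_graph(tokens.reshape(B * T, -1, D)).reshape(B, T, -1, D)
        x = x.mean(dim=2)                                                     # (B, T, D) frame descriptors

        # hybrid temporal shift: one channel fraction borrows features from the
        # previous frame, another from the next frame, the rest stays in place
        f = self.shift_frac
        shifted = x.clone()
        shifted[:, 1:, :f] = x[:, :-1, :f]
        shifted[:, :-1, f:2 * f] = x[:, 1:, f:2 * f]

        # multi-resolution difference-aware convolution over temporal differences
        diff = (shifted[:, 1:] - shifted[:, :-1]).transpose(1, 2)             # (B, D, T-1)
        temporal = sum(conv(diff) for conv in self.diff_convs).mean(dim=-1)   # (B, D)

        # fuse the temporal summary with a pooled goal descriptor
        return temporal + goal_feat.flatten(2).mean(dim=-1)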

If this is right

  • Navigation agents achieve higher success rates when reaching specified visual goals from first-person views.
  • The encoder functions as a reusable visual backbone for multiple goal-conditioned control problems.
  • Action prediction and progress estimation become more accurate due to retained spatial and temporal details.
  • The approach reduces reliance on complex policy heads by improving the input representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graph-based spatial reasoning could help navigation in scenes with many distinct objects by explicitly modeling their relations.
  • This representation might transfer to other sequential vision tasks such as video-based prediction or manipulation planning.
  • Replacing standard pooling with the hybrid temporal component could improve sample efficiency during policy training.

Load-bearing premise

The proposed fusion module actually preserves fine-grained spatial and temporal structure better than standard encoders and temporal pooling, rather than performance gains coming from unrelated training details or architecture choices.

What would settle it

An ablation experiment that replaces the fusion module with a standard CNN encoder plus average temporal pooling and measures equivalent or higher navigation success rates in the same goal-reaching tasks.
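For concreteness, a minimal sketch of the baseline side of that ablation: a standard torchvision ResNet-18 applied per frame followed by temporal average pooling, dropped in where the proposed encoder-plus-fusion stack sits, with everything else held fixed. The backbone choice and names are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PooledCNNBaseline(nn.Module):
    # Ablation baseline sketch: per-frame ResNet features + temporal average pooling,
    # exposing the same (sequence, goal) -> vector interface as the proposed stack.
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.out_dim = 512

    def forward(self, obs_seq, goal_img):
        # obs_seq: (B, T, 3, H, W); goal_img: (B, 3, H, W)
        B, T = obs_seq.shape[:2]
        frame_feat = self.cnn(obs_seq.flatten(0, 1)).flatten(1)     # (B*T, 512)
        seq_feat = frame_feat.unflatten(0, (B, T)).mean(dim=1)      # temporal average pooling
        goal_feat = self.cnn(goal_img).flatten(1)                   # (B, 512)
        return seq_feat + goal_feat

Swapping only this component against the proposed module, with the policy head, seeds, and training schedule unchanged, is what would isolate the fusion module's contribution.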

Figures

Figures reproduced from arXiv: 2604.02829 by Hao Ren, Hui Cheng, Lu Qi, Yiming Zeng, Zetong Bi, Zhaoliang Wan.

Figure 1: t-SNE projections of feature embeddings colored …
Figure 2: Pipeline of the proposed model for action prediction: the model processes input observations and goal images through feature …
Figure 3: (a) A grid structure representing a partitioned image, …
Figure 4: Qualitative navigation trajectories (blue) produced by STRNet in 2D-3D-S and Citysim environments.
Figure 6: Schematic diagram of front-view projection visualization.
Original abstract

Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. Code is available at https://github.com/hren20/STRNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes STRNet, a visual navigation framework that extracts features from first-person image sequences and goal observations, then fuses them via a spatio-temporal fusion module. The module applies spatial graph reasoning per frame and models temporal dynamics with a hybrid temporal shift module plus multi-resolution difference-aware convolution; the authors claim this preserves fine-grained structure better than standard encoders and temporal pooling, yielding consistent performance gains and a generalizable backbone for goal-conditioned control. Code is released.

Significance. If the central claim holds under controlled evaluation, the work would supply a reusable spatio-temporal visual encoder for navigation that addresses a documented weakness in prior learning-based methods. The public code release is a concrete strength that supports reproducibility and follow-on use.

major comments (1)
  1. [Experiments] Experiments section: the claim that the spatio-temporal fusion module (spatial graph reasoning + hybrid temporal shift + multi-resolution difference-aware convolution) is responsible for the reported gains requires a controlled ablation that replaces only this module with a standard encoder (e.g., ResNet + temporal average pooling) while freezing all other hyperparameters, seeds, training details, and policy head. Without such an isolation experiment, attribution remains unproven and performance differences could arise from unrelated implementation choices.
minor comments (2)
  1. [Abstract] Abstract: quantitative metrics, baseline names, and ablation summaries are absent, which is atypical for a paper whose central claim rests on experimental improvement.
  2. [Method] Method: the hybrid temporal shift and multi-resolution difference-aware convolution would benefit from explicit equations or a compact algorithm box to clarify how they differ from standard temporal pooling and shift operations (one conventional formulation is sketched immediately below as an editorial aid).
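As an editorial aid (not the paper's equations), one conventional way to write the two operations for per-frame features $X_t \in \mathbb{R}^{C}$, shifted-channel fraction $\alpha$, and temporal resolutions $r \in \mathcal{R}$:

\tilde{X}_{t,c} =
\begin{cases}
X_{t-1,c}, & 0 < c \le \alpha C, \\
X_{t+1,c}, & \alpha C < c \le 2\alpha C, \\
X_{t,c},   & \text{otherwise,}
\end{cases}
\qquad
D^{(r)}_{t} = \tilde{X}_{t} - \tilde{X}_{t-r},
\qquad
Y_{t} = \sum_{r \in \mathcal{R}} \sum_{j=-k}^{k} W^{(r)}_{j}\, D^{(r)}_{t+jr}.

The first piece is the standard temporal shift (a fraction of channels borrowed from adjacent frames at zero parameter cost); the second convolves $r$-step feature differences with per-resolution kernels $W^{(r)}$ and sums across resolutions, which is what distinguishes it from plain temporal average pooling over $t$.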

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that the spatio-temporal fusion module (spatial graph reasoning + hybrid temporal shift + multi-resolution difference-aware convolution) is responsible for the reported gains requires a controlled ablation that replaces only this module with a standard encoder (e.g., ResNet + temporal average pooling) while freezing all other hyperparameters, seeds, training details, and policy head. Without such an isolation experiment, attribution remains unproven and performance differences could arise from unrelated implementation choices.

    Authors: We agree that a controlled ablation isolating only the spatio-temporal fusion module is required to rigorously attribute the reported gains. In the revised manuscript we will add this experiment: the proposed module will be replaced by a standard ResNet encoder followed by temporal average pooling while keeping every other element (hyperparameters, seeds, training details, policy head, and data pipeline) identical to the original configuration. The results will be reported alongside the existing ablations to demonstrate that the performance differences stem from the fusion module itself. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper presents STRNet as an independent neural architecture for visual navigation, consisting of feature extractors fused via a spatio-temporal module (spatial graph reasoning, hybrid temporal shift, multi-resolution difference-aware convolution). No equations, predictions, or claims reduce by construction to fitted parameters or self-referential definitions; the method is described as a new assembly of standard components and validated through external experiments against baselines. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the provided text. The central contribution remains an architectural proposal whose performance claims are tested separately rather than forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that richer spatio-temporal visual features directly improve navigation policy performance; no new physical entities or ad-hoc constants are introduced beyond standard neural-network hyperparameters.

axioms (1)
  • domain assumption: Better preservation of fine-grained spatial and temporal structure in visual features leads to improved action prediction and progress estimation in goal-conditioned navigation.
    Invoked in the motivation and in the description of the fusion module's purpose.

pith-pipeline@v0.9.0 · 5488 in / 1135 out tokens · 48008 ms · 2026-05-13T20:07:50.441916+00:00 · methodology
