pith. machine review for the scientific record.

arXiv: 2604.02829 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.RO

STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3

classification: 💻 cs.CV · cs.RO
keywords: visual navigation · spatio-temporal fusion · graph reasoning · temporal shift · goal-conditioned control · feature encoding · robotic vision

The pith

A new spatio-temporal fusion module uses graph reasoning per frame and hybrid temporal shifts to better preserve visual details for robot goal navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard feature encoders and temporal pooling in visual navigation agents discard fine-grained spatial and temporal structure from first-person image sequences, which limits accurate action prediction and progress estimation toward a specified goal image. It introduces a unified framework that extracts features from observation sequences and goal views, then fuses them through a module performing spatial graph reasoning inside each frame along with hybrid temporal modeling via shift operations and multi-resolution difference-aware convolutions. This richer representation supports goal-conditioned control policies. A sympathetic reader would care because retaining that structure could yield navigation agents that succeed more reliably without depending on elaborate policy heads or extra data. The reported experiments show consistent performance gains over baselines, framing the encoder as a reusable backbone.
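As a reading aid, here is a minimal PyTorch-style sketch of how such a framework could be wired, under assumed module and tensor names (the authors' released code is the authoritative reference): per-frame features are extracted from the observation sequence and the goal image, fused by the spatio-temporal module, and decoded into an action and a progress estimate.

import torch
import torch.nn as nn

class GoalConditionedNavigator(nn.Module):
    # Hypothetical wiring of the described pipeline; all names are illustrative.
    def __init__(self, encoder: nn.Module, fusion: nn.Module, action_dim: int = 2):
        super().__init__()
        self.encoder = encoder            # per-frame visual feature extractor
        self.fusion = fusion              # spatio-temporal fusion module (sketched further below)
        self.action_head = nn.Linear(fusion.out_dim, action_dim)
        self.progress_head = nn.Linear(fusion.out_dim, 1)

    def forward(self, obs_seq: torch.Tensor, goal_img: torch.Tensor):
        # obs_seq: (B, T, C, H, W) first-person frames; goal_img: (B, C, H, W)
        B, T = obs_seq.shape[:2]
        obs_feat = self.encoder(obs_seq.flatten(0, 1)).unflatten(0, (B, T))
        goal_feat = self.encoder(goal_img)
        fused = self.fusion(obs_feat, goal_feat)              # (B, fusion.out_dim)
        action = self.action_head(fused)                      # action prediction
        progress = torch.sigmoid(self.progress_head(fused))   # progress toward the goal
        return action, progress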

Core claim

The authors establish that their spatio-temporal fusion module, which performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution, extracts a richer representation from visual sequences and goal observations, leading to improved navigation performance and serving as a generalizable visual backbone for goal-conditioned control.

What carries the argument

Spatio-temporal fusion module that integrates spatial graph reasoning for intra-frame relations with hybrid temporal shift operations and multi-resolution difference-aware convolutions to capture dynamics across frames while fusing sequence and goal features.
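A hedged sketch of what such a block could look like, in the same PyTorch style as above; the k-nearest-neighbour graph construction, shift ratio, difference scales, and goal fusion are editorial assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    # Editorial sketch of the described block: intra-frame graph reasoning, a partial
    # temporal channel shift, and dilated convolutions over frame-to-frame differences.
    def __init__(self, dim: int, k: int = 8, shift_ratio: float = 0.25, diff_scales=(1, 2)):
        super().__init__()
        self.k = k
        self.shift_frac = int(dim * shift_ratio)
        self.node_proj = nn.Linear(dim, dim)
        self.diff_convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=s, dilation=s) for s in diff_scales])
        self.out_dim = dim

    def spatial_graph(self, x):
        # x: (B, N, D) patch tokens of one frame; connect every token to its k
        # nearest neighbours in feature space and aggregate their messages.
        idx = torch.cdist(x, x).topk(self.k, largest=False).indices           # (B, N, k)
        nbrs = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)))                 # (B, N, k, D)
        return x + self.node_proj(nbrs.mean(dim=2))                           # residual aggregation

    def forward(self, obs_feat, goal_feat):
        # obs_feat: (B, T, D, h, w) per-frame feature maps; goal_feat: (B, D, h, w)
        B, T, D = obs_feat.shape[:3]
        tokens = obs_feat.flatten(3).transpose(-1, -2)                        # (B, T, N, D)
        x = self.spatial_graph(tokens.reshape(B * T, -1, D)).reshape(B, T, -1, D)
        x = x.mean(dim=2)                                                     # (B, T, D) frame descriptors

        # hybrid temporal shift: one channel fraction borrows features from the
        # previous frame, another from the next frame, the rest stays in place
        f = self.shift_frac
        shifted = x.clone()
        shifted[:, 1:, :f] = x[:, :-1, :f]
        shifted[:, :-1, f:2 * f] = x[:, 1:, f:2 * f]

        # multi-resolution difference-aware convolution over temporal differences
        diff = (shifted[:, 1:] - shifted[:, :-1]).transpose(1, 2)             # (B, D, T-1)
        temporal = sum(conv(diff) for conv in self.diff_convs).mean(dim=-1)   # (B, D)

        # fuse the temporal summary with a pooled goal descriptor
        return temporal + goal_feat.flatten(2).mean(dim=-1)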

If this is right

  • Navigation agents achieve higher success rates when reaching specified visual goals from first-person views.
  • The encoder functions as a reusable visual backbone for multiple goal-conditioned control problems.
  • Action prediction and progress estimation become more accurate due to retained spatial and temporal details.
  • The approach reduces reliance on complex policy heads by improving the input representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graph-based spatial reasoning could help navigation in scenes with many distinct objects by explicitly modeling their relations.
  • This representation might transfer to other sequential vision tasks such as video-based prediction or manipulation planning.
  • Replacing standard pooling with the hybrid temporal component could improve sample efficiency during policy training.

Load-bearing premise

The proposed fusion module actually preserves fine-grained spatial and temporal structure better than standard encoders and temporal pooling, rather than performance gains coming from unrelated training details or architecture choices.

What would settle it

An ablation experiment that replaces the fusion module with a standard CNN encoder plus average temporal pooling and measures equivalent or higher navigation success rates in the same goal-reaching tasks.
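For concreteness, a minimal sketch of the baseline side of that ablation: a standard torchvision ResNet-18 applied per frame followed by temporal average pooling, dropped in where the proposed encoder-plus-fusion stack sits, with everything else held fixed. The backbone choice and names are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PooledCNNBaseline(nn.Module):
    # Ablation baseline sketch: per-frame ResNet features + temporal average pooling,
    # exposing the same (sequence, goal) -> vector interface as the proposed stack.
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.out_dim = 512

    def forward(self, obs_seq, goal_img):
        # obs_seq: (B, T, 3, H, W); goal_img: (B, 3, H, W)
        B, T = obs_seq.shape[:2]
        frame_feat = self.cnn(obs_seq.flatten(0, 1)).flatten(1)     # (B*T, 512)
        seq_feat = frame_feat.unflatten(0, (B, T)).mean(dim=1)      # temporal average pooling
        goal_feat = self.cnn(goal_img).flatten(1)                   # (B, 512)
        return seq_feat + goal_feat

Swapping only this component against the proposed module, with the policy head, seeds, and training schedule unchanged, is what would isolate the fusion module's contribution.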

Figures

Figures reproduced from arXiv: 2604.02829 by Hao Ren, Hui Cheng, Lu Qi, Yiming Zeng, Zetong Bi, Zhaoliang Wan.

Figure 1: t-SNE projections of feature embeddings colored …
Figure 2: Pipeline of the proposed model for action prediction: the model processes input observations and goal images through feature …
Figure 3: (a) A grid structure representing a partitioned image, …
Figure 4: Qualitative navigation trajectories (blue) produced by STRNet in 2D-3D-S and Citysim environments.
Figure 6: Schematic diagram of front-view projection visualization.
Original abstract

Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. Code is available at https://github.com/hren20/STRNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes STRNet, a visual navigation framework that extracts features from first-person image sequences and goal observations, then fuses them via a spatio-temporal fusion module. The module applies spatial graph reasoning per frame and models temporal dynamics with a hybrid temporal shift module plus multi-resolution difference-aware convolution; the authors claim this preserves fine-grained structure better than standard encoders and temporal pooling, yielding consistent performance gains and a generalizable backbone for goal-conditioned control. Code is released.

Significance. If the central claim holds under controlled evaluation, the work would supply a reusable spatio-temporal visual encoder for navigation that addresses a documented weakness in prior learning-based methods. The public code release is a concrete strength that supports reproducibility and follow-on use.

major comments (1)
  1. [Experiments] Experiments section: the claim that the spatio-temporal fusion module (spatial graph reasoning + hybrid temporal shift + multi-resolution difference-aware convolution) is responsible for the reported gains requires a controlled ablation that replaces only this module with a standard encoder (e.g., ResNet + temporal average pooling) while freezing all other hyperparameters, seeds, training details, and policy head. Without such an isolation experiment, attribution remains unproven and performance differences could arise from unrelated implementation choices.
minor comments (2)
  1. [Abstract] Abstract: quantitative metrics, baseline names, and ablation summaries are absent, which is atypical for a paper whose central claim rests on experimental improvement.
  2. [Method] Method: the hybrid temporal shift and multi-resolution difference-aware convolution would benefit from explicit equations or a compact algorithm box to clarify how they differ from standard temporal pooling and shift operations (one conventional formulation is sketched immediately below as an editorial aid).
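As an editorial aid (not the paper's equations), one conventional way to write the two operations for per-frame features $X_t \in \mathbb{R}^{C}$, shifted-channel fraction $\alpha$, and temporal resolutions $r \in \mathcal{R}$:

\tilde{X}_{t,c} =
\begin{cases}
X_{t-1,c}, & 0 < c \le \alpha C, \\
X_{t+1,c}, & \alpha C < c \le 2\alpha C, \\
X_{t,c},   & \text{otherwise,}
\end{cases}
\qquad
D^{(r)}_{t} = \tilde{X}_{t} - \tilde{X}_{t-r},
\qquad
Y_{t} = \sum_{r \in \mathcal{R}} \sum_{j=-k}^{k} W^{(r)}_{j}\, D^{(r)}_{t+jr}.

The first piece is the standard temporal shift (a fraction of channels borrowed from adjacent frames at zero parameter cost); the second convolves $r$-step feature differences with per-resolution kernels $W^{(r)}$ and sums across resolutions, which is what distinguishes it from plain temporal average pooling over $t$.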

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that the spatio-temporal fusion module (spatial graph reasoning + hybrid temporal shift + multi-resolution difference-aware convolution) is responsible for the reported gains requires a controlled ablation that replaces only this module with a standard encoder (e.g., ResNet + temporal average pooling) while freezing all other hyperparameters, seeds, training details, and policy head. Without such an isolation experiment, attribution remains unproven and performance differences could arise from unrelated implementation choices.

    Authors: We agree that a controlled ablation isolating only the spatio-temporal fusion module is required to rigorously attribute the reported gains. In the revised manuscript we will add this experiment: the proposed module will be replaced by a standard ResNet encoder followed by temporal average pooling while keeping every other element (hyperparameters, seeds, training details, policy head, and data pipeline) identical to the original configuration. The results will be reported alongside the existing ablations to demonstrate that the performance differences stem from the fusion module itself. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper presents STRNet as an independent neural architecture for visual navigation, consisting of feature extractors fused via a spatio-temporal module (spatial graph reasoning, hybrid temporal shift, multi-resolution difference-aware convolution). No equations, predictions, or claims reduce by construction to fitted parameters or self-referential definitions; the method is described as a new assembly of standard components and validated through external experiments against baselines. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the provided text. The central contribution remains an architectural proposal whose performance claims are tested separately rather than forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that richer spatio-temporal visual features directly improve navigation policy performance; no new physical entities or ad-hoc constants are introduced beyond standard neural-network hyperparameters.

axioms (1)
  • domain assumption: Better preservation of fine-grained spatial and temporal structure in visual features leads to improved action prediction and progress estimation in goal-conditioned navigation.
    Invoked in the motivation and in the description of the fusion module's purpose.

pith-pipeline@v0.9.0 · 5488 in / 1135 out tokens · 48008 ms · 2026-05-13T20:07:50.441916+00:00 · methodology
