pith. machine review for the scientific record.

arxiv: 2604.24086 · v1 · submitted 2026-04-27 · 💻 cs.RO · cs.AI

Recognition: unknown

AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:14 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords asynchronous navigation · vision-language-action · cloud robotics · edge adaptation · reinforcement learning · safety constraints · kinematic mapping · plug-and-play

The pith

AsyncShield uses a pose buffer and kinematic mapping to compensate for latency in cloud VLA robot navigation, then balances the restored intent against real-time safety constraints without any model changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a way to deploy large vision-language-action models from the cloud for continuous robot movement even when network delays would otherwise make the commands outdated and dangerous. It replaces predictive guessing with a direct physical calculation that shifts past commands into the robot's current position using stored poses and motion rules. An additional layer treats safety as hard limits and uses reinforcement learning to decide how closely to follow the corrected commands while avoiding obstacles detected by sensors. The whole system attaches without retraining the original models and uses standard interfaces so it works across different setups. If the approach holds, it opens the door to keeping heavy AI computation off the robot while still achieving reliable navigation.
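
The realignment step is simple enough to sketch. Below is a minimal illustration, assuming planar SE(2) poses and a nearest-timestamp pose buffer; the names and the buffer interface are illustrative guesses, not the paper's implementation.

```python
import math
from bisect import bisect_left

def se2_inverse(pose):
    """Invert an SE(2) pose given as (x, y, theta)."""
    x, y, th = pose
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), s * x - c * y, -th)

def se2_compose(a, b):
    """Compose two SE(2) poses: result = a then b."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, ath + bth)

class PoseBuffer:
    """Time-indexed buffer of odometry poses: timestamp -> world-frame (x, y, theta)."""
    def __init__(self):
        self.stamps, self.poses = [], []

    def insert(self, stamp, pose):
        self.stamps.append(stamp)
        self.poses.append(pose)

    def lookup(self, stamp):
        # Approximate nearest-timestamp lookup; a real system would interpolate.
        i = max(0, min(bisect_left(self.stamps, stamp), len(self.stamps) - 1))
        return self.poses[i]

def realign_waypoint(buf, stale_waypoint, cmd_stamp, now_stamp):
    """Re-express a waypoint issued in the ego frame at cmd_stamp in the ego
    frame at now_stamp, cancelling the motion made while the cloud inference
    was in flight."""
    pose_then = buf.lookup(cmd_stamp)   # world <- ego(t - delay)
    pose_now = buf.lookup(now_stamp)    # world <- ego(now)
    delta = se2_compose(se2_inverse(pose_now), pose_then)  # ego(now) <- ego(then)
    return se2_compose(delta, stale_waypoint)
```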

Core claim

AsyncShield maintains a temporal pose buffer and applies kinematic transformations to convert the time lag of cloud inferences into spatial offsets, thereby restoring the original geometric intent of the VLA model. This intent is then tracked by a reinforcement learning adapter that solves a constrained Markov decision process to balance fidelity to the VLA output against hard LiDAR-based collision avoidance constraints, using the PPO-Lagrangian algorithm. The framework uses a universal sub-goal interface, domain randomization, and collision radius inflation for perception adaptation, enabling plug-and-play operation without fine-tuning the cloud models. Experiments confirm improved success rates and physical safety in asynchronous navigation.
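
To make the CMDP trade-off concrete, here is a minimal sketch of the Lagrangian relaxation inside a PPO-Lagrangian loop, assuming a scalar per-step collision cost and an episodic cost budget; the reward and cost definitions are illustrative, not taken from the paper.

```python
import numpy as np

def lagrangian_rewards(track_rewards, safety_costs, lam):
    """Reward actually optimized by the PPO step: r_t - lambda * c_t, where r_t
    rewards tracking the realigned VLA sub-goal and c_t penalizes violating the
    LiDAR clearance constraint."""
    return np.asarray(track_rewards, dtype=float) - lam * np.asarray(safety_costs, dtype=float)

def dual_update(lam, episode_cost, cost_limit, lr=0.05):
    """Dual ascent on the Lagrange multiplier: lambda grows while the episodic
    safety cost exceeds its budget, shrinks otherwise, and stays non-negative."""
    return max(0.0, lam + lr * (episode_cost - cost_limit))

# Illustrative per-step signals (not the paper's exact definitions):
def tracking_reward(robot_xy, subgoal_xy):
    return -float(np.linalg.norm(np.asarray(robot_xy) - np.asarray(subgoal_xy)))

def safety_cost(min_lidar_range_m, margin_m=0.35):
    return 1.0 if min_lidar_range_m < margin_m else 0.0
```

Alternating PPO updates on the shaped reward with this dual ascent pushes the expected safety cost toward its budget while otherwise maximizing fidelity to the VLA intent.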

What carries the argument

Temporal pose buffer with kinematic transformations that convert time lags into spatial pose offsets, combined with a CMDP solved by PPO-Lagrangian to trade off intent tracking against obstacle avoidance.

Load-bearing premise

The pose buffer and kinematic transformations will recover the exact original VLA intent without adding new errors from sensor noise, odometry drift, or unmodeled robot dynamics.

What would settle it

A controlled trial in which the robot follows the buffer-corrected commands but still collides or deviates from the intended path at the same rate as the uncorrected baseline under realistic sensor noise and drift.

Figures

Figures reproduced from arXiv: 2604.24086 by Kai Yang, Mu Xu, Shichao Xie, Xiaolong Wu, Xing Li, Yanfen Shen, Yingnan Guo, Zedong Chu, Zhengbo Wang.

Figure 1
Figure 1: Overview of the AsyncShield framework. Top: Behavioral comparison under network degradation. Naive Execution blindly follows stale intents, leading to collisions, whereas AsyncShield safely bypasses obstacles. Bottom: The edge adaptation pipeline. The system first utilizes a temporal pose buffer to perform spatio-temporal intent realignment on delayed cloud waypoints. Subsequently, a policy optimized via P…
Figure 2
Figure 2: Qualitative comparison of executed trajectories under the Mixed Degradation network condition. We visualize the actual executed robot trajectories (blue solid lines) and the realigned/stale VLA intents (red thin lines) across four highly challenging dynamic scenarios. RTC generates extremely smooth curves but blindly guides the robot to crash into dynamic obstacles. A2C2 exhibits severe Intent Deviation, c…
Original abstract

While Vision-Language-Action (VLA) models have been demonstrated possessing strong zero-shot generalization for robot control, their massive parameter sizes typically necessitate cloud-based deployment. However, cloud deployment introduces network jitter and inference latency, which can induce severe spatiotemporal misalignment in mobile navigation under continuous displacement, so that the stale intents expressed in past ego frames may become spatially incorrect in the current frame and lead to collisions. To address this issue, we propose AsyncShield, a plug-and-play asynchronous control framework. AsyncShield discards traditional black-box time-series prediction in favor of a deterministic physical white-box spatial mapping. By maintaining a temporal pose buffer and utilizing kinematic transformations, the system accurately converts temporal lag into spatial pose offsets to restore the VLA's original geometric intent. To balance intent restoration fidelity and physical safety, the edge adaptation is formulated as a constrained Markov decision process (CMDP). Solved via the PPO-Lagrangian algorithm, a reinforcement learning adapter dynamically trades off between tracking the VLA intent and responding to high-frequency LiDAR obstacle avoidance hard constraints. Furthermore, benefiting from a standardized universal sub-goal interface, domain randomization, and perception-level adaptation via Collision Radius Inflation, AsyncShield operates as a lightweight, plug-and-play module. Simulation and real-world experiments demonstrate that, without fine-tuning any cloud-based foundation models, the framework exhibits zero-shot and robust generalization capabilities, effectively improving the success rate and physical safety of asynchronous navigation.
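
Collision Radius Inflation, as named in the abstract, amounts to checking LiDAR returns against an inflated robot footprint so that small realignment or perception errors still leave a margin. A minimal sketch, assuming a 2D point cloud in the robot frame; the radius and margin values are illustrative.

```python
import math

def inflated_collision_check(lidar_points_xy, base_radius_m, inflation_m):
    """Flag a potential collision if any LiDAR return (x, y), expressed in the
    robot frame, falls inside the footprint radius plus a safety margin."""
    limit = base_radius_m + inflation_m
    return any(math.hypot(x, y) < limit for x, y in lidar_points_xy)

# A return 0.41 m away clears a 0.30 m footprint but not the inflated one.
points = [(0.40, 0.10), (1.20, -0.40)]
print(inflated_collision_check(points, base_radius_m=0.30, inflation_m=0.00))  # False
print(inflated_collision_check(points, base_radius_m=0.30, inflation_m=0.15))  # True
```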

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AsyncShield, a plug-and-play edge adapter for asynchronous cloud-based Vision-Language-Action (VLA) navigation. It maintains a temporal pose buffer and applies deterministic kinematic transformations to convert network latency into spatial pose offsets, thereby restoring the VLA's original geometric intent without fine-tuning the foundation model. The edge adaptation is cast as a Constrained Markov Decision Process (CMDP) solved via PPO-Lagrangian to trade off VLA sub-goal tracking against high-frequency LiDAR hard constraints, augmented by Collision Radius Inflation and domain randomization for plug-and-play operation. Simulation and real-world experiments are claimed to demonstrate zero-shot generalization with improved success rates and physical safety.

Significance. If the robustness claims hold, the work offers a practical, lightweight solution to a common deployment barrier for large VLA models in mobile robotics. The white-box physical mapping and standardized sub-goal interface are strengths that avoid black-box time-series predictors and enable zero-shot use across domains. The CMDP formulation for safety-aware adaptation is a reasonable engineering choice and could influence future edge-cloud robotics systems.

major comments (2)
  1. [Method section (temporal pose buffer and kinematic transformations)] The core restoration mechanism (temporal pose buffer plus kinematic transformations) is presented as accurately recovering VLA intent. However, this mapping assumes noise-free poses and perfect kinematics; the subsequent CMDP/PPO-Lagrangian adapter and Collision Radius Inflation address only downstream safety and do not correct upstream errors from odometry drift or sensor noise. This assumption is load-bearing for the zero-shot success-rate and safety claims yet receives no sensitivity analysis or noisy-condition experiments.
  2. [Abstract and Experiments section] The abstract and results claim quantitative improvements in success rate and physical safety, but the provided description supplies no numerical values, baselines, statistical significance, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed, weakening evaluation of the central experimental claim.
minor comments (1)
  1. [Abstract] The abstract states that the framework 'operates as a lightweight, plug-and-play module' but does not specify the exact interface contract or computational overhead on the edge device.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's practical value. We address each major comment below with clarifications and indicate the revisions made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method section (temporal pose buffer and kinematic transformations)] The core restoration mechanism (temporal pose buffer plus kinematic transformations) is presented as accurately recovering VLA intent. However, this mapping assumes noise-free poses and perfect kinematics; the subsequent CMDP/PPO-Lagrangian adapter and Collision Radius Inflation address only downstream safety and do not correct upstream errors from odometry drift or sensor noise. This assumption is load-bearing for the zero-shot success-rate and safety claims yet receives no sensitivity analysis or noisy-condition experiments.

    Authors: We agree that the deterministic kinematic mapping assumes accurate pose inputs for exact geometric restoration. In practice, the CMDP formulation prioritizes LiDAR-based hard safety constraints over strict intent tracking, and Collision Radius Inflation adds a conservative buffer against small upstream errors. Real-world experiments already incorporate physical odometry drift, sensor noise, and imperfect kinematics, providing evidence of robustness under realistic conditions. We have added a dedicated sensitivity analysis subsection to the Experiments section, including results with injected pose noise and drift (see the sketch after these responses), to explicitly quantify the impact on success rate and safety. revision: yes

  2. Referee: [Abstract and Experiments section] The abstract and results claim quantitative improvements in success rate and physical safety, but the provided description supplies no numerical values, baselines, statistical significance, or error analysis. Without these, the magnitude and reliability of the reported gains cannot be assessed, weakening evaluation of the central experimental claim.

    Authors: The abstract is intentionally high-level for brevity. The Experiments section provides the requested details, including specific success-rate improvements, baseline comparisons (e.g., against direct VLA deployment and time-series predictors), statistical significance testing across repeated trials, and error analysis. We have revised the abstract to include representative numerical values and key metrics for immediate clarity while preserving conciseness. revision: yes
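
The sensitivity analysis promised in the first response could be driven by a perturbation like the one sketched below, which injects Gaussian pose noise and a drift term into buffered poses before realignment; the noise magnitudes are illustrative, not the paper's.

```python
import random

def inject_pose_noise(pose, sigma_xy_m=0.02, sigma_theta_rad=0.01,
                      drift_m_per_s=0.005, elapsed_s=0.0):
    """Perturb a buffered (x, y, theta) pose with zero-mean Gaussian noise plus a
    slowly accumulating drift term along x. Feeding perturbed poses into the
    pose-buffer realignment and comparing the resulting waypoints against the
    noise-free case quantifies how much the restored intent degrades."""
    x, y, th = pose
    drift = drift_m_per_s * elapsed_s
    return (
        x + random.gauss(0.0, sigma_xy_m) + drift,
        y + random.gauss(0.0, sigma_xy_m),
        th + random.gauss(0.0, sigma_theta_rad),
    )
```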

Circularity Check

0 steps flagged

No significant circularity; derivation uses standard kinematics and off-the-shelf RL

Full rationale

The claimed chain converts temporal lag to spatial offsets via a temporal pose buffer and deterministic kinematic transformations (standard rigid-body math), then applies a CMDP solved by PPO-Lagrangian to trade off intent tracking against LiDAR constraints. Neither step is defined in terms of the target success/safety metrics, nor does any prediction reduce to a fitted parameter that encodes the outcome. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the load-bearing steps. The zero-shot generalization claim rests on the plug-and-play interface and domain randomization, which are independent of the core mapping.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The CMDP formulation and PPO-Lagrangian solver are standard but may involve implicit tuning parameters not specified here.

pith-pipeline@v0.9.0 · 5580 in / 1061 out tokens · 33980 ms · 2026-05-08T03:14:12.969610+00:00 · methodology

