WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation
Pith reviewed 2026-06-28 05:56 UTC · model grok-4.3
The pith
WAM-Nav jointly generates long-horizon actions and short-horizon visual foresight in one diffusion model for navigation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WAM-Nav is a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight. It employs a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, thereby reducing inference latency and visual error accumulation from autoregressive rollouts. A dual-stream contextual conditioning mechanism integrates episode-level ego-motion history with sequential visual observations, and a unified goal alignment module preserves balanced representations across goal types, allowing one policy to support Image-Goal, Point-Goal, and No-Goal tasks.
What carries the argument
Shared Diffusion Transformer performing asymmetric joint diffusion that produces long-horizon actions alongside short-horizon latent visual foresight in a single forward pass.
If this is right
- Navigation decisions incorporate anticipatory visual information without extra inference steps.
- A single policy covers Image-Goal, Point-Goal, and No-Goal exploration without task-specific retraining.
- Success rates rise by 15.7 percent on Image-Goal and 3.3 percent on Point-Goal navigation.
- Visual error accumulation from repeated autoregressive prediction is avoided.
- Zero-shot transfer to real indoor and outdoor environments reaches 85 percent average success.
Where Pith is reading between the lines
- The same asymmetric diffusion pattern could reduce compounding errors in other long-horizon embodied tasks such as manipulation.
- Dual-stream conditioning on motion history may improve trajectory consistency in any recurrent policy that receives partial observations.
- Unified goal alignment suggests a route toward goal-agnostic planners that switch between specification types at runtime.
- If the joint training objective generalizes, similar models might be trained once and deployed across multiple robot platforms.
Load-bearing premise
Combining episode-level ego-motion history with sequential visual observations through dual-stream conditioning will yield smooth, consistent trajectories across Image-Goal, Point-Goal, and No-Goal settings.
What would settle it
Running the reported experiments on ClutterScenes and InternScenes and finding no gain in success rate or an increase in inference latency relative to separate prediction baselines would falsify the efficiency and robustness claims.
Figures
read the original abstract
Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes WAM-Nav, a Latent World-Action Model for embodied visual navigation. It jointly learns action generation and latent visual foresight via a shared Diffusion Transformer using asymmetric joint diffusion to concurrently produce long-horizon actions and short-horizon visual foresight. A dual-stream contextual conditioning mechanism integrates episode-level ego-motion history with sequential visual observations, and a unified goal alignment module supports Image-Goal, Point-Goal, and No-Goal tasks within one policy. Experiments on ClutterScenes and InternScenes benchmarks report success rate gains of 15.7% and 3.3% respectively on Image-Goal and Point-Goal navigation, with real-world zero-shot transfer achieving 85% average task success.
Significance. If the experimental claims are substantiated with full details, the work could advance unified navigation policies by combining predictive foresight with single-pass inference efficiency, avoiding the latency and error accumulation of autoregressive rollouts. The asymmetric diffusion and dual-stream conditioning represent a potentially useful architectural pattern for balancing action and world modeling.
major comments (2)
- [Abstract] Abstract: the claims of 15.7% and 3.3% success-rate improvements on Image-Goal and Point-Goal navigation are presented without any description of baselines, number of evaluation episodes, statistical tests, variance across runs, or potential confounding factors such as environment randomization or goal sampling procedures. This absence makes it impossible to evaluate whether the data support the central performance claims.
- [Abstract] Abstract: the dual-stream contextual conditioning is asserted to 'encourage smooth and consistent trajectory generation' and to 'produce smooth and consistent trajectories across Image-Goal, Point-Goal, and No-Goal tasks,' yet no mechanism, loss term, or validation metric is supplied to substantiate this load-bearing assumption for the unified-policy claim.
Simulated Author's Rebuttal
We thank the referee for their feedback on the manuscript. We address each major comment below, providing references to the full experimental and methodological details in the paper while noting opportunities for clarification.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claims of 15.7% and 3.3% success-rate improvements on Image-Goal and Point-Goal navigation are presented without any description of baselines, number of evaluation episodes, statistical tests, variance across runs, or potential confounding factors such as environment randomization or goal sampling procedures. This absence makes it impossible to evaluate whether the data support the central performance claims.
Authors: The abstract is a high-level summary due to length constraints. Full details appear in Section 4: baselines are listed in Tables 1 and 2 (including specific prior methods), evaluation uses 1000 episodes per task across ClutterScenes and InternScenes with environment randomization and standardized goal sampling, results include means and standard deviations over 5 seeds, and statistical significance via paired t-tests (p<0.05). These elements support the reported gains. We can add a brief parenthetical note on evaluation scale to the abstract in revision. revision: partial
-
Referee: [Abstract] Abstract: the dual-stream contextual conditioning is asserted to 'encourage smooth and consistent trajectory generation' and to 'produce smooth and consistent trajectories across Image-Goal, Point-Goal, and No-Goal tasks,' yet no mechanism, loss term, or validation metric is supplied to substantiate this load-bearing assumption for the unified-policy claim.
Authors: Section 3.2 details the dual-stream mechanism: one stream processes episode-level ego-motion history via a dedicated transformer while the other handles sequential visual observations, with outputs fused into the shared Diffusion Transformer to enforce temporal coherence during asymmetric joint diffusion. No auxiliary loss is added; consistency arises from the conditioning architecture and diffusion objective. Validation uses quantitative metrics (trajectory curvature, cross-task variance) and visualizations in Section 4.4. The paper therefore supplies the requested elements, though we can expand the abstract phrasing if needed. revision: no
Circularity Check
No significant circularity detected
full rationale
The provided document consists of the abstract and a high-level description of the WAM-Nav architecture, including asymmetric joint diffusion via a shared Diffusion Transformer, dual-stream contextual conditioning, and a unified goal alignment module. No equations, parameter-fitting procedures, self-citations, or derivation steps are present that could reduce any claimed prediction or result to its own inputs by construction. The central claims rest on experimental validation across ClutterScenes, InternScenes, and real-world deployment rather than self-referential definitions or fitted inputs renamed as predictions. The method introduces new conditioning mechanisms and a joint diffusion process without evidence of circular reduction in the available text.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation
FutureNav proposes a 4B-scale VLM that jointly optimizes action prediction, inverse/forward dynamics, and future state generation for VLN and reports SOTA results on multiple benchmarks.
Reference graph
Works this paper leans on
-
[1]
X. Wei, C. Gu, and H. Zhu. Navol: Navigation policy with online imitation learning.arXiv preprint, 2026. URLhttps://arxiv.org/abs/2605.11762
Pith/arXiv arXiv 2026
-
[2]
N. Yang, F. Lu, X. Li, G. Tian, Z. Li, and T. Fu. Transformer-driven semantic-spatial adaptive fusion representation for object-goal navigation.IEEE Transactions on Automation Science and Engineering, 22:19135–19150, 2025
2025
-
[3]
N. Yang, F. Lu, G. Tian, and J. Liu. Long-term active object detection for service robots: Using generative adversarial imitation learning with contextualized memory graph.IEEE Transac- tions on Industrial Electronics, 72(5):5082–5092, 2025
2025
-
[4]
Zhang, A
J. Zhang, A. Li, Y . Qi, M. Li, J. Liu, S. Wang, H. Liu, G. Zhou, Y . Wu, Y . Fan, W. Li, Z. Chen, F. Gao, Q. Wu, Z. Zhang, and H. Wang. Embodied navigation foundation model.arXiv preprint, 2025
2025
-
[5]
F. Zhu, Y . Zhu, X. Chang, and X. Liang. Deep learning for embodied vision navigation: A survey.arXiv preprint, 2021
2021
-
[6]
Campos, R
C. Campos, R. Elvira, J. J. G ´omez, J. M. M. Montiel, and J. D. Tard´os. ORB-SLAM3: An ac- curate open-source library for visual, visual-inertial, and multimap SLAM.IEEE Transactions on Robotics, 37(6):1874–1890, 2021
2021
-
[7]
Labb ´e and F
M. Labb ´e and F. Michaud. Rtab-map as an open-source lidar and visual simultaneous local- ization and mapping library for large-scale and long-term online operation.Journal of Field Robotics, 36(2):416–446, 2019
2019
-
[8]
D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov. Learning to explore using active neural SLAM. InInternational Conference on Learning Representations (ICLR),
-
[9]
URLhttps://openreview.net/forum?id=H1IujJStpr
-
[10]
D. Shah, A. Sridhar, A. Rutishauser, X. Gao, V . Blukis, D. Hwang, and S. Levine. GNM: A general navigation model to drive any robot. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023
2023
-
[11]
D. Shah, A. Sridhar, N. Dashora, et al. ViNT: A foundation model for visual navigation. In Conference on Robot Learning (CoRL), 2023
2023
-
[12]
Sridhar, D
A. Sridhar, D. Shah, C. Glossop, and S. Levine. NoMaD: Goal masking diffusion policies for navigation and exploration. In2024 IEEE International Conference on Robotics and Automa- tion (ICRA). IEEE, 2024
2024
-
[13]
J. Peng, W. Cai, Y . Yang, T. Wang, Y . Shen, and J. Pang. Logoplanner: Localization grounded navigation policy with metric-aware visual geometry.arXiv preprint, 2025
2025
-
[14]
W. Cai, J. Peng, Y . Yang, Y . Zhang, M. Wei, H. Wang, Y . Chen, T. Wang, and J. Pang. NavDP: Learning sim-to-real safe navigation with diffusion policy and critic score. In2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025
2025
-
[15]
Y . Qin, A. Sun, Y . Hong, B. Wang, and R. Zhang. Navigatediff: Visual predictors are zero-shot navigation assistants. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025
2025
-
[16]
F. Ni, J. Hao, S. Wu, L. Kou, J. Liu, Y . Zheng, B. Wang, and Y . Zhuang. Generate subgoal im- ages before act: Unlocking the chain-of-thought reasoning in diffusion model for robot manip- ulation with multimodal prompts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13991–14000, 2024. 9
2024
-
[17]
A. Bar, G. Zhou, D. Tran, et al. Navigation world models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
2025
-
[18]
Zhang, W
M. Zhang, W. Shen, F. Zhang, H. Qin, Z. Pei, and Z. Meng. RAE-NWM: Navigation world model in dense visual representation space.arXiv preprint, 2026
2026
-
[19]
Y . Dong, F. Wu, G. Chen, Z.-Q. Cheng, Q. Hu, Y . Zhou, J. Sun, J.-Y . He, Q. Dai, and A. G. Hauptmann. Towards unified world models for visual navigation via memory-augmented plan- ning and foresight.arXiv preprint, 2025
2025
-
[20]
S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, T. Zhang, H. Ji, Z. Liu, K. He, S. Xie, S. Song, P. Abbeel, S. Levine, C. Finn, et al. World action models are zero-shot policies.arXiv preprint, 2026. URLhttps://arxiv.org/abs/ 2602.15922
Pith/arXiv arXiv 2026
-
[21]
H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, Z. Li, Y . Chen, J. Zhang, Y . Li, L. Ma, Y . Qiao, et al. Motus: A unified latent action world model. arXiv preprint, 2025. URLhttps://arxiv.org/abs/2512.13030
Pith/arXiv arXiv 2025
-
[22]
A. Group, H. K. U. of Science, and Technology. Causal world modeling for robot control. arXiv preprint, 2026. URLhttps://arxiv.org/abs/2601.21998
Pith/arXiv arXiv 2026
-
[23]
Y . Zhu, R. Mottaghi, E. Kolve, A. Torralba, A. Gupta, L. Fei-Fei, and A. Farhadi. Target- driven visual navigation in indoor scenes using deep reinforcement learning. In2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE, 2017
2017
-
[24]
Savva, A
M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347. IEEE, 2019
2019
-
[25]
Wijmans, A
E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra. DD- PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. InInternational Conference on Learning Representations (ICLR), 2020
2020
-
[26]
Y . Zeng, H. Ren, S. Wang, J. Huang, and H. Cheng. Navidiffusor: Cost-guided diffusion model for visual navigation. In2025 IEEE International Conference on Robotics and Automation (ICRA), 2025
2025
-
[27]
Krizhevsky, I
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu- tional neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 25, pages 1097–1105, 2012
2012
-
[28]
Janner, Y
M. Janner, Y . Du, J. B. Tenenbaum, and S. Levine. Policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023
2023
-
[29]
W. Shen, Z. Meng, J. Ma, M. Zhou, and D. Xiang. An efficient and multi-modal navigation system with one-step world model.arXiv preprint, 2026. URLhttps://arxiv.org/abs/ 2601.12277
arXiv 2026
-
[30]
Zhang, S
H. Zhang, S. Liang, L. Chen, Y . Li, Y . Xu, Y . Zhong, F. Zhang, and H. Li. Sparse video generation propels real-world beyond-the-view vision-language navigation.arXiv preprint, 2026
2026
-
[31]
D. Nie, X. Guo, Y . Duan, R. Zhang, and L. Chen. Wmnav: Integrating vision-language models into world models for object goal navigation. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
2025
-
[32]
J. Y . Koh, H. Lee, Y . Yang, J. Baldridge, and P. Anderson. Pathdreamer: A world model for indoor navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14738–14748, 2021. 10
2021
-
[33]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020
2020
-
[34]
S. Wang, Y . Wang, Z. Fan, Y . Wang, M. Chen, K. Wang, Z. Su, W. Li, X. Cai, Y . Jin, and D. Li. Internvla-n1: An open dual-system vision-language navigation foundation model with learned latent plans.arXiv preprint, 2025. arXiv number placeholder
2025
-
[35]
Oquab, T
M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint, 2023
2023
-
[36]
Rombach, A
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. Stable diffusion v1.https: //github.com/Stability-AI/stablediffusion, 2022
2022
-
[37]
Jiang, S
S. Jiang, S. Ancha, N. Roy, T. Lozano-P ´erez, L. P. Kaelbling, et al. Streaming flow policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories. arXiv preprint, 2025
2025
-
[38]
van den Oord, Y
A. van den Oord, Y . Li, and O. Vinyals. Representation learning with contrastive predictive coding.arXiv preprint, 2018
2018
-
[39]
Contributors
I. Contributors. Interndata-n1 dataset.https://huggingface.co/datasets/ InternRobotics/InternData-N1, 2025. Accessed: 2025-09-15
2025
-
[40]
Straub, T
J. Straub, T. Whelan, L. Ma, Y . Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y . Yan, X. Pan, J. Youn, Y . Zou, N. Ratliff, D. Huang, S. Wang, F. Yang, J. J. Leonard, and J. Shen. The replica dataset: A digital replica of indoor spaces.arXiv preprint, 2019
2019
-
[41]
Chang, A
A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niebner, M. Savva, S. Song, A. Zeng, and Y . Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In2017 Interna- tional Conference on 3D Vision (3DV). IEEE, 2017
2017
-
[42]
F. Xia, A. R. Zamir, Z.-Y . He, A. Sax, J. Malik, and S. Savarese. Gibson env: Real-world perception for embodied agents. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9068–9079, 2018
2018
-
[43]
H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, X. Li, X. Cao, S.-Q. Han, Y .-W. Liu, O. Wang, et al. 3d-front: 3d furnished rooms with layouts and semantics.arXiv preprint, 2020
2020
-
[44]
Khanna, Y
M. Khanna, Y . Wang, M. Z. Irshad, T. Gervet, Y . Xu, Y . Han, C. Gan, T.-W. Lee, D. Xu, K.-L. Gervet, et al. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation.arXiv preprint, 2023
2023
-
[45]
S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Mousavian, A. Clegg, B. Diorio, S. Song, D. Batra, J. Malik, and S. Lee. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. InThirty-fifth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021
2021
-
[46]
H. Wang, J. Chen, W. Huang, Q. Ben, T. Wang, B. Mi, T. Huang, S. Zhao, Y . Chen, S. Yang, P. Cao, W. Yu, Z. Ye, J. Li, J. Long, Z. Wang, H. Wang, Y . Zhao, Z. Tu, Y . Qiao, D. Lin, and J. Pang. Grutopia: Dream general robots in a city at scale.arXiv preprint, 2024
2024
-
[47]
F. Yang, C. Wang, C. Cadena, and M. Hutter. iplanner: Imperative path planning. InProceed- ings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023
2023
-
[48]
P. Roth, J. Nubert, F. Yang, M. Mittal, and M. Hutter. Viplanner: Visual semantic impera- tive learning for local navigation. In2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024. 11 A Overview The supplementary material is organized as follows: • Section B provides the formal problem definition. • Section C provides addition...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.