pith. machine review for the scientific record.

arxiv: 2605.06595 · v1 · submitted 2026-05-07 · 💻 cs.RO · cs.AI · cs.LG · cs.MA

Recognition: unknown

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:40 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG · cs.MA
keywords multi-agent reinforcement learning · cross-modal navigation · embodied navigation · visual-acoustic navigation · collaborative agents · modality specialization

The pith

Multi-agent reinforcement learning lets modality-specialized agents collaborate on navigation tasks, outperforming monolithic single-agent models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRONA, a multi-agent reinforcement learning framework in which separate lightweight agents each focus on one sensory modality such as vision or acoustics. These agents share control-relevant auxiliary beliefs and are guided by a centralized critic that observes the full multi-modal state, allowing them to coordinate without any single agent needing to process every sensor type at once. Experiments on visual-acoustic navigation show that the multi-agent approach yields higher success rates and greater efficiency than monolithic single-agent baselines. The work also reports that uniform agents suffice for short-range trips with clear signals, while agents with complementary modalities perform well in broader settings, and that large complex spaces demand both richer sensory inputs and greater model size.

Core claim

Cross-modal navigation can be achieved more scalably by training multiple modality-specialized agents to collaborate through multi-agent reinforcement learning, using control-relevant auxiliary beliefs to align their actions and a centralized multi-modal critic that accesses global state information, rather than forcing a single model to handle all modalities simultaneously.

What carries the argument

CRONA, a multi-agent reinforcement learning framework that equips each agent with control-relevant auxiliary beliefs and trains them under a centralized multi-modal critic with access to global state.
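
A minimal sketch of what such a modality-specialized agent could look like, written in PyTorch for illustration only; the class name, dimensions, and recurrent encoder are assumptions, not the paper's actual implementation. Each agent encodes only its own observation-action history, predicts a control-relevant auxiliary belief, and outputs a decentralized policy.

import torch
import torch.nn as nn

class ModalityAgent(nn.Module):
    """One lightweight agent that sees a single sensory modality (hypothetical)."""

    def __init__(self, obs_dim: int, belief_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Encodes the agent's own observation-action history for one modality.
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        # Predicts a control-relevant auxiliary belief (e.g., goal direction or distance).
        self.belief_head = nn.Linear(hidden, belief_dim)
        # Decentralized policy conditioned only on the local history embedding.
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_history: torch.Tensor):
        _, h = self.encoder(obs_history)   # h: (1, batch, hidden)
        h = h.squeeze(0)
        belief = self.belief_head(h)       # auxiliary belief prediction
        logits = self.policy_head(h)       # action logits for the local policy
        return logits, belief, h

# Example: a vision agent and an audio agent with different input sizes.
vision_agent = ModalityAgent(obs_dim=512, belief_dim=4, n_actions=4)
audio_agent = ModalityAgent(obs_dim=128, belief_dim=4, n_actions=4)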

If this is right

  • Homogeneous collaboration among agents with limited modalities is sufficient for short-range navigation when salient cues are available.
  • Heterogeneous collaboration among agents with complementary modalities is generally efficient and effective across tasks.
  • Navigation in large, complex environments requires richer multi-modal perception together with increased model capacity.
  • Lightweight specialized agents can be deployed flexibly and executed in parallel while preserving each modality's strengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce the need for perfectly aligned multi-modal datasets during training since each agent can learn from its own sensor stream.
  • Separate agents could be placed on different hardware platforms, enabling truly parallel sensing and decision making on physical robots.
  • The same specialization pattern might extend to other embodied problems such as object manipulation or multi-robot coordination where sensor types differ.

Load-bearing premise

The proposed auxiliary beliefs and centralized critic will produce better collaboration among the agents without creating new coordination failures or demanding extensive tuning beyond what is shown in the experiments.

What would settle it

A direct replication of the visual-acoustic navigation experiments in which the multi-agent CRONA agents fail to outperform the single-agent baselines on success rate or efficiency measures.

Figures

Figures reproduced from arXiv: 2605.06595 by Christopher Amato, Shuo Liu, Xinzichen Li.

Figure 1. A collaborative navigation task in a Ranch scene from Matterport3D. An audio agent (blue) collaborates with a vision agent (green) to locate a table and pictures. Each agent receives only local observations during execution, while global information is captured by a global monitor (yellow) and used only during training. Gray curves denote agents' trajectories.
Figure 2. Illustration of the CRONA framework. Two decentralized agents, one with audio inputs (blue) and another with vision inputs (green), cooperate to navigate toward a table with silverware-dropping sounds and pictures with camera-shutter sounds. (a) Observation-action history embeddings and auxiliary belief predictors of the agents. (b) A multi-modal critic (red) estimates the value with the joint history, the auxiliary bel…
Figure 3. Evaluation of CRONA and collaborative navigation baselines across 5 …
Figure 4. Bird's-eye views of Matterport3D scenes.
Figure 5. The corresponding source spectrograms. (a) Dragging Chair (b) Table with Silverware (c) Picture Shutter (d) Sink Dripping (e) Coin Drop on Counter (f) Chest of Drawers (g) Creaking Bed
Figure 6. Illustration of example episodes.
Figure 7. Additional evaluation of CRONA and collaborative navigation baselines across 5 …
read the original abstract

Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CRONA, a multi-agent reinforcement learning (MARL) framework for cross-modal navigation. It uses control-relevant auxiliary beliefs and a centralized multi-modal critic with global state to enable collaboration among modality-specialized agents, avoiding monolithic models. Experiments on visual-acoustic navigation tasks are reported to show that multi-agent methods significantly outperform single-agent baselines in performance and efficiency. The authors conclude that homogeneous collaboration with limited modalities suffices for short-range navigation under salient cues, heterogeneous collaboration is generally efficient and effective, and large complex environments require richer multi-modal perception plus increased model capacity.

Significance. If the results hold under rigorous evaluation, the work offers a scalable paradigm for multi-modal embodied navigation by decomposing into lightweight specialized agents with flexible deployment. This could be significant for real-world robotics where aligned multi-modal data is scarce, and the empirical distinctions between collaboration modes provide actionable insights for MARL design in navigation.

major comments (2)
  1. [§3.2] Centralized critic and auxiliary beliefs: The mechanism ensuring cross-modal collaboration at decentralized execution time is not specified. Standard CTDE uses the critic only during training; without described belief propagation, explicit communication, or policy input modifications, it is unclear how heterogeneous agents with complementary modalities realize the claimed collaboration in partially observable large environments rather than reverting to independent single-modality behavior.
  2. [§4] Experiments: The section provides insufficient detail on task definitions, baseline algorithms, exact metrics, number of independent runs, statistical tests, and ablations isolating the auxiliary beliefs and centralized critic. Without these, the claims of 'significant improvements' and the specific findings on homogeneous vs. heterogeneous collaboration cannot be evaluated as load-bearing evidence.
minor comments (2)
  1. The abstract would be strengthened by including at least one quantitative performance delta or metric to ground the 'significant improvements' claim.
  2. [§3] Notation for the auxiliary beliefs and how they interface with the policy network should be formalized with an equation or diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have updated the manuscript to improve clarity and experimental rigor where appropriate.

read point-by-point responses
  1. Referee: [§3.2] Centralized critic and auxiliary beliefs: The mechanism ensuring cross-modal collaboration at decentralized execution time is not specified. Standard CTDE uses the critic only during training; without described belief propagation, explicit communication, or policy input modifications, it is unclear how heterogeneous agents with complementary modalities realize the claimed collaboration in partially observable large environments rather than reverting to independent single-modality behavior.

    Authors: We appreciate this clarification request. In CRONA, the centralized multi-modal critic operates exclusively during training, providing a global state and cross-modal view that shapes the learning of each agent's decentralized policy. This follows standard CTDE (e.g., MADDPG-style) where the critic enables implicit coordination: each modality-specialized policy is optimized to exploit complementary cues without needing explicit communication or belief sharing at execution. The auxiliary beliefs are local, control-relevant representations updated from each agent's own observations, which help maintain useful state estimates. We have added a dedicated paragraph and a training-vs-execution diagram in §3.2 to make this explicit. We agree that in very large or highly occluded environments, performance may still benefit from richer mechanisms, consistent with our conclusions on model capacity. revision: partial

  2. Referee: [§4] Experiments: The section provides insufficient detail on task definitions, baseline algorithms, exact metrics, number of independent runs, statistical tests, and ablations isolating the auxiliary beliefs and centralized critic. Without these, the claims of 'significant improvements' and the specific findings on homogeneous vs. heterogeneous collaboration cannot be evaluated as load-bearing evidence.

    Authors: We agree that additional experimental details are required for reproducibility and to support the claims. In the revised manuscript we have expanded §4 to include: full task specifications (environment dimensions, cue saliency levels, episode horizons, and success criteria); complete baseline descriptions (single-agent RL variants with modality concatenation or fusion, plus ablated multi-agent versions); precise metrics (success rate, SPL, navigation efficiency, and collision counts); number of independent runs (5 random seeds with mean and standard deviation reported); statistical tests (paired t-tests with p-values); and targeted ablations that remove auxiliary beliefs or replace the centralized critic with decentralized alternatives. These changes allow direct evaluation of the reported improvements and the homogeneous vs. heterogeneous collaboration findings. revision: yes
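
To make the training-versus-execution split described in the first response concrete, here is a minimal CTDE-style sketch in PyTorch. Class, names, and shapes are illustrative assumptions, not CRONA's code: a centralized critic consumes global state and the agents' joint auxiliary beliefs during training only, while execution relies solely on each agent's local policy.

import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Value function over global state and joint auxiliary beliefs (training only)."""

    def __init__(self, global_dim: int, joint_belief_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_dim + joint_belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, global_state, joint_beliefs):
        return self.net(torch.cat([global_state, joint_beliefs], dim=-1)).squeeze(-1)

# Training time: the centralized critic shapes each decentralized policy's gradient.
critic = CentralCritic(global_dim=32, joint_belief_dim=8)
global_state = torch.randn(16, 32)    # visible to the critic only during training
joint_beliefs = torch.randn(16, 8)    # concatenated per-agent auxiliary beliefs
returns = torch.randn(16)             # bootstrapped or Monte Carlo returns

values = critic(global_state, joint_beliefs)
critic_loss = (values - returns).pow(2).mean()
advantage = (returns - values).detach()   # weights each agent's local policy-gradient loss

# Execution time: no critic and no global state. Each agent acts from its own
# observation history alone, e.g. sampling from the logits of its local policy network.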
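
For the metrics and statistics listed in the second response, the sketch below shows how Success weighted by Path Length (SPL) and a paired t-test across seeds are commonly computed; all numbers are placeholders, not values from the paper.

import numpy as np
from scipy import stats

def spl(successes, shortest_paths, taken_paths):
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    s = np.asarray(successes, dtype=float)
    l = np.asarray(shortest_paths, dtype=float)
    p = np.asarray(taken_paths, dtype=float)
    return float(np.mean(s * l / np.maximum(p, l)))

# Per-seed success rates for a multi-agent method and a single-agent baseline
# (dummy placeholder values, not reported numbers).
method_seeds = np.array([0.71, 0.68, 0.74, 0.70, 0.72])
baseline_seeds = np.array([0.62, 0.60, 0.65, 0.61, 0.63])

result = stats.ttest_rel(method_seeds, baseline_seeds)  # paired t-test across seeds
print(f"SPL on 3 example episodes: {spl([1, 1, 0], [5.0, 8.0, 6.0], [6.0, 8.0, 10.0]):.3f}")
print(f"paired t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")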

Circularity Check

0 steps flagged

No circularity; empirical MARL framework with experimental validation

full rationale

The paper introduces CRONA as a practical MARL framework leveraging auxiliary beliefs and a centralized multi-modal critic for cross-modal navigation, with all central claims resting on reported experimental outcomes comparing multi-agent performance to single-agent baselines across visual-acoustic tasks. No derivation chain, first-principles prediction, or uniqueness theorem is presented that reduces by construction to fitted parameters, self-definitions, or self-citations. The work is self-contained as an empirical study against external benchmarks, with no load-bearing steps that equate outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an empirical MARL framework and does not introduce or rely on explicit mathematical derivations, free parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5486 in / 1235 out tokens · 30075 ms · 2026-05-08T08:40:04.186751+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

81 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai.arXiv preprint arXiv:1712.05474, 2017

  2. [2]

    Habitat: A Platform for Embodied AI Research

    Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  3. [3]

    Habitat 2.0: Training home assistants to rearrange their habitat

    Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...

  4. [4]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018

  5. [5]

    A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

    Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

  6. [6]

    Multi-view 3d object detection network for autonomous driving

    Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017

  7. [7]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017

  8. [8]

    Joint 3d proposal generation and object detection from view aggregation

    Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 1–8. IEEE, 2018

  9. [9]

    Frustum pointnets for 3d object detection from rgb-d data

    Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018

  10. [10]

    Cliport: What and where pathways for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. InConference on robot learning, pages 894–906. PMLR, 2022

  11. [11]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  12. [12]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

  13. [13]

    Multimodal transformer for unaligned multimodal language sequences

    Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 6558–6569, 2019

  14. [14]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021. 10

  15. [15]

    Soundspaces: Audio-visual navigation in 3d environments

    Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. Soundspaces: Audio-visual navigation in 3d environments. InEuropean conference on computer vision, pages 17–36. Springer, 2020

  16. [16]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  17. [17]

    Alfred: A benchmark for interpreting grounded instructions for everyday tasks

    Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mot- taghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020

  18. [18]

    What makes training multi-modal classification networks hard? InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12695–12705, 2020

    Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12695–12705, 2020

  19. [19]

    Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks

    Nan Wu, Stanislaw Jastrzebski, Kyunghyun Cho, and Krzysztof J Geras. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. InInternational Conference on Machine Learning, pages 24043–24055. PMLR, 2022

  20. [20]

    Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably)

    Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang. Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably). In International Conference on Machine Learning, pages 9226–9259. PMLR, 2022

  21. [21]

    Balanced multimodal learning via on-the-fly gradient modulation

    Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. Balanced multimodal learning via on-the-fly gradient modulation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8238–8247, 2022

  22. [22]

    Vilt: Vision-and-language transformer without convolution or region supervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021

  23. [23]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

  24. [24]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  25. [25]

    Coordination for multi-robot exploration and mapping

    Reid Simmons, David Apfelbaum, Wolfram Burgard, Dieter Fox, Mark Moors, Sebastian Thrun, and Håkan Younes. Coordination for multi-robot exploration and mapping. InAaai/Iaai, pages 852–858, 2000

  26. [26]

    Alliance: An architecture for fault tolerant multirobot cooperation.IEEE transactions on robotics and automation, 14(2):220–240, 2002

    Lynne E Parker. Alliance: An architecture for fault tolerant multirobot cooperation.IEEE transactions on robotics and automation, 14(2):220–240, 2002

  27. [27]

    Coordinated multi-robot exploration

    Wolfram Burgard, Mark Moors, Cyrill Stachniss, and Frank E Schneider. Coordinated multi-robot exploration. IEEE Transactions on Robotics, 21(3):376–386, 2005

  28. [28]

    Attention-based fault-tolerant approach for multi-agent reinforcement learning systems.Entropy, 23(9):1133, 2021

    Shanzhi Gu, Mingyang Geng, and Long Lan. Attention-based fault-tolerant approach for multi-agent reinforcement learning systems.Entropy, 23(9):1133, 2021

  29. [29]

    Decentralized autonomous navigation of a uav network for road traffic monitoring.IEEE Transactions on Aerospace and Electronic Systems, 57(4):2558–2564, 2021

    Hailong Huang, Andrey V Savkin, and Chao Huang. Decentralized autonomous navigation of a uav network for road traffic monitoring.IEEE Transactions on Aerospace and Electronic Systems, 57(4):2558–2564, 2021

  30. [30]

    Fully decentralized cooperative navigation for spacecraft constellations.IEEE Transactions on Aerospace and Electronic Systems, 57(4):2383– 2394, 2021

    Tong Qin, Malcolm Macdonald, and Dong Qiao. Fully decentralized cooperative navigation for spacecraft constellations.IEEE Transactions on Aerospace and Electronic Systems, 57(4):2383– 2394, 2021. 11

  31. [31]

    Swarm cooperative navigation using centralized training and decentralized execution.Drones, 7(3):193, 2023

    Rana Azzam, Igor Boiko, and Yahya Zweiri. Swarm cooperative navigation using centralized training and decentralized execution.Drones, 7(3):193, 2023

  32. [32]

    Learning multi-robot decentralized macro-action-based policies via a centralized q-net

    Yuchen Xiao, Joshua Hoffman, Tian Xia, and Christopher Amato. Learning multi-robot decentralized macro-action-based policies via a centralized q-net. In2020 IEEE International conference on robotics and automation (ICRA), pages 10695–10701. IEEE, 2020

  33. [33]

    Asynchronous actor-critic for multi-agent reinforcement learning.Advances in Neural Information Processing Systems, 35:4385–4400, 2022

    Yuchen Xiao, Weihao Tan, and Christopher Amato. Asynchronous actor-critic for multi-agent reinforcement learning.Advances in Neural Information Processing Systems, 35:4385–4400, 2022

  34. [34]

    Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment.IEEE Transactions on Intelligent Vehicles, 9(1):2290–2303, 2023

    Yuntao Xue and Weisheng Chen. Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment.IEEE Transactions on Intelligent Vehicles, 9(1):2290–2303, 2023

  35. [35]

    Multi-robot cooperative socially-aware navigation using multi-agent reinforcement learning

    Weizheng Wang, Le Mao, Ruiqi Wang, and Byung-Cheol Min. Multi-robot cooperative socially-aware navigation using multi-agent reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12353–12360. IEEE, 2024

  36. [36]

    Collaborative visual navigation.arXiv preprint arXiv:2107.01151, 2021

    Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, and Liwei Wang. Collaborative visual navigation.arXiv preprint arXiv:2107.01151, 2021

  37. [37]

    Advancing audio-visual navigation through multi-agent collaboration in 3d environments

    Hailong Zhang, Yinfeng Yu, Liejun Wang, Fuchun Sun, and Wendong Zheng. Advancing audio-visual navigation through multi-agent collaboration in 3d environments. InInternational Conference on Neural Information Processing, pages 502–516. Springer, 2025

  38. [38]

    Conavbench: Collaborative long-horizon vision-language navigation benchmark

    Tianhang Wang, Xinhai Li, Fan Lu, Tianshi Gong, Jiankun Dong, Weiyi Xue, Sanqing Qu, Chenjia Bai, and Guang Chen. Conavbench: Collaborative long-horizon vision-language navigation benchmark. In The Fourteenth International Conference on Learning Representations, 2026

  39. [39]

    Conav: Collaborative cross-modal reasoning for embodied navigation.arXiv preprint arXiv:2505.16663, 2025

    Haihong Hao, Mingfei Han, Changlin Li, Zhihui Li, and Xiaojun Chang. Conav: Collaborative cross-modal reasoning for embodied navigation.arXiv preprint arXiv:2505.16663, 2025

  40. [40]

    Caml: Collaborative auxiliary modality learning for multi-agent systems

    Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, and Ming Lin. Caml: Collaborative auxiliary modality learning for multi-agent systems.arXiv preprint arXiv:2502.17821, 2025

  41. [41]

    Semantic collaborative learning for cross-modal moment localization.ACM Transactions on Information Systems, 42(2):1–26, 2023

    Yupeng Hu, Kun Wang, Meng Liu, Haoyu Tang, and Liqiang Nie. Semantic collaborative learning for cross-modal moment localization.ACM Transactions on Information Systems, 42(2):1–26, 2023

  42. [42]

    Optimal and approximate q-value functions for decentralized pomdps.Journal of Artificial Intelligence Research, 32:289–353, 2008

    Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate q-value functions for decentralized pomdps.Journal of Artificial Intelligence Research, 32:289–353, 2008

  43. [43]

    A Concise Introduction to Decentralized POMDPs

    Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016

  44. [44]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer, 2020

  45. [45]

    Speaker-follower models for vision-and-language navigation

    Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, volume 31, 2018

  46. [46]

    Towards learning a generic agent for vision-and-language navigation via pre-training

    Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13137–13146, 2020

  47. [47]

    Improving vision-and-language navigation with image-text pairs from the web

    Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. InEuropean Conference on Computer Vision, pages 259–274. Springer, 2020. 12

  48. [48]

    Vln bert: A recurrent vision-and-language bert for navigation

    Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. Vln bert: A recurrent vision-and-language bert for navigation. InProceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1643–1653, 2021

  49. [49]

    History aware multimodal transformer for vision-and-language navigation.Advances in neural information processing systems, 34:5834–5847, 2021

    Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation.Advances in neural information processing systems, 34:5834–5847, 2021

  50. [50]

    Mobile robot navigation based on lidar

    Yi Cheng and Gong Ye Wang. Mobile robot navigation based on lidar. In2018 Chinese control and decision conference (CCDC), pages 1243–1246. IEEE, 2018

  51. [51]

    Semantic audio-visual navigation

    Changan Chen, Ziad Al-Halah, and Kristen Grauman. Semantic audio-visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15516–15525, 2021

  52. [52]

    Avlen: Audio-visual-language embodied navigation in 3d environments.Advances in Neural Information Processing Systems, 35:6236–6249, 2022

    Sudipta Paul, Amit Roy-Chowdhury, and Anoop Cherian. Avlen: Audio-visual-language embodied navigation in 3d environments.Advances in Neural Information Processing Systems, 35:6236–6249, 2022

  53. [53]

    Learning to set waypoints for audio-visual navigation.arXiv preprint arXiv:2008.09622, 2020

    Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation.arXiv preprint arXiv:2008.09622, 2020

  54. [54]

    Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review.IEEE Transactions on Intelligent Vehicles, 9(1):2094–2128, 2023

    Shanliang Yao, Runwei Guan, Xiaoyu Huang, Zhuoxiao Li, Xiangyu Sha, Yong Yue, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review.IEEE Transactions on Intelligent Vehicles, 9(1):2094–2128, 2023

  55. [55]

    Gram: Spatial general-purpose audio representations for real-world environments.arXiv preprint arXiv:2602.03307, 2026

    Goksenin Yuksel, Marcel van Gerven, and Kiki van der Heijden. Gram: Spatial general-purpose audio representations for real-world environments.arXiv preprint arXiv:2602.03307, 2026

  56. [56]

    The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation

    Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton Van Den Hengel, and Qi Wu. The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1655–1664, 2021

  57. [57]

    Visual language maps for robot navigation

    Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023

  58. [58]

    L3mvn: Leveraging large language models for visual target navigation

    Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target navigation. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560. IEEE, 2023

  59. [59]

    Safe multirobot navigation within dynamics constraints

    James R Bruce and Manuela M Veloso. Safe multirobot navigation within dynamics constraints. Proceedings of the IEEE, 94(7):1398–1411, 2006

  60. [60]

    Centralized path planning for multiple robots: Optimal decoupling into sequential plans

    Jur van Den Berg, Jack Snoeyink, Ming C Lin, and Dinesh Manocha. Centralized path planning for multiple robots: Optimal decoupling into sequential plans. InRobotics: Science and systems, volume 2, pages 2–3, 2009

  61. [61]

    Cloud based centralized task control for human domain multi-robot operations.Intelligent Service Robotics, 9(1):63–77, 2016

    Rob Janssen, René van de Molengraft, Herman Bruyninckx, and Maarten Steinbuch. Cloud based centralized task control for human domain multi-robot operations.Intelligent Service Robotics, 9(1):63–77, 2016

  62. [62]

    Decentralized prioritized planning in large multirobot teams

    Prasanna Velagapudi, Katia Sycara, and Paul Scerri. Decentralized prioritized planning in large multirobot teams. In2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4603–4609. IEEE, 2010

  63. [63]

    Efficient and complete centralized multi-robot path planning

    Ryan Luna and Kostas E Bekris. Efficient and complete centralized multi-robot path planning. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3268–3275. IEEE, 2011. 13

  64. [64]

    Actor-attention-critic for multi-agent reinforcement learning

    Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In International conference on machine learning, pages 2961–2970. PMLR, 2019

  65. [65]

    Multi-agent reinforcement learning: A selective overview of theories and algorithms.Handbook of reinforcement learning and control, pages 321–384, 2021

    Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384, 2021

  66. [66]

    Multi-Agent Reinforcement Learning: Foundations and Modern Approaches

    Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024

  67. [67]

    A survey of progress on cooperative multi-agent reinforcement learning in open environment.arXiv preprint arXiv:2312.01058, 2023

    Lei Yuan, Ziqian Zhang, Lihe Li, Cong Guan, and Yang Yu. A survey of progress on cooperative multi-agent reinforcement learning in open environment.arXiv preprint arXiv:2312.01058, 2023

  68. [68]

    Multi-agent reinforcement learning: Independent vs. cooperative agents

    Ming Tan et al. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993

  69. [69]

    Learning to cooperate via policy search.arXiv preprint cs/0105032, 2001

    Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling. Learning to cooperate via policy search.arXiv preprint cs/0105032, 2001

  70. [70]

    The dynamics of reinforcement learning in cooperative multiagent systems.AAAI/IAAI, 1998(746-752):2, 1998

    Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems.AAAI/IAAI, 1998(746-752):2, 1998

  71. [71]

    A selection-mutation model for q-learning in multi-agent systems

    Karl Tuyls, Katja Verbeeck, and Tom Lenaerts. A selection-mutation model for q-learning in multi-agent systems. InProceedings of the second international joint conference on Autonomous agents and multiagent systems, pages 693–700, 2003

  72. [72]

    Classes of multiagent q-learning dynamics with epsilon-greedy exploration

    Michael Wunder, Michael L Littman, and Monica Babes. Classes of multiagent q-learning dynamics with epsilon-greedy exploration. InProceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1167–1174, 2010

  73. [73]

    An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning

    Christopher Amato. An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning.arXiv preprint arXiv:2409.03052, 2024

  74. [74]

    Multi-agent actor-critic for mixed cooperative-competitive environments.Advances in neural information processing systems, 30, 2017

    Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments.Advances in neural information processing systems, 30, 2017

  75. [75]

    The surprising effectiveness of ppo in cooperative multi-agent games

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. InAdvances in Neural Information Processing Systems, volume 35, pages 24611–24624. Curran Associates, Inc., 2022

  76. [76]

    Counterfactual multi-agent policy gradients

    Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. InProceedings of the AAAI Conference on Artificial Intelligence, April 2018

  77. [77]

    Contrasting centralized and decentralized critics in multi-agent reinforcement learning

    Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multi-agent reinforcement learning.arXiv preprint arXiv:2102.04402, 2021

  78. [78]

    On centralized critics in multi-agent reinforcement learning.Journal of Artificial Intelligence Research, 77:295–354, 2023

    Xueguang Lyu, Andrea Baisero, Yuchen Xiao, Brett Daley, and Christopher Amato. On centralized critics in multi-agent reinforcement learning.Journal of Artificial Intelligence Research, 77:295–354, 2023

  79. [79]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  80. [80]

    Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017

Showing first 80 references.