Recognition: unknown
Cross-Modal Navigation with Multi-Agent Reinforcement Learning
Pith reviewed 2026-05-08 08:40 UTC · model grok-4.3
The pith
Multi-agent reinforcement learning lets modality-specialized agents collaborate on navigation tasks more effectively than single models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cross-modal navigation scales better when multiple modality-specialized agents are trained to collaborate through multi-agent reinforcement learning than when a single model must handle all modalities at once. Collaboration is driven by control-relevant auxiliary beliefs that align the agents' actions and by a centralized multi-modal critic with access to global state.
What carries the argument
CRONA, a multi-agent reinforcement learning framework that equips each agent with control-relevant auxiliary beliefs and trains them under a centralized multi-modal critic with access to global state.
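The paper's abstract does not spell out architectural details, so the following minimal numpy sketch only illustrates the CTDE pattern described here: two modality-specialized actors act from their own observations plus a local auxiliary belief, while a centralized critic, used in training only, scores the global state and joint action. All class names, dimensions, and the greedy policy are illustrative assumptions, not CRONA's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

class ModalityAgent:
    """Decentralized actor: sees only its own modality plus a local
    auxiliary belief vector (names and sizes are illustrative)."""
    def __init__(self, obs_dim, belief_dim, n_actions):
        self.W = rng.normal(0, 0.1, (n_actions, obs_dim + belief_dim))
        self.belief = np.zeros(belief_dim)  # local, control-relevant estimate

    def act(self, obs):
        # Action depends only on local information: own obs + own belief.
        logits = self.W @ np.concatenate([obs, self.belief])
        return int(np.argmax(logits))  # greedy choice for illustration

class CentralCritic:
    """Centralized critic: scores global state + joint action.
    Used during training only, never queried at execution time."""
    def __init__(self, state_dim, n_agents):
        self.w = rng.normal(0, 0.1, state_dim + n_agents)

    def value(self, global_state, joint_action):
        return float(self.w @ np.concatenate([global_state, joint_action]))

vision = ModalityAgent(obs_dim=8, belief_dim=4, n_actions=3)
audio = ModalityAgent(obs_dim=6, belief_dim=4, n_actions=3)
critic = CentralCritic(state_dim=8 + 6, n_agents=2)

v_obs, a_obs = rng.normal(size=8), rng.normal(size=6)
joint = np.array([vision.act(v_obs), audio.act(a_obs)], dtype=float)
q = critic.value(np.concatenate([v_obs, a_obs]), joint)
```

Each actor is deployable on its own, consistent with the flexible-deployment claim; only the critic ever touches the concatenated global state.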
If this is right
- Homogeneous collaboration among agents with limited modalities is sufficient for short-range navigation when salient cues are available.
- Heterogeneous collaboration among agents with complementary modalities is generally efficient and effective across tasks.
- Navigation in large, complex environments requires richer multi-modal perception together with increased model capacity.
- Lightweight specialized agents can be deployed flexibly and executed in parallel while preserving each modality's strengths.
Where Pith is reading between the lines
- The approach may reduce the need for perfectly aligned multi-modal datasets during training since each agent can learn from its own sensor stream.
- Separate agents could be placed on different hardware platforms, enabling truly parallel sensing and decision making on physical robots.
- The same specialization pattern might extend to other embodied problems such as object manipulation or multi-robot coordination where sensor types differ.
Load-bearing premise
The proposed auxiliary beliefs and centralized critic will produce better collaboration among the agents without creating new coordination failures or demanding extensive tuning beyond what is shown in the experiments.
What would settle it
A direct replication of the visual-acoustic navigation experiments in which CRONA's multi-agent teams fail to outperform the single-agent baselines on success rate or efficiency.
Original abstract
Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CRONA, a multi-agent reinforcement learning (MARL) framework for cross-modal navigation. It uses control-relevant auxiliary beliefs and a centralized multi-modal critic with global state to enable collaboration among modality-specialized agents, avoiding monolithic models. Experiments on visual-acoustic navigation tasks are reported to show that multi-agent methods significantly outperform single-agent baselines in performance and efficiency. The authors conclude that homogeneous collaboration with limited modalities suffices for short-range navigation under salient cues, heterogeneous collaboration is generally efficient and effective, and large complex environments require richer multi-modal perception plus increased model capacity.
Significance. If the results hold under rigorous evaluation, the work offers a scalable paradigm for multi-modal embodied navigation by decomposing into lightweight specialized agents with flexible deployment. This could be significant for real-world robotics where aligned multi-modal data is scarce, and the empirical distinctions between collaboration modes provide actionable insights for MARL design in navigation.
major comments (2)
- [§3.2] Centralized critic and auxiliary beliefs: The mechanism ensuring cross-modal collaboration at decentralized execution time is not specified. Standard CTDE uses the critic only during training; without described belief propagation, explicit communication, or policy input modifications, it is unclear how heterogeneous agents with complementary modalities realize the claimed collaboration in partially observable large environments rather than reverting to independent single-modality behavior.
- [§4] Experiments: The section provides insufficient detail on task definitions, baseline algorithms, exact metrics, number of independent runs, statistical tests, and ablations isolating the auxiliary beliefs and centralized critic. Without these, the claims of 'significant improvements' and the specific findings on homogeneous vs. heterogeneous collaboration cannot be evaluated as load-bearing evidence.
minor comments (2)
- The abstract would be strengthened by including at least one quantitative performance delta or metric to ground the 'significant improvements' claim.
- [§3] Notation for the auxiliary beliefs and how they interface with the policy network should be formalized with an equation or diagram for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have updated the manuscript to improve clarity and experimental rigor where appropriate.
Point-by-point responses
-
Referee: [§3.2] Centralized critic and auxiliary beliefs: The mechanism ensuring cross-modal collaboration at decentralized execution time is not specified. Standard CTDE uses the critic only during training; without described belief propagation, explicit communication, or policy input modifications, it is unclear how heterogeneous agents with complementary modalities realize the claimed collaboration in partially observable large environments rather than reverting to independent single-modality behavior.
Authors: We appreciate this clarification request. In CRONA, the centralized multi-modal critic operates exclusively during training, providing a global state and cross-modal view that shapes the learning of each agent's decentralized policy. This follows standard CTDE (e.g., MADDPG-style) where the critic enables implicit coordination: each modality-specialized policy is optimized to exploit complementary cues without needing explicit communication or belief sharing at execution. The auxiliary beliefs are local, control-relevant representations updated from each agent's own observations, which help maintain useful state estimates. We have added a dedicated paragraph and a training-vs-execution diagram in §3.2 to make this explicit. We agree that in very large or highly occluded environments, performance may still benefit from richer mechanisms, consistent with our conclusions on model capacity. revision: partial
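The authors describe a local, control-relevant belief updated only from an agent's own observations. A simple recursive filter can sketch this idea; the exponential-moving-average form and the alpha value below are illustrative assumptions, since the rebuttal does not give the actual belief dynamics.

```python
import numpy as np

def update_belief(belief, obs_feat, alpha=0.2):
    """Recursive local belief update: blend the previous belief with the
    agent's newest observation feature. No cross-agent communication is
    required, matching the decentralized-execution claim."""
    return (1.0 - alpha) * belief + alpha * obs_feat

# An agent repeatedly observing the same cue converges toward it.
belief = np.zeros(4)
for _ in range(5):
    belief = update_belief(belief, np.ones(4))
# after 5 steps each component equals 1 - 0.8**5 ≈ 0.672
```

The point of the sketch is the information flow: the belief is a running state estimate built purely from the agent's own sensor stream, which is what lets execution remain decentralized while the centralized critic shapes training.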
-
Referee: [§4] Experiments: The section provides insufficient detail on task definitions, baseline algorithms, exact metrics, number of independent runs, statistical tests, and ablations isolating the auxiliary beliefs and centralized critic. Without these, the claims of 'significant improvements' and the specific findings on homogeneous vs. heterogeneous collaboration cannot be evaluated as load-bearing evidence.
Authors: We agree that additional experimental details are required for reproducibility and to support the claims. In the revised manuscript we have expanded §4 to include: full task specifications (environment dimensions, cue saliency levels, episode horizons, and success criteria); complete baseline descriptions (single-agent RL variants with modality concatenation or fusion, plus ablated multi-agent versions); precise metrics (success rate, SPL, navigation efficiency, and collision counts); number of independent runs (5 random seeds with mean and standard deviation reported); statistical tests (paired t-tests with p-values); and targeted ablations that remove auxiliary beliefs or replace the centralized critic with decentralized alternatives. These changes allow direct evaluation of the reported improvements and the homogeneous vs. heterogeneous collaboration findings. revision: yes
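The evaluation protocol sketched in this response (per-episode SPL, per-seed means, paired t-tests across 5 seeds) can be illustrated as follows. The numbers are hypothetical placeholders, not results from the paper; SPL follows the standard Anderson et al. (2018) definition.

```python
import math
from statistics import mean, stdev

def spl(successes, shortest, taken):
    """Success weighted by Path Length: (1/N) * sum(S_i * l_i / max(p_i, l_i)),
    where l_i is the shortest-path length and p_i the path actually taken."""
    return mean(s * l / max(p, l)
                for s, l, p in zip(successes, shortest, taken))

# Three hypothetical episodes: success flag, shortest path, path taken.
episode_spl = spl([1, 0, 1], [5.0, 3.0, 4.0], [6.0, 3.0, 8.0])

# Hypothetical per-seed success rates over 5 seeds for two methods.
crona = [0.82, 0.79, 0.85, 0.81, 0.80]
single = [0.70, 0.68, 0.74, 0.71, 0.69]
diffs = [a - b for a, b in zip(crona, single)]

# Paired t statistic: mean seed-wise difference over its standard error.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

Pairing by seed is the natural choice here because both methods share each seed's environment draws, so the test compares matched runs rather than independent samples.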
Circularity Check
No circularity; empirical MARL framework with experimental validation
Full rationale
The paper introduces CRONA as a practical MARL framework leveraging auxiliary beliefs and a centralized multi-modal critic for cross-modal navigation, with all central claims resting on reported experimental outcomes comparing multi-agent performance to single-agent baselines across visual-acoustic tasks. No derivation chain, first-principles prediction, or uniqueness theorem is presented that reduces by construction to fitted parameters, self-definitions, or self-citations. The work is self-contained as an empirical study against external benchmarks, with no load-bearing steps that equate outputs to inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
AI2-THOR: An Interactive 3D Environment for Visual AI
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474, 2017
2017
-
[2]
Habitat: A Platform for Embodied AI Research
Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019
2019
-
[3]
Habitat 2.0: Training home assistants to rearrange their habitat
Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...
2021
-
[4]
Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018
2018
-
[5]
A survey of embodied ai: From simulators to research tasks
Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022
2022
-
[6]
Multi-view 3d object detection network for autonomous driving
Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017
2017
-
[7]
Visual dialog
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017
2017
-
[8]
Joint 3d proposal generation and object detection from view aggregation
Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 1–8. IEEE, 2018
2018
-
[9]
Frustum pointnets for 3d object detection from rgb-d data
Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018
2018
-
[10]
Cliport: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022
2022
-
[11]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
2022
-
[12]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
2023
-
[13]
Multimodal transformer for unaligned multimodal language sequences
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 6558–6569, 2019
2019
-
[14]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021
2021
-
[15]
Soundspaces: Audio-visual navigation in 3d environments
Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. Soundspaces: Audio-visual navigation in 3d environments. In European conference on computer vision, pages 17–36. Springer, 2020
2020
-
[16]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020
2020
-
[17]
Alfred: A benchmark for interpreting grounded instructions for everyday tasks
Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10740–10749, 2020
2020
-
[18]
What makes training multi-modal classification networks hard?
Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12695–12705, 2020
2020
-
[19]
Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
Nan Wu, Stanislaw Jastrzebski, Kyunghyun Cho, and Krzysztof J Geras. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In International Conference on Machine Learning, pages 24043–24055. PMLR, 2022
2022
-
[20]
Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably)
Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang. Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably). In International conference on machine learning, pages 9226–9259. PMLR, 2022
2022
-
[21]
Balanced multimodal learning via on-the-fly gradient modulation
Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8238–8247, 2022
2022
-
[22]
Vilt: Vision-and-language transformer without convolution or region supervision
Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International conference on machine learning, pages 5583–5594. PMLR, 2021
2021
-
[23]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024
2024
-
[24]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023
2023
-
[25]
Coordination for multi-robot exploration and mapping
Reid Simmons, David Apfelbaum, Wolfram Burgard, Dieter Fox, Mark Moors, Sebastian Thrun, and Håkan Younes. Coordination for multi-robot exploration and mapping. In AAAI/IAAI, pages 852–858, 2000
2000
-
[26]
Alliance: An architecture for fault tolerant multirobot cooperation
Lynne E Parker. Alliance: An architecture for fault tolerant multirobot cooperation. IEEE transactions on robotics and automation, 14(2):220–240, 2002
2002
-
[27]
Coordinated multi-robot exploration
Wolfram Burgard, Mark Moors, Cyrill Stachniss, and Frank E Schneider. Coordinated multi-robot exploration. IEEE Transactions on robotics, 21(3):376–386, 2005
2005
-
[28]
Attention-based fault-tolerant approach for multi-agent reinforcement learning systems
Shanzhi Gu, Mingyang Geng, and Long Lan. Attention-based fault-tolerant approach for multi-agent reinforcement learning systems. Entropy, 23(9):1133, 2021
2021
-
[29]
Decentralized autonomous navigation of a uav network for road traffic monitoring
Hailong Huang, Andrey V Savkin, and Chao Huang. Decentralized autonomous navigation of a uav network for road traffic monitoring. IEEE Transactions on Aerospace and Electronic Systems, 57(4):2558–2564, 2021
2021
-
[30]
Fully decentralized cooperative navigation for spacecraft constellations
Tong Qin, Malcolm Macdonald, and Dong Qiao. Fully decentralized cooperative navigation for spacecraft constellations. IEEE Transactions on Aerospace and Electronic Systems, 57(4):2383–2394, 2021
2021
-
[31]
Swarm cooperative navigation using centralized training and decentralized execution
Rana Azzam, Igor Boiko, and Yahya Zweiri. Swarm cooperative navigation using centralized training and decentralized execution. Drones, 7(3):193, 2023
2023
-
[32]
Learning multi-robot decentralized macro-action-based policies via a centralized q-net
Yuchen Xiao, Joshua Hoffman, Tian Xia, and Christopher Amato. Learning multi-robot decentralized macro-action-based policies via a centralized q-net. In 2020 IEEE International conference on robotics and automation (ICRA), pages 10695–10701. IEEE, 2020
2020
-
[33]
Asynchronous actor-critic for multi-agent reinforcement learning
Yuchen Xiao, Weihao Tan, and Christopher Amato. Asynchronous actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 35:4385–4400, 2022
2022
-
[34]
Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment
Yuntao Xue and Weisheng Chen. Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment. IEEE Transactions on Intelligent Vehicles, 9(1):2290–2303, 2023
2023
-
[35]
Multi-robot cooperative socially-aware navigation using multi-agent reinforcement learning
Weizheng Wang, Le Mao, Ruiqi Wang, and Byung-Cheol Min. Multi-robot cooperative socially-aware navigation using multi-agent reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12353–12360. IEEE, 2024
2024
-
[36]
Collaborative visual navigation
Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, and Liwei Wang. Collaborative visual navigation. arXiv preprint arXiv:2107.01151, 2021
-
[37]
Advancing audio-visual navigation through multi-agent collaboration in 3d environments
Hailong Zhang, Yinfeng Yu, Liejun Wang, Fuchun Sun, and Wendong Zheng. Advancing audio-visual navigation through multi-agent collaboration in 3d environments. In International Conference on Neural Information Processing, pages 502–516. Springer, 2025
2025
-
[38]
Conavbench: Collaborative long-horizon vision-language navigation benchmark
Tianhang Wang, Xinhai Li, Fan Lu, Tianshi Gong, Jiankun Dong, Weiyi Xue, Sanqing Qu, Chenjia Bai, and Guang Chen. Conavbench: Collaborative long-horizon vision-language navigation benchmark. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[39]
Conav: Collaborative cross-modal reasoning for embodied navigation
Haihong Hao, Mingfei Han, Changlin Li, Zhihui Li, and Xiaojun Chang. Conav: Collaborative cross-modal reasoning for embodied navigation. arXiv preprint arXiv:2505.16663, 2025
-
[40]
Caml: Collaborative auxiliary modality learning for multi-agent systems
Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, and Ming Lin. Caml: Collaborative auxiliary modality learning for multi-agent systems. arXiv preprint arXiv:2502.17821, 2025
-
[41]
Semantic collaborative learning for cross-modal moment localization
Yupeng Hu, Kun Wang, Meng Liu, Haoyu Tang, and Liqiang Nie. Semantic collaborative learning for cross-modal moment localization. ACM Transactions on Information Systems, 42(2):1–26, 2023
2023
-
[42]
Optimal and approximate q-value functions for decentralized pomdps
Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate q-value functions for decentralized pomdps. Journal of Artificial Intelligence Research, 32:289–353, 2008
2008
-
[43]
A Concise Introduction to Decentralized POMDPs
Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016
2016
-
[44]
Beyond the nav-graph: Vision-and-language navigation in continuous environments
Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer, 2020
2020
-
[45]
Speaker-follower models for vision-and-language navigation
Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Advances in neural information processing systems, volume 31, 2018
2018
-
[46]
Towards learning a generic agent for vision-and-language navigation via pre-training
Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13137–13146, 2020
2020
-
[47]
Improving vision-and-language navigation with image-text pairs from the web
Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020
2020
-
[48]
Vln bert: A recurrent vision-and-language bert for navigation
Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. Vln bert: A recurrent vision-and-language bert for navigation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1643–1653, 2021
2021
-
[49]
History aware multimodal transformer for vision-and-language navigation
Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. Advances in neural information processing systems, 34:5834–5847, 2021
2021
-
[50]
Mobile robot navigation based on lidar
Yi Cheng and Gong Ye Wang. Mobile robot navigation based on lidar. In 2018 Chinese control and decision conference (CCDC), pages 1243–1246. IEEE, 2018
2018
-
[51]
Semantic audio-visual navigation
Changan Chen, Ziad Al-Halah, and Kristen Grauman. Semantic audio-visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15516–15525, 2021
2021
-
[52]
Avlen: Audio-visual-language embodied navigation in 3d environments
Sudipta Paul, Amit Roy-Chowdhury, and Anoop Cherian. Avlen: Audio-visual-language embodied navigation in 3d environments. Advances in Neural Information Processing Systems, 35:6236–6249, 2022
2022
-
[53]
Learning to set waypoints for audio-visual navigation
Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. arXiv preprint arXiv:2008.09622, 2020
-
[54]
Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review
Shanliang Yao, Runwei Guan, Xiaoyu Huang, Zhuoxiao Li, Xiangyu Sha, Yong Yue, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles, 9(1):2094–2128, 2023
2023
-
[55]
Gram: Spatial general-purpose audio representations for real-world environments
Goksenin Yuksel, Marcel van Gerven, and Kiki van der Heijden. Gram: Spatial general-purpose audio representations for real-world environments. arXiv preprint arXiv:2602.03307, 2026
-
[56]
The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation
Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton Van Den Hengel, and Qi Wu. The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1655–1664, 2021
2021
-
[57]
Visual language maps for robot navigation
Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023
2023
-
[58]
L3mvn: Leveraging large language models for visual target navigation
Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target navigation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3554–3560. IEEE, 2023
2023
-
[59]
Safe multirobot navigation within dynamics constraints
James R Bruce and Manuela M Veloso. Safe multirobot navigation within dynamics constraints. Proceedings of the IEEE, 94(7):1398–1411, 2006
2006
-
[60]
Centralized path planning for multiple robots: Optimal decoupling into sequential plans
Jur van Den Berg, Jack Snoeyink, Ming C Lin, and Dinesh Manocha. Centralized path planning for multiple robots: Optimal decoupling into sequential plans. In Robotics: Science and systems, volume 2, pages 2–3, 2009
2009
-
[61]
Cloud based centralized task control for human domain multi-robot operations
Rob Janssen, René van de Molengraft, Herman Bruyninckx, and Maarten Steinbuch. Cloud based centralized task control for human domain multi-robot operations. Intelligent Service Robotics, 9(1):63–77, 2016
2016
-
[62]
Decentralized prioritized planning in large multirobot teams
Prasanna Velagapudi, Katia Sycara, and Paul Scerri. Decentralized prioritized planning in large multirobot teams. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4603–4609. IEEE, 2010
2010
-
[63]
Efficient and complete centralized multi-robot path planning
Ryan Luna and Kostas E Bekris. Efficient and complete centralized multi-robot path planning. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3268–3275. IEEE, 2011
2011
-
[64]
Actor-attention-critic for multi-agent reinforcement learning
Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In International conference on machine learning, pages 2961–2970. PMLR, 2019
2019
-
[65]
Multi-agent reinforcement learning: A selective overview of theories and algorithms
Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of reinforcement learning and control, pages 321–384, 2021
2021
-
[66]
Multi-Agent Reinforcement Learning: Foundations and Modern Approaches
Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024
2024
-
[67]
A survey of progress on cooperative multi-agent reinforcement learning in open environment
Lei Yuan, Ziqian Zhang, Lihe Li, Cong Guan, and Yang Yu. A survey of progress on cooperative multi-agent reinforcement learning in open environment. arXiv preprint arXiv:2312.01058, 2023
-
[68]
Multi-agent reinforcement learning: Independent vs. cooperative agents
Ming Tan et al. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993
1993
-
[69]
Learning to cooperate via policy search
Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling. Learning to cooperate via policy search. arXiv preprint cs/0105032, 2001
-
[70]
The dynamics of reinforcement learning in cooperative multiagent systems
Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998
1998
-
[71]
A selection-mutation model for q-learning in multi-agent systems
Karl Tuyls, Katja Verbeeck, and Tom Lenaerts. A selection-mutation model for q-learning in multi-agent systems. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems, pages 693–700, 2003
2003
-
[72]
Classes of multiagent q-learning dynamics with epsilon-greedy exploration
Michael Wunder, Michael L Littman, and Monica Babes. Classes of multiagent q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1167–1174, 2010
2010
-
[73]
An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning
Christopher Amato. An introduction to centralized training for decentralized execution in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2409.03052, 2024
-
[74]
Multi-agent actor-critic for mixed cooperative-competitive environments
Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017
2017
-
[75]
The surprising effectiveness of ppo in cooperative multi-agent games
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. In Advances in Neural Information Processing Systems, volume 35, pages 24611–24624. Curran Associates, Inc., 2022
2022
-
[76]
Counterfactual multi-agent policy gradients
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, April 2018
2018
-
[77]
Contrasting centralized and decentralized critics in multi-agent reinforcement learning
Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402, 2021
-
[78]
On centralized critics in multi-agent reinforcement learning
Xueguang Lyu, Andrea Baisero, Yuchen Xiao, Brett Daley, and Christopher Amato. On centralized critics in multi-agent reinforcement learning. Journal of Artificial Intelligence Research, 77:295–354, 2023
2023
-
[79]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
2016
-
[80]
Matterport3d: Learning from rgb-d data in indoor environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017
2017