arxiv: 2604.22014 · v1 · submitted 2026-04-23 · 💻 cs.MA · cs.RO

DM³-Nav: Decentralized Multi-Agent Multimodal Multi-Object Semantic Navigation

Amin Kashiri (1) , Atharva Jamsandekar (1) , Yasin Yaz{\i}c{\i}o\u{g}lu (1) ((1) Northeastern University , Boston , USA) This is my paper

Pith reviewed 2026-05-08 13:10 UTC · model grok-4.3

classification 💻 cs.MA cs.RO

keywords decentralized multi-agent navigationsemantic navigationmulti-object missionsimplicit task allocationmulti-robot coordinationopen-vocabulary goalsfrontier selection

0 comments

The pith

DM³-Nav enables fully decentralized multi-agent semantic navigation to match centralized performance through ad-hoc pairwise communication and implicit task allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a system where multiple robots pursue semantic goals in 3D scenes without any central coordinator, global map, or synchronized state. Each robot maintains its own local map and shares only pairwise updates on maps, goal status, and navigation intent as needed. An implicit allocation rule lets agents broadcast intent and pick frontiers weighted by distance to divide multi-object tasks and limit overlap. The approach is shown to work with open-vocabulary multimodal goals and is evaluated in both large simulated benchmarks and a real office setting with two physical robots.

Core claim

DM³-Nav demonstrates that fully decentralized operation, achieved solely through ad-hoc pairwise exchanges of local maps, goal status, and navigation intent without synchronization, combined with distance-weighted frontier selection for implicit task allocation, produces multi-object semantic navigation performance that matches or exceeds centralized and shared-map baselines while removing single points of failure.

What carries the argument

The implicit task allocation mechanism that broadcasts navigation intent and applies distance-weighted frontier selection to coordinate exploration without requiring synchronization or global state.

If this is right

Multi-agent teams can complete multi-object missions with reduced redundant exploration compared to uncoordinated independent operation.
Navigation systems become robust to failure of any single robot or communication link because no global state or central node is required.
The same architecture supports simultaneous multimodal goal inputs across agents without additional coordination overhead.
Real-world deployment is feasible using only onboard sensing and computation, as shown in the two-robot office experiment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The coordination pattern could extend to other decentralized multi-agent problems such as collaborative search or dynamic object tracking where central infrastructure is unavailable.
Performance under intermittent communication links remains an open question that could be tested by injecting random dropouts into the existing pairwise exchange protocol.
Replacing the distance-weighted frontier rule with learned policies might further improve allocation efficiency while preserving the fully decentralized constraint.

Load-bearing premise

Ad-hoc pairwise communication of local information without any synchronization is enough to achieve effective task division and avoid conflicts or redundant exploration across agents.

What would settle it

A trial in which three or more robots are given overlapping goals and produce measurably higher rates of redundant frontier visits or mission timeouts than the centralized baseline under identical conditions.

Figures

Figures reproduced from arXiv: 2604.22014 by Amin Kashiri (1), Atharva Jamsandekar (1), Boston, USA), Yasin Yaz{\i}c{\i}o\u{g}lu (1) ((1) Northeastern University.

**Figure 1.** Figure 1: Overview of a multi-agent multimodal semantic navigation episode. Two robots (red and green paths) explore view at source ↗

**Figure 2.** Figure 2: Overview of the DM3 -Nav architecture. Each robot operates autonomously with its own perception, memory, and planning modules. Decentralized coordination is achieved by exchanging semantic maps, goal status, and navigation intent through local communications with in-range robots. [10], our approach coordinates through ad-hoc pairwise information exchange. When two robots have the opportunity to communicate… view at source ↗

**Figure 3.** Figure 3: Frontier selection among four robots with local view at source ↗

**Figure 4.** Figure 4: (a) AgileX Scout Mini with an NVIDIA Jetson Orin, view at source ↗

**Figure 5.** Figure 5: Map merging visualization. First two panels show the obstacle maps of Robot 1 and Robot 2 in their respective view at source ↗

read the original abstract

We present DM$^3$-Nav, a fully decentralized multi-agent semantic navigation system supporting multimodal open-vocabulary goal specification and multi-object missions. In our setting, decentralization implies operation without a central coordinator, global map aggregation, or shared global state at runtime. Robots operate autonomously and coordinate through ad-hoc pairwise communication, exchanging local maps, goal status, and navigation intent without synchronization. An implicit task allocation mechanism combining intent broadcasting and distance-weighted frontier selection reduces redundant exploration while preserving decentralized operation. Evaluations on HM3DSem scenes using the HM3Dv0.2 and GOAT-Bench datasets demonstrate that DM$^3$-Nav matches or exceeds centralized and shared-map baselines while eliminating single points of failure inherent in centralized architectures. Finally, we validate our approach in a real-world office environment using two mobile robots, demonstrating successful deployment relying entirely on onboard sensing and computation. A video of our real-world experiments is available online: https://drive.google.com/file/d/1QiUSCn5rIvtuTUqtuXLPgmt6S8x9-MCZ/view?usp=drive_link

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DM³-Nav puts together a clean decentralized architecture for multi-agent open-vocab semantic navigation with implicit allocation, but the abstract gives almost no numbers to judge whether the coordination actually works at scale.

read the letter

Colleague, the main point is that this paper describes a fully decentralized multi-agent system for semantic navigation that handles multimodal open-vocabulary goals and multi-object tasks. Agents share only local maps, goal status, and navigation intent through ad-hoc pairwise links, with no central coordinator or global map, and they use distance-weighted frontier selection plus intent broadcasting for implicit task allocation. The claim is that this matches or beats centralized and shared-map baselines on HM3DSem scenes with HM3Dv0.2 and GOAT-Bench while running on real robots with onboard sensing only. The real-world office demo with two robots is the clearest positive evidence here; it shows the system can be deployed without heavy infrastructure. The combination of full decentralization, open-vocab multimodal goals, multi-object missions, and that specific implicit allocation mechanism is a new packaging even if the pieces draw from prior frontier and semantic navigation work. The engineering choice to avoid synchronization steps keeps the design simple and scalable on paper. The soft spot is the evaluation. The abstract asserts performance parity and successful validation but supplies no success rates, conflict counts, redundant coverage percentages, ablation results, or map-consistency metrics. The stress-test concern about divergent local maps leading to contradictory allocation decisions is not obviously refuted by what is shown; without quantitative checks on how often agents disagree on frontier status or goal progress, it is hard to know whether the implicit scheme stays reliable past two agents or three objects. This is aimed at people building multi-robot exploration or navigation systems who care about removing single points of failure. A reader working on practical decentralized robotics would pick up usable ideas from the architecture and the real-robot run. It is solid enough on novelty and deployment to deserve peer review, though any referee will need to see the missing quantitative breakdowns and failure analysis before the central claims can be taken as settled. I would send it for review with a request for those details.

Referee Report

2 major / 2 minor

Summary. The paper presents DM³-Nav, a fully decentralized multi-agent system for multimodal open-vocabulary semantic navigation and multi-object missions. Agents operate without a central coordinator or shared global state, coordinating solely via ad-hoc pairwise exchanges of local maps, goal status, and navigation intent. Implicit task allocation is achieved through intent broadcasting combined with distance-weighted frontier selection to minimize redundant exploration. The central claim is that this architecture matches or exceeds centralized and shared-map baselines on HM3DSem scenes (using HM3Dv0.2 and GOAT-Bench datasets) while eliminating single points of failure, with additional real-world validation on two mobile robots using only onboard sensing and computation.

Significance. If the coordination claims hold with supporting evidence, the work would be significant for multi-robot systems by demonstrating a practical decentralized alternative for complex semantic tasks, improving scalability and robustness in environments where centralization is undesirable. The inclusion of real-world deployment and open-vocabulary multimodal goals adds practical relevance beyond simulation-only results.

major comments (2)

[Evaluation] Evaluation section: The abstract and evaluation description assert that DM³-Nav matches or exceeds centralized and shared-map baselines with successful real-world validation, but supply no quantitative metrics, error bars, ablation studies, or detailed failure modes for the decentralized components; without these, the support for the central claim cannot be verified.
[Section 3] Section 3 (system architecture and implicit task allocation): The mechanism relies on ad-hoc pairwise communication of local maps and intents without any synchronization or consensus step; no quantitative evidence (e.g., conflict rate, redundant coverage percentage, or map-consistency metric) is reported for >2 agents and >3 objects on the HM3DSem or GOAT-Bench scenes, which is load-bearing for the performance-parity claim with centralized baselines.

minor comments (2)

[Abstract] The video link for real-world experiments is provided but should be accompanied by a brief textual description of the setup and observed behaviors to aid readers without access to the video.
[Method] Notation for multimodal goal specification and frontier selection could be clarified with a small example or pseudocode to improve readability of the implicit allocation logic.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for your thorough review of our manuscript. We value the feedback on strengthening the evaluation and providing more evidence for the decentralized aspects. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The abstract and evaluation description assert that DM³-Nav matches or exceeds centralized and shared-map baselines with successful real-world validation, but supply no quantitative metrics, error bars, ablation studies, or detailed failure modes for the decentralized components; without these, the support for the central claim cannot be verified.

Authors: We agree that additional quantitative details would enhance the verifiability of our claims. The revised manuscript now includes error bars (standard deviations) on all reported success rates and navigation metrics from the HM3DSem and GOAT-Bench evaluations. We have incorporated ablation studies that isolate the contributions of the implicit task allocation mechanism and the decentralized communication protocol. Additionally, we added a failure mode analysis section detailing cases where decentralized operation led to temporary redundant exploration or delayed goal allocation, along with how these were mitigated. These changes provide stronger support for the performance parity with centralized baselines. revision: yes
Referee: [Section 3] Section 3 (system architecture and implicit task allocation): The mechanism relies on ad-hoc pairwise communication of local maps and intents without any synchronization or consensus step; no quantitative evidence (e.g., conflict rate, redundant coverage percentage, or map-consistency metric) is reported for >2 agents and >3 objects on the HM3DSem or GOAT-Bench scenes, which is load-bearing for the performance-parity claim with centralized baselines.

Authors: The design intentionally avoids synchronization and consensus steps to preserve full decentralization, scalability, and robustness against central failures. In the 2-agent scenarios evaluated on HM3DSem scenes and GOAT-Bench (which include multi-object tasks), the comparable or superior performance to centralized and shared-map baselines indicates that the ad-hoc pairwise exchanges and intent-based allocation effectively minimize conflicts and redundancy. In the revision, we have added quantitative metrics for the evaluated 2-agent, multi-object cases, such as average redundant coverage percentage (reported as 12% on average) and map consistency measured by frontier overlap ratios. For scenarios with more than 2 agents, we did not conduct additional experiments beyond the 2-agent setup used in both simulation and real-world validation; thus, we cannot supply those specific metrics. revision: partial

standing simulated objections not resolved

Quantitative metrics for agent counts greater than 2, since our experiments and real-world validation were conducted with 2 agents.

Circularity Check

0 steps flagged

No significant circularity; engineering architecture without derivations or fitted predictions

full rationale

The paper presents DM³-Nav as a procedural decentralized architecture relying on ad-hoc pairwise communication, intent broadcasting, and distance-weighted frontier selection for implicit allocation. No equations, parameter fitting, uniqueness theorems, or self-citations appear in the abstract or described content that reduce any claim to its own inputs by construction. Evaluations compare against external baselines on HM3DSem/HM3Dv0.2 and GOAT-Bench datasets, with real-world validation, making the work self-contained against independent benchmarks rather than tautological. This matches the expected non-circular outcome for most engineering papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard domain assumptions in robotics about local sensing and communication reliability; no free parameters, new entities, or ad-hoc axioms are introduced in the provided abstract.

axioms (1)

domain assumption Ad-hoc pairwise communication without synchronization suffices for coordination in multi-agent navigation tasks
Invoked in the description of decentralized operation and implicit task allocation.

pith-pipeline@v0.9.0 · 5527 in / 1257 out tokens · 45911 ms · 2026-05-08T13:10:42.907103+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 6 canonical work pages

[1]

Object goal navigation using goal-oriented semantic exploration,

D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 4247–4258. [Online]. Available: https://proceedings.neurips.c...

2020
[2]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,

S. Y . Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 23 171–23 181

2023
[3]

Vlfm: Vision- language frontier maps for zero-shot semantic navigation,

N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, “Vlfm: Vision- language frontier maps for zero-shot semantic navigation,” in2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 42–48

2024
[4]

Multion: Benchmarking semantic map memory using multi-object navigation,

S. Wani, S. Patel, U. Jain, A. Chang, and M. Savva, “Multion: Benchmarking semantic map memory using multi-object navigation,” Advances in Neural Information Processing Systems, vol. 33, pp. 9700–9712, 2020

2020
[5]

arXiv preprint arXiv:2305.06178 (2023),https://arxiv.org/abs/2305.061784

N. Gireesh, A. Agrawal, A. Datta, S. Banerjee, M. Sridharan, B. Bhowmick, and M. Krishna, “Sequence-agnostic multi-object nav- igation,”arXiv preprint arXiv:2305.06178, 2023

work page arXiv 2023
[6]

Multi-object navigation using potential target position policy function,

H. Zeng, X. Song, and S. Jiang, “Multi-object navigation using potential target position policy function,”IEEE Transactions on Image Processing, vol. 32, pp. 2608–2619, 2023

2023
[7]

One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation,

F. L. Busch, T. Homberger, J. Ortega-Peimbert, Q. Yang, and O. An- dersson, “One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation,” in2025 IEEE Interna- tional Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 14 835–14 842

2025
[8]

Handle object navi- gation as weighted traveling repairman problem,

R. Liu, X. Xu, S. Yuan, and L. Xie, “Handle object navi- gation as weighted traveling repairman problem,”arXiv preprint arXiv:2503.06937, 2025

work page arXiv 2025
[9]

Goat: Go to any thing.arXiv preprint arXiv:2311.06430, 2023

M. Chang, T. Gervet, M. Khanna, S. Yenamandra, D. Shah, S. Y . Min, K. Shah, C. Paxton, S. Gupta, D. Batra,et al., “Goat: Go to any thing,” arXiv preprint arXiv:2311.06430, 2023

work page arXiv 2023
[10]

Co-navgpt: Multi-robot co- operative visual semantic navigation using large language models.arXiv preprint arXiv:2310.07937, 2023a

B. Yu, H. Kasaei, and M. Cao, “Co-navgpt: Multi-robot cooperative vi- sual semantic navigation using large language models,”arXiv preprint arXiv:2310.07937, 2023

work page arXiv 2023
[11]

Enhancing multi- robot semantic navigation through multimodal chain-of-thought score collaboration,

Z. Shen, H. Luo, K. Chen, F. Lv, and T. Li, “Enhancing multi- robot semantic navigation through multimodal chain-of-thought score collaboration,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 14, 2025, pp. 14 664–14 672

2025
[12]

Frontier-based exploration using multiple robots,

B. Yamauchi, “Frontier-based exploration using multiple robots,” in Proceedings of the second international conference on Autonomous agents, 1998, pp. 47–53

1998
[13]

Racer: Rapid collaborative explo- ration with a decentralized multi-uav system,

B. Zhou, H. Xu, and S. Shen, “Racer: Rapid collaborative explo- ration with a decentralized multi-uav system,”IEEE Transactions on Robotics, vol. 39, no. 3, pp. 1816–1835, 2023

2023
[14]

Efficient 3d exploration with distributed multi-uav teams: Integrating frontier-based and next-best-view plan- ning,

A. Ribeiro and M. Basiri, “Efficient 3d exploration with distributed multi-uav teams: Integrating frontier-based and next-best-view plan- ning,”Drones, vol. 8, no. 11, p. 630, 2024

2024
[15]

Ccm-slam: Robust and efficient centralized collaborative monocular simultaneous localization and mapping for robotic teams,

P. Schmuck and M. Chli, “Ccm-slam: Robust and efficient centralized collaborative monocular simultaneous localization and mapping for robotic teams,”Journal of Field Robotics, vol. 36, no. 4, pp. 763–781, 2019

2019
[16]

Covins: Visual-inertial slam for centralized collaboration,

P. Schmuck, T. Ziegler, M. Karrer, J. Perraudin, and M. Chli, “Covins: Visual-inertial slam for centralized collaboration,”arXiv preprint arXiv:2108.05756, 2021

work page arXiv 2021
[17]

Door-slam: Distributed, online, and outlier resilient slam for robotic teams,

P.-Y . Lajoie, B. Ramtoula, Y . Chang, L. Carlone, and G. Beltrame, “Door-slam: Distributed, online, and outlier resilient slam for robotic teams,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1656– 1663, 2020

2020
[18]

Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems,

Y . Tian, Y . Chang, F. H. Arias, C. Nieto-Granda, J. P. How, and L. Carlone, “Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems,”IEEE transactions on robotics, vol. 38, no. 4, 2022

2022
[19]

Swarm-slam: Sparse decentralized collaborative simultaneous localization and mapping framework for multi-robot systems,

P.-Y . Lajoie and G. Beltrame, “Swarm-slam: Sparse decentralized collaborative simultaneous localization and mapping framework for multi-robot systems,”IEEE Robotics and Automation Letters, vol. 9, no. 1, pp. 475–482, 2023

2023
[20]

A survey on active simultaneous localization and mapping: State of the art and new frontiers,

J. A. Placed, J. Strader, H. Carrillo, N. Atanasov, V . Indelman, L. Carlone, and J. A. Castellanos, “A survey on active simultaneous localization and mapping: State of the art and new frontiers,”IEEE Transactions on Robotics, vol. 39, no. 3, pp. 1686–1705, 2023

2023
[21]

Co- ordinated multi-robot exploration,

W. Burgard, M. Moors, C. Stachniss, and F. E. Schneider, “Co- ordinated multi-robot exploration,”IEEE Transactions on Robotics, vol. 21, no. 3, pp. 376–386, 2005

2005
[22]

Habitat-matterport 3d semantics dataset,

K. Yadav, R. Ramrakhya, S. K. Ramakrishnan, T. Gervet, J. Turner, A. Gokaslan, N. Maestre, A. X. Chang, D. Batra, M. Savva,et al., “Habitat-matterport 3d semantics dataset,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4927–4936

2023
[23]

Habitat challenge 2023,

K. Yadav, J. Krantz, R. Ramrakhya, S. K. Ramakrishnan, J. Yang, A. Wang, J. Turner, A. Gokaslan, V .-P. Berges, R. Mootaghi, O. Maksymets, A. X. Chang, M. Savva, A. Clegg, D. S. Chaplot, and D. Batra, “Habitat challenge 2023,” https://aihabitat.org/challenge/ 2023/, 2023

2023
[24]

Goat-bench: A benchmark for multi-modal lifelong navigation,

M. Khanna, R. Ramrakhya, G. Chhablani, S. Yenamandra, T. Gervet, M. Chang, Z. Kira, D. S. Chaplot, D. Batra, and R. Mottaghi, “Goat-bench: A benchmark for multi-modal lifelong navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16 373–16 383

2024
[25]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

2021
[26]

Su- perglue: Learning feature matching with graph neural networks,

P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Su- perglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947

2020
[27]

Fast and accurate map merging for multi-robot systems,

S. Carpin, “Fast and accurate map merging for multi-robot systems,” Autonomous Robots, vol. 25, no. 3, pp. 305–316, 2008

2008
[28]

A method for registration of 3-d shapes,

P. J. Besl and N. D. McKay, “A method for registration of 3-d shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, 1992

1992
[29]

Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,

M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,”Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981

1981
[30]

MLESAC: A new robust estimator with application to estimating image geometry,

P. H. S. Torr and A. Zisserman, “MLESAC: A new robust estimator with application to estimating image geometry,”Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138–156, 2000

2000
[31]

Feature-based occupancy map-merging for collaborative SLAM,

S. Sunil, S. Mozaffari, R. Singh, B. Shahrrava, and S. Alirezaee, “Feature-based occupancy map-merging for collaborative SLAM,” Sensors, vol. 23, no. 6, p. 3114, 2023

2023
[32]

AKAZE feature-based map merging for multi-robot SLAM with unknown initial pose,

L. Zhang, C. Jiao, J. Huang, and X. Su, “AKAZE feature-based map merging for multi-robot SLAM with unknown initial pose,” in2022 34th Chinese Control and Decision Conference (CCDC), 2022, pp. 5637–5642

2022
[33]

Sold!: Auction methods for multi- robot coordination,

B. P. Gerkey and M. J. Matari ´c, “Sold!: Auction methods for multi- robot coordination,”IEEE Transactions on Robotics and Automation, vol. 18, no. 5, pp. 758–768, 2002

2002
[34]

A distributed version of the Hungarian method for multirobot assignment,

S. Chopra, G. Notarstefano, M. Rice, and M. Goldberg, “A distributed version of the Hungarian method for multirobot assignment,”The International Journal of Robotics Research, vol. 36, no. 10, pp. 1035– 1050, 2017

2017
[35]

Comparison of task-allocation algorithms in frontier-based multi-robot exploration,

J. Faigl and M. Kulich, “Comparison of task-allocation algorithms in frontier-based multi-robot exploration,” inEuropean Conference on Mobile Robots, 2014

2014
[36]

Habitat: A Platform for Embodied AI Research,

M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A Platform for Embodied AI Research,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

2019
[37]

Habitat 2.0: Training home assistants to rearrange their habitat,

A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y . Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V . V ondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V . Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” inAdvances in Neural Information Processing Sys...

2021
[38]

Mask r-cnn,

K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969

2017
[39]

Detecting twenty-thousand classes using image-level supervision,

X. Zhou, R. Girdhar, A. Joulin, P. Kr ¨ahenb¨uhl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” inECCV, 2022

2022
[40]

Generalized travelling salesman problem through n sets of nodes: an integer programming approach,

G. Laporte and Y . Nobert, “Generalized travelling salesman problem through n sets of nodes: an integer programming approach,”INFOR: Information Systems and Operational Research, vol. 21, no. 1, pp. 61–75, 1983

1983
[41]

Gurobi Optimizer Reference Manual,

Gurobi Optimization, LLC, “Gurobi Optimizer Reference Manual,”
[42]

Available: https://www.gurobi.com

[Online]. Available: https://www.gurobi.com
[43]

Direct lidar-inertial odometry: Lightweight lio with continuous-time motion correction,

K. Chen, R. Nemiroff, and B. T. Lopez, “Direct lidar-inertial odometry: Lightweight lio with continuous-time motion correction,” 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 3983–3989, 2022. [Online]. Available: https://api. semanticscholar.org/CorpusID:252355409

2023
[44]

Yolov10: Real-time end-to-end object detection

A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, and G. Ding, “Yolov10: Real-time end-to-end object detection,”arXiv preprint arXiv:2405.14458, 2024

work page arXiv 2024
[45]

Yolo- world: Real-time open-vocabulary object detection,

T. Cheng, L. Song, Y . Ge, W. Liu, X. Wang, and Y . Shan, “Yolo- world: Real-time open-vocabulary object detection,” inProc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2024. APPENDIXI MODIFICATIONS TOSINGLE-AGENTARCHITECTURE Our single-agent architecture follows the GOAT frame- work [9] with several implementation differences. Table I show...

2024