RoboAtlas: Contextual Active SLAM

Abraham P. Vinod; Alexander Schperberg; M. K. Jawed; Shivam K. Panda; Stefano Di Cairano

arxiv: 2606.26046 · v1 · pith:VXD2ZB3Dnew · submitted 2026-06-24 · 💻 cs.RO · cs.CV

Alexander Schperberg, Shivam K. Panda, Abraham P. Vinod, M. K. Jawed, Stefano Di Cairano

Pith reviewed 2026-06-25 19:10 UTC · model grok-4.3

keywords: active SLAM · semantic mapping · contextual bandit · vision-language models · robot exploration · frontier navigation · 3D scene understanding

The pith

RoboAtlas uses a contextual bandit to balance exploration and semantic reasoning in active SLAM, reaching a 90.6 percent success rate on the GOAT-Bench Val Unseen benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that integrates geometric exploration with semantic understanding for robot navigation in unknown environments. Its decision mechanism, a contextual multi-armed bandit, starts with broad frontier search and shifts to targeted movement as the semantic map fills in. The authors report that it works in real spaces exceeding 1800 square meters and sets a new best success rate on GOAT-Bench Val Unseen. A sympathetic reader would care because it suggests that a good map can make foundation models more effective for robots without requiring the largest models.

Core claim

RoboAtlas integrates frontier exploration, global semantic-map reasoning, and egocentric vision-language model reasoning through a contextual multi-armed bandit that transitions from exploration to semantically guided navigation as scene understanding improves. It achieves a 100 percent task success rate in real-world environments exceeding 1800 square meters with around 30,000 mapped instances. On the GOAT-Bench Val Unseen benchmark it reports state-of-the-art performance, with a 90.6 percent success rate using GPT-4o, an improvement of 17.8 percentage points over the strongest prior baseline. Using the much smaller Qwen2.5-VL-7B model, it still achieves an 88.8 percent success rate.
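Frontier exploration, the first of the three experts named above, is usually built on a simple test: a free cell that borders unknown space is a frontier. The sketch below illustrates that test on a small 2D occupancy grid. The cell encoding (`FREE`, `OCCUPIED`, `UNKNOWN`) and the function name `frontier_cells` are assumptions made for illustration; the paper works with 3D maps and its own frontier expert is not described on this page.

```python
# Hypothetical cell encoding for a 2D occupancy grid (not from the paper).
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def frontier_cells(grid):
    """Return (row, col) of every free cell that touches an unknown cell."""
    rows, cols = len(grid), len(grid[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            # 4-connected neighbourhood: down, up, right, left.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] == UNKNOWN:
                    found.append((r, c))
                    break
    return found
```

A frontier expert would then pick one of these cells (for example the nearest) as its proposed navigation goal.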

What carries the argument

The contextual multi-armed bandit that adaptively balances geometric exploration and semantic reasoning using scalable 3D semantic mapping.
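To make the mechanism concrete, here is a minimal sketch of a contextual bandit choosing among the three experts. It is not the paper's formulation, which this page does not describe: it assumes a tabular variant that runs one UCB1 learner per discretized context, a single hypothetical context feature `coverage` in [0, 1] standing in for scene understanding, and a toy reward model in which frontier exploration pays off early and the semantic map pays off late.

```python
import math
import random
from collections import defaultdict

ARMS = ("frontier", "semantic_map", "ego_vlm")  # expert names from the paper


class BucketedUCB:
    """Tabular contextual bandit: one UCB1 learner per context bucket."""

    def __init__(self, arms=ARMS, n_buckets=4, c=1.0):
        self.arms, self.n_buckets, self.c = arms, n_buckets, c
        self.counts = defaultdict(int)   # (bucket, arm) -> number of pulls
        self.means = defaultdict(float)  # (bucket, arm) -> running mean reward

    def bucket(self, coverage):
        # `coverage` is a hypothetical scalar summary of map completeness.
        return min(int(coverage * self.n_buckets), self.n_buckets - 1)

    def select(self, coverage):
        b = self.bucket(coverage)
        for arm in self.arms:  # try every arm once per bucket first
            if self.counts[(b, arm)] == 0:
                return arm
        total = sum(self.counts[(b, a)] for a in self.arms)

        def ucb(arm):
            n = self.counts[(b, arm)]
            return self.means[(b, arm)] + self.c * math.sqrt(2 * math.log(total) / n)

        return max(self.arms, key=ucb)

    def update(self, coverage, arm, reward):
        key = (self.bucket(coverage), arm)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


def toy_reward(coverage, arm, rng):
    # Assumed success probabilities, chosen only to illustrate the transition.
    p = {"frontier": 0.8 - 0.6 * coverage,
         "semantic_map": 0.2 + 0.7 * coverage,
         "ego_vlm": 0.4}[arm]
    return 1.0 if rng.random() < p else 0.0


def run(steps=6000, seed=0):
    rng = random.Random(seed)
    bandit = BucketedUCB()
    for _ in range(steps):
        coverage = rng.random()
        arm = bandit.select(coverage)
        bandit.update(coverage, arm, toy_reward(coverage, arm, rng))
    return bandit


def preferred_arm(bandit, coverage):
    b = bandit.bucket(coverage)
    return max(bandit.arms, key=lambda a: bandit.counts[(b, a)])
```

Under these toy assumptions the learner ends up favoring the frontier expert when coverage is low and the semantic-map expert when coverage is high, which is the qualitative shift the paper attributes to its bandit.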

If this is right

The system achieves full task success in real robot deployments across very large indoor spaces.
On GOAT-Bench Val Unseen, the success rate exceeds the strongest prior baseline by 17.8 percentage points.
Smaller vision-language models can surpass larger ones when supported by detailed semantic maps.
Grounding vision-language models in large-scale 3D maps supports more robust active SLAM.
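The reported numbers imply a baseline figure that the page does not state outright. Assuming the 17.8-point gain is measured against the same baseline in both comparisons, the arithmetic is:

```latex
\begin{align*}
\mathrm{SR}_{\text{baseline}} &= 90.6\% - 17.8\ \text{pp} = 72.8\% \\
\mathrm{SR}_{\text{Qwen2.5-VL-7B}} - \mathrm{SR}_{\text{baseline}} &= 88.8\% - 72.8\% = 16.0\ \text{pp}
\end{align*}
```

So most of the reported gain survives the switch to the smaller model, which is the basis for the third point above.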

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving mapping accuracy could allow even lighter models to handle complex navigation tasks.
The method might apply to other tasks where robots need to reason about objects in unseen spaces.
Errors in object labeling could cause the system to choose poor exploration paths.

Load-bearing premise

The 3D semantic mapping must produce labels accurate and complete enough to support reliable reasoning for navigation choices.

What would settle it

A controlled test where semantic labels are randomly perturbed or removed, checking whether the reported success rates drop significantly below the baselines.
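A sketch of how such a perturbation could be set up is shown below. It assumes the scene dictionary can be reduced to a mapping from instance id to label, which is a simplification made here and not the OpenRoboVox data structure; `perturb_labels`, `drop_p`, and `swap_p` are hypothetical names.

```python
import random


def perturb_labels(scene, vocab, drop_p=0.0, swap_p=0.0, seed=0):
    """Return a copy of `scene` with labels randomly dropped or swapped.

    scene  : dict of instance id -> label (assumed, simplified structure)
    vocab  : list of labels to draw wrong labels from
    drop_p : probability of removing an instance (missed detection)
    swap_p : probability of replacing its label with a different one
    """
    rng = random.Random(seed)
    out = {}
    for inst_id, label in scene.items():
        r = rng.random()
        if r < drop_p:
            continue  # instance removed from the map
        if r < drop_p + swap_p:
            others = [v for v in vocab if v != label]
            out[inst_id] = rng.choice(others) if others else label
        else:
            out[inst_id] = label
    return out
```

Sweeping `drop_p` and `swap_p` upward and re-running the benchmark at each level would show how quickly the success rate degrades toward the baselines.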

Figures

Figures reproduced from arXiv: 2606.26046 by Abraham P. Vinod, Alexander Schperberg, M. K. Jawed, Shivam K. Panda, Stefano Di Cairano.

**Figure 1.** RoboAtlas. RoboAtlas combines frontier exploration, semantic map reasoning, and egocentric VLM reasoning within a contextual multi-armed bandit framework. It receives the environment state through our real-time 3D semantic mapping framework, called OpenRoboVox. The system dynamically switches between geometric exploration and semantic navigation as map understanding improves.

**Figure 2.** RoboAtlas overall framework. Top: OpenRoboVox performs real-time 3D semantic mapping and scene-dictionary construction from RGB-D observations. Bottom: a contextual multi-armed bandit selects among frontier exploration, semantic map, and egocentric VLM experts to generate navigation goals.

**Figure 3.** OpenRoboVox Hardware Validation. Top row shows the 3D occupancy grid and the OpenRoboVox framework, including the RGB and Depth camera streams, semantic segmentation, and corresponding semantic voxels. The bottom row shows the 3D occupancy, overlaid by the captions of the scene dictionary, for two different floors of an office building (left column is for floor 1 and right column for floor 2).

**Figure 4.** Input system prompts. System prompts used for the semantic map and ego-centric VLM experts for validation experiments.

**Figure 5.** Overall Hardware System Flowchart. The user provides a high-level language directive, which is processed by RoboAtlas and a foundation model running on an external desktop GPU. RGB-D observations and robot pose estimates are streamed from the Unitree Go2 platform through an internal Jetson AGX Orin to the desktop, where semantic mapping, contextual reasoning, and goal selection are performed. RoboAtlas gen…

**Figure 6.** We report Success Rate (SR) and Success weighted by …

**Figure 6.** Contextual Multi-Arm Bandit Validation. Top: Path results based on using only one expert for finding a large can on the glass table in (A)-(C) or green plant next to the tv in (D). These experts include the frontier exploration expert (blue), semantic map expert (purple), or ego-centric VLM expert (green). Bottom: summary statistics over 15 trials for each setting.

**Figure 7.** RoboAtlas Demonstration. Top row: (1) Ego-centric VLM text output, (2) scene dictionary, (3) Octomap visualization with overlaid scene-dictionary captions, (4) OpenRoboVox semantic visualization. Bottom row: (5) constraint map (blue indicates objects to avoid), (6) Ego-centric VLM goal expert, (7) semantic map expert, and (8) frontier exploration expert (red X indicates the proposed goal position).

**Figure 8.** RoboAtlas. Hardware Validation. (a) Photo-realistic Habitat simulator validation. Here, we visualize 3 out of 36 val-unseen scenes validated in this study. Habitat panel prompts: “Find and navigate to the tree located near the lamp and display cabinet”; “Find and navigate to the dresser located below the mirror in the room”; “Find and navigate to a refrigerator”. Isaac Sim and Real panel prompts: “Find a large can on the glass table”; “Find th…

**Figure 9.** Cross-Domain Validation. Red rectangles represent the target object. If a neighbor object is specified, it is shown as a yellow rectangle.

Original abstract

We present RoboAtlas, a contextual Active SLAM framework that adaptively balances geometric exploration and semantic reasoning using a scalable 3D semantic mapping system, OpenRoboVox. RoboAtlas integrates frontier exploration, global semantic-map reasoning, and egocentric VLM-based reasoning through a contextual multi-armed bandit that transitions from exploration to semantically guided navigation as scene understanding improves. We evaluate the system in simulation and on a Unitree Go2 robot in large-scale real-world environments exceeding 1800 m² with approx. 30k mapped semantic instances, achieving a 100% task success rate. On the GOAT-Bench "Val Unseen" benchmark, RoboAtlas achieves state-of-the-art performance with highest reported success rate (SR) of 90.6%, using GPT-4o, improving over the strongest prior baseline by 17.8 percentage points in SR. Using the much smaller Qwen2.5-VL-7B model, it still achieves 88.8% SR, outperforming all baselines using GPT-4o in SR, and revealing the importance of the information gained by our semantic mapping framework over simply replacing the underlying foundation model. The results demonstrate that grounding foundation models with large-scale 3D semantic maps enables robust and efficient contextual Active SLAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RoboAtlas integrates frontier exploration with global semantic maps and egocentric VLM via a contextual bandit, posting 90.6% SR on GOAT-Bench and 100% real-world success, but the methods lack ablations and error analysis.

The main takeaway is that this paper gives a working system for active SLAM that uses a contextual multi-armed bandit to move from pure geometric frontier exploration to semantically guided navigation once the map improves. It builds large 3D semantic maps with OpenRoboVox (~30k instances) and combines global map reasoning with local VLM views. On GOAT-Bench Val Unseen it hits 90.6% success with GPT-4o and 88.8% with Qwen2.5-VL-7B, beating prior GPT-4o baselines, and it runs at 100% success on a Unitree Go2 in real spaces over 1800 m².

The integration itself looks new: the bandit explicitly transitions behavior based on improving scene understanding rather than running fixed policies. The result that the smaller model still wins over GPT-4o baselines is useful evidence that the semantic mapping layer adds real value beyond model size.

The paper does a reasonable job on the empirical side by including both simulation benchmarks and hardware tests in large indoor settings. That combination is worth having for robotics work.

The soft spots are in the evaluation details. The reported success rates come without error bars, statistical significance, or ablation studies on the bandit, the mapping quality, or the individual reasoning modules. There is also no description of how the bandit was tuned or how sensitive the results are to OpenRoboVox label accuracy. These gaps make it hard to isolate exactly what drives the gains.

This paper is for people working on active SLAM or semantic navigation in indoor robotics. A reader who needs concrete numbers on combining classical exploration with foundation models will find usable ideas here.

It deserves peer review because the benchmark lift and real-world test are substantial enough to warrant detailed feedback on the missing analysis.

Referee Report

3 major / 0 minor

Summary. The manuscript presents RoboAtlas, a contextual Active SLAM framework that integrates frontier exploration, global semantic-map reasoning via the OpenRoboVox 3D mapping system, and egocentric VLM reasoning through a contextual multi-armed bandit policy. The policy adaptively shifts from geometric exploration to semantically guided navigation as scene understanding improves. On the GOAT-Bench 'Val Unseen' benchmark the system reports state-of-the-art success rates of 90.6% (GPT-4o) and 88.8% (Qwen2.5-VL-7B), together with 100% task success in real-world trials on >1800 m² environments containing ~30k mapped semantic instances.

Significance. If the reported performance gains prove robust, the work provides concrete evidence that grounding VLMs with large-scale 3D semantic maps can yield substantial improvements in active SLAM, allowing smaller models to surpass larger ones and highlighting the value of scalable semantic mapping over raw model scale.

major comments (3)

[Results / Experiments] Results section: the headline success rates (90.6% and 88.8% SR) are reported without error bars, number of evaluation episodes, or any statistical significance tests, so the claimed 17.8 pp improvement cannot be assessed for reliability.
[Methods / Experiments] Methods / Experiments: no ablation studies isolate the contribution of OpenRoboVox mapping, the contextual bandit, or the VLM component, leaving the attribution of performance gains to the proposed framework unverified.
[Methods] Implementation details: the tuning procedure, hyper-parameters, and exploration-to-exploitation schedule of the contextual multi-armed bandit are not described, which is load-bearing for reproducing the reported benchmark numbers.
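The first comment can be made concrete with a confidence interval on the success rate. The sketch below computes a Wilson score interval for a binomial proportion. Because the number of evaluation episodes is not reported, the episode counts used in the usage note are hypothetical and only show how the interval width depends on them.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For example, `wilson_interval(906, 1000)` and `wilson_interval(91, 100)` both describe a success rate near 90.6%, but the second interval is several times wider. Whether a 17.8-point gap is reliable therefore depends on an episode count the manuscript would need to state.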

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and reproducibility.

Referee: [Results / Experiments] Results section: the headline success rates (90.6% and 88.8% SR) are reported without error bars, number of evaluation episodes, or any statistical significance tests, so the claimed 17.8 pp improvement cannot be assessed for reliability.

Authors: We agree that error bars, the exact number of evaluation episodes, and statistical significance tests are necessary to assess reliability. In the revised manuscript we will report the number of GOAT-Bench Val Unseen episodes, include standard-error bars on all success-rate figures, and add appropriate statistical tests comparing RoboAtlas against the strongest baseline. revision: yes

Referee: [Methods / Experiments] Methods / Experiments: no ablation studies isolate the contribution of OpenRoboVox mapping, the contextual bandit, or the VLM component, leaving the attribution of performance gains to the proposed framework unverified.

Authors: We acknowledge that ablation studies are required to attribute gains to individual components. The revised version will include a dedicated ablation section with controlled variants that disable OpenRoboVox, replace the contextual bandit with a fixed policy, and swap the VLM while keeping the mapping framework fixed. revision: yes

Referee: [Methods] Implementation details: the tuning procedure, hyper-parameters, and exploration-to-exploitation schedule of the contextual multi-armed bandit are not described, which is load-bearing for reproducing the reported benchmark numbers.

Authors: We agree that these details are essential for reproducibility. The revised manuscript will add a subsection detailing the bandit formulation, all hyper-parameters, the tuning procedure (including any cross-validation on a held-out set), and the precise schedule governing the shift from exploration to exploitation. revision: yes
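The ablation promised in the second response amounts to a one-factor-at-a-time design: start from the full system and change exactly one component per variant. A minimal sketch of that grid follows. The `Variant` class, its field names, and the variant names are hypothetical; only the three factors themselves come from the rebuttal.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Variant:
    """One experimental configuration (field names are illustrative)."""
    name: str
    use_semantic_map: bool = True      # OpenRoboVox mapping enabled
    policy: str = "contextual_bandit"  # or "fixed"
    vlm: str = "GPT-4o"                # or "Qwen2.5-VL-7B"


def ablation_variants():
    """Full system plus one variant per ablated factor."""
    full = Variant(name="full")
    return [
        full,
        replace(full, name="no_semantic_map", use_semantic_map=False),
        replace(full, name="fixed_policy", policy="fixed"),
        replace(full, name="small_vlm", vlm="Qwen2.5-VL-7B"),
    ]
```

Running every variant on the same episodes and seeds would let any change in success rate be attributed to the single factor that differs from the full system.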

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical robotic framework evaluated via benchmark success rates (e.g., 90.6% SR on GOAT-Bench Val Unseen) and real-world trials. No equations, derivations, or parameter-fitting steps are presented that would reduce reported outcomes to inputs by construction. The central claims rest on measured performance of the integrated system (OpenRoboVox mapping + contextual bandit), which is externally falsifiable on the stated benchmarks and does not rely on self-citation chains or self-definitional premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, no explicit parameters, axioms, or invented physical entities are described. OpenRoboVox and RoboAtlas are system names rather than new postulated entities with independent evidence.

Reference graph

Works this paper leans on

63 extracted references · 11 canonical work pages · 5 internal anchors

[1]

A survey on active simultaneous localization and mapping: State of the art and new frontiers,

J. A. Placed, J. Strader, H. Carrillo, N. Atanasov, V . Indelman, L. Car- lone, and J. A. Castellanos, “A survey on active simultaneous localization and mapping: State of the art and new frontiers,”IEEE Transactions on Robotics, vol. 39, no. 3, pp. 1686–1705, 2023

2023
[2]

Saber: Data-driven motion planner for autonomously navigating heterogeneous robots,

A. Schperberg, S. Tsuei, S. Soatto, and D. Hong, “Saber: Data-driven motion planner for autonomously navigating heterogeneous robots,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8086–8093, 2021

2021
[3]

Algorithms for routing of unmanned aerial vehicles with mobile recharging stations,

K. Yu, A. K. Budhiraja, and P. Tokekar, “Algorithms for routing of unmanned aerial vehicles with mobile recharging stations,” in2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5720–5725, 2018

2018
[4]

Frontier-based exploration for multi-robot rendezvous in communication-restricted un- known environments,

M. Tellaroli, M. Luperto, M. Antonazzi, and N. Basilico, “Frontier-based exploration for multi-robot rendezvous in communication-restricted un- known environments,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5807–5812, 2024

2024
[5]

How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers,

J. Chen, G. Li, S. Kumar, B. Ghanem, and F. Yu, “How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers,” 2023

2023
[6]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,

S. Y . Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,”CVPR, 2023

2023
[7]

Vlfm: Vision- language frontier maps for zero-shot semantic navigation,

N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, “Vlfm: Vision- language frontier maps for zero-shot semantic navigation,” in2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 42– 48, 2024

2024
[8]

Energy- constrained multi-robot exploration for autonomous map building,

S. H. Karumanchi, B. Rokaha, A. Schperberg, and A. P. Vinod, “Energy- constrained multi-robot exploration for autonomous map building,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9154–9161, 2025

2025
[9]

A practical, decision-theoretic approach to multi-robot mapping and exploration,

J. Ko, B. Stewart, D. Fox, K. Konolige, and B. Limketkai, “A practical, decision-theoretic approach to multi-robot mapping and exploration,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3232–3238, 2003

2003
[10]

Coordinated multi-robot exploration using a segmentation of the environment,

K. M. Wurm, C. Stachniss, and W. Burgard, “Coordinated multi-robot exploration using a segmentation of the environment,” inProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1160–1165, 2008

2008
[11]

Multi-robot coordination for energy-efficient exploration,

A. Benkrid, A. Benallegue, and N. Achour, “Multi-robot coordination for energy-efficient exploration,”Journal of Control, Automation and Electrical Systems, vol. 30, no. 6, pp. 911–920, 2019

2019
[12]

Coordinated multi-robot exploration,

W. Burgard, M. Moors, C. Stachniss, and F. E. Schneider, “Coordinated multi-robot exploration,”IEEE Transactions on Robotics, vol. 21, no. 3, pp. 376–386, 2005

2005
[13]

Decentralized coordination for multirobot exploration,

B. Yamauchi, “Decentralized coordination for multirobot exploration,” Robotics and Autonomous Systems, vol. 29, no. 2-3, pp. 111–118, 1999

1999
[14]

Namo-llm: Efficient navigation among movable obstacles with large language model guidance,

Y . Zhang and Y . Kantaros, “Namo-llm: Efficient navigation among movable obstacles with large language model guidance,”IEEE Robotics and Automation Letters, vol. 10, no. 12, pp. 13026–13033, 2025

2025
[15]

Can an embodied agent find your “cat-shaped mug

V . S. Dorbala, J. F. Mullen, and D. Manocha, “Can an embodied agent find your “cat-shaped mug”? llm-based zero-shot object navigation,” IEEE Robotics and Automation Letters, vol. 9, p. 4083–4090, May 2024

2024
[16]

Esc: Exploration with soft commonsense constraints for zero- shot object navigation,

K. Zhou, K. Zheng, C. Pryor, Y . Shen, H. Jin, L. Getoor, and X. E. Wang, “Esc: Exploration with soft commonsense constraints for zero- shot object navigation,” 2023

2023
[17]

Visual language maps for robot navigation,

C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” 2023

2023
[18]

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,

D. Shah, B. Osinski, B. Ichter, and S. Levine, “Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,” 2022

2022
[19]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” 2022

2022
[20]

Inner monologue: Embodied reasoning through planning with language models,

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, P. Sermanet, N. Brown, T. Jack- son, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” 2022

2022
[21]

Stairway to success: An online floor-aware zero-shot object- goal navigation framework via llm-driven coarse-to-fine exploration,

Z. Gong, R. Li, T. Hu, R. Qiu, L. Kong, L. Zhang, G. Zhao, Y . Ding, and J. Liang, “Stairway to success: An online floor-aware zero-shot object- goal navigation framework via llm-driven coarse-to-fine exploration,” IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2943–2950, 2026

2026
[22]

Dynavlm: Zero-shot vision-language naviga- tion system with dynamic viewpoints and self-refining graph memory,

Z. Ji, H. Lin, and Y . Gao, “Dynavlm: Zero-shot vision-language naviga- tion system with dynamic viewpoints and self-refining graph memory,” 2025

2025
[23]

Handle object navigation as weighted traveling repairman problem,

R. Liu, X. Xu, S. Yuan, and L. Xie, “Handle object navigation as weighted traveling repairman problem,” 2025

2025
[24]

Orionnav: Online plan- ning for robot autonomy with context-aware llm and open-vocabulary semantic scene graphs,

V . N. Devarakonda, R. G. Goswami, A. U. Kaypak, N. Patel, R. Khor- rambakht, P. Krishnamurthy, and F. Khorrami, “Orionnav: Online plan- ning for robot autonomy with context-aware llm and open-vocabulary semantic scene graphs,” 2024

2024
[25]

3p-llm: Probabilistic path planning using large language model for autonomous robot navigation,

E. Latif, “3p-llm: Probabilistic path planning using large language model for autonomous robot navigation,” 2024

2024
[26]

Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation,

X. Huang, S. Zhao, Y . Wang, X. Lu, W. Zhang, R. Qu, W. Li, Y . Wang, and C. Wen, “Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation,”arXiv preprint arXiv:2511.10376, 2025

work page arXiv 2025
[27]

Semanticfu- sion: Dense 3d semantic mapping with convolutional neural networks,

J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “Semanticfu- sion: Dense 3d semantic mapping with convolutional neural networks,” 2016

2016
[28]

Maskfusion: Real-time recogni- tion, tracking and reconstruction of multiple moving objects,

M. R ¨unz, M. Buffier, and L. Agapito, “Maskfusion: Real-time recogni- tion, tracking and reconstruction of multiple moving objects,” 2018

2018
[29]

Fusion++: V olumetric object-level slam,

J. McCormac, R. Clark, M. Bloesch, A. J. Davison, and S. Leutenegger, “Fusion++: V olumetric object-level slam,” 2018

2018
[30]

Panopticfusion: Online volumetric semantic mapping at the level of stuff and things,

G. Narita, T. Seno, T. Ishikawa, and Y . Kaji, “Panopticfusion: Online volumetric semantic mapping at the level of stuff and things,” 2019

2019
[31]

Pocd: Probabilistic object-level change detection and volu- metric mapping in semi-static scenes,

J. Qian, V . Chatrath, J. Yang, J. Servos, A. P. Schoellig, and S. L. Waslander, “Pocd: Probabilistic object-level change detection and volu- metric mapping in semi-static scenes,” 2022

2022
[32]

Pov-slam: Probabilistic object-aware variational slam in semi-static environments,

J. Qian, V . Chatrath, J. Servos, A. Mavrinac, W. Burgard, S. L. Waslander, and A. P. Schoellig, “Pov-slam: Probabilistic object-aware variational slam in semi-static environments,” 2023

2023
[33]

Concept- graphs: Open-vocabulary 3d scene graphs for perception and planning,

Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa,et al., “Concept- graphs: Open-vocabulary 3d scene graphs for perception and planning,” in2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 5021–5028, IEEE, 2024

2024
[34]

Openvox: Real-time instance-level open-vocabulary probabilistic voxel representation,

Y . Deng, B. Yao, Y . Tang, Y . Yang, and Y . Yue, “Openvox: Real-time instance-level open-vocabulary probabilistic voxel representation,” 2025

2025
[35]

One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation,

F. L. Busch, T. Homberger, J. Ortega-Peimbert, Q. Yang, and O. Ander- sson, “One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation,”arXiv preprint arXiv:2409.11764, 2024

work page arXiv 2024
[36]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,”arXiv preprint arXiv:2303.05499, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Yolo- world: Real-time open-vocabulary object detection,

T. Cheng, L. Song, Y . Ge, W. Liu, X. Wang, and Y . Shan, “Yolo- world: Real-time open-vocabulary object detection,”arXiv preprint arXiv:2401.17270, 2024

work page arXiv 2024
[38]

Learning Transferable Visual Models From Natural Language Supervision

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,”arXiv preprint arXiv:2103.00020, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[39]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,”arXiv preprint arXiv:2301.12597, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll´ar, and R. Girshick, “Segment anything,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

2023
[41]

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

C. Zhang, D. Han, Y . Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong, “Faster segment anything: Towards lightweight sam for mobile applications,”arXiv preprint arXiv:2306.14289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,

Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” inNeurIPS, 2023

2023
[43]

Embodied question answering,

A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied question answering,” 2017

2017
[44]

Behavior- 1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation,

C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Mart ´ın-Mart´ın, C. Wang, G. Levine, W. Ai, B. Martinez, H. Yin, M. Lingelbach, M. Hwang, A. Hiranaka, S. Garlanka, A. Aydin, S. Lee, J. Sun, M. Anvari, M. Sharma, D. Bansal, S. Hunter, K.-Y . Kim, A. Lou, C. R. Matthews, I. Villa-Renteria, J. H. Tang, C. Tang, F. Xia, Y . Li, S. Savarese, H. Gweon, ...

2024
[45]

Explore until confident: Efficient exploration for embodied question answering,

A. Z. Ren, J. Clark, A. Dixit, M. Itkina, A. Majumdar, and D. Sadigh, “Explore until confident: Efficient exploration for embodied question answering,” 2024

2024
[46]

Alfred: A benchmark for interpreting grounded instructions for everyday tasks,

M. Shridhar, J. Thomason, D. Gordon, Y . Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “Alfred: A benchmark for interpreting grounded instructions for everyday tasks,” 2020

2020
[47]

Habitat 2.0: Training home assistants to rearrange their habitat,

A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y . Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V . V ondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V . Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” 2022

2022
[48]

Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation,

Z. Zhu, X. Wang, Y . Li, Z. Zhang, X. Ma, Y . Chen, B. Jia, W. Liang, Q. Yu, Z. Deng, S. Huang, and Q. Li, “Move to understand a 3d scene: Bridging visual grounding and exploration for efficient and versatile embodied navigation,”International Conference on Computer Vision (ICCV), 2025

2025
[49]

3d- mem: 3d scene memory for embodied exploration and reasoning,

Y . Yang, H. Yang, J. Zhou, P. Chen, H. Zhang, Y . Du, and C. Gan, “3d- mem: 3d scene memory for embodied exploration and reasoning,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 17294–17303, June 2025

2025
[50]

Reexplore: Improv- ing mllms for embodied exploration with contextualized retrospective experience replay,

G. Zhang, M. Ding, J. Wu, R. Liao, and V . Tresp, “Reexplore: Improv- ing mllms for embodied exploration with contextualized retrospective experience replay,”arXiv preprint arXiv:2511.19033, 2025

work page arXiv 2025
[51]

Explore with long-term memory: A benchmark and multimodal llm- based reinforcement learning framework for embodied exploration,

S. Wang, B. Liu, Z. Gao, L. Ma, X. Wang, Y . Xie, and X. Tan, “Explore with long-term memory: A benchmark and multimodal llm- based reinforcement learning framework for embodied exploration,” in Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2026

2026
[52]

Himm: Human-inspired long-term memory modeling for embodied exploration and question answering,

J. Li, B. Wang, J. Xia, M. Li, and S. Hu, “Himm: Human-inspired long-term memory modeling for embodied exploration and question answering,” 2026

2026
[53]

OctoMap: An efficient probabilistic 3D mapping framework based on octrees,

A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “OctoMap: An efficient probabilistic 3D mapping framework based on octrees,”Autonomous Robots, 2013. Software available at https: //octomap.github.io

2013
[54]

S. Macenski and I. Jambrecic, “Slam toolbox: Slam for the dynamic world,” Journal of Open Source Software, vol. 6, no. 61, p. 2783, 2021
[55]

L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pp. 661–670, ACM, Apr. 2010
[56]

T. Pan, L. Tang, X. Wang, and S. Shan, “Tokenize anything via prompting,” in European Conference on Computer Vision, pp. 330–348, Springer, 2024
[57]

NVIDIA, “Isaac Sim.”
[58]

M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., “Habitat: A platform for embodied ai research,” arXiv preprint arXiv:1904.01201, 2019
[59]

S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra, “Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,” arXiv preprint arXiv:2109.08238, 2021
[60]

S. Macenski, T. Moore, D. Lu, A. Merzlyakov, and M. Ferguson, “From the desks of ROS maintainers: A survey of modern and capable mobile robotics algorithms in the robot operating system 2,” Robotics and Autonomous Systems, vol. 168, p. 104493, 2023
[61]

S. Macenski, M. Booker, and J. Wallace, “Open-source, cost-aware kinematically feasible planning for mobile and surface robotics,” arXiv, 2024
[62]

M. Khanna*, R. Ramrakhya*, G. Chhablani, S. Yenamandra, T. Gervet, M. Chang, Z. Kira, D. S. Chaplot, D. Batra, and R. Mottaghi, “Goat-bench: A benchmark for multi-modal lifelong navigation,” in CVPR, 2024
[63]

A. Z. Ren, J. Clark, A. Dixit, M. Itkina, A. Majumdar, and D. Sadigh, “Explore until confident: Efficient exploration for embodied question answering,” arXiv preprint arXiv:2403.15941, 2024

APPENDIX A: ADAPTING CMAB FOR GOAT-BENCH

The reward formulation in Sec. IV-A is designed for a single open-ended directive, where exploration progress and semantic similari...

[1] J. A. Placed, J. Strader, H. Carrillo, N. Atanasov, V. Indelman, L. Carlone, and J. A. Castellanos, “A survey on active simultaneous localization and mapping: State of the art and new frontiers,” IEEE Transactions on Robotics, vol. 39, no. 3, pp. 1686–1705, 2023

[2] A. Schperberg, S. Tsuei, S. Soatto, and D. Hong, “Saber: Data-driven motion planner for autonomously navigating heterogeneous robots,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8086–8093, 2021

[3] K. Yu, A. K. Budhiraja, and P. Tokekar, “Algorithms for routing of unmanned aerial vehicles with mobile recharging stations,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5720–5725, 2018

[4] M. Tellaroli, M. Luperto, M. Antonazzi, and N. Basilico, “Frontier-based exploration for multi-robot rendezvous in communication-restricted unknown environments,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5807–5812, 2024

[5] J. Chen, G. Li, S. Kumar, B. Ghanem, and F. Yu, “How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers,” 2023

[6] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” CVPR, 2023

[7] N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, “Vlfm: Vision-language frontier maps for zero-shot semantic navigation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 42–48, 2024

[8] S. H. Karumanchi, B. Rokaha, A. Schperberg, and A. P. Vinod, “Energy-constrained multi-robot exploration for autonomous map building,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9154–9161, 2025

[9] J. Ko, B. Stewart, D. Fox, K. Konolige, and B. Limketkai, “A practical, decision-theoretic approach to multi-robot mapping and exploration,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3232–3238, 2003

[10] K. M. Wurm, C. Stachniss, and W. Burgard, “Coordinated multi-robot exploration using a segmentation of the environment,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1160–1165, 2008

[11] A. Benkrid, A. Benallegue, and N. Achour, “Multi-robot coordination for energy-efficient exploration,” Journal of Control, Automation and Electrical Systems, vol. 30, no. 6, pp. 911–920, 2019

[12] W. Burgard, M. Moors, C. Stachniss, and F. E. Schneider, “Coordinated multi-robot exploration,” IEEE Transactions on Robotics, vol. 21, no. 3, pp. 376–386, 2005

[13] B. Yamauchi, “Decentralized coordination for multirobot exploration,” Robotics and Autonomous Systems, vol. 29, no. 2-3, pp. 111–118, 1999

[14] Y. Zhang and Y. Kantaros, “Namo-llm: Efficient navigation among movable obstacles with large language model guidance,” IEEE Robotics and Automation Letters, vol. 10, no. 12, pp. 13026–13033, 2025

[15] V. S. Dorbala, J. F. Mullen, and D. Manocha, “Can an embodied agent find your “cat-shaped mug”? llm-based zero-shot object navigation,” IEEE Robotics and Automation Letters, vol. 9, pp. 4083–4090, May 2024

[16] K. Zhou, K. Zheng, C. Pryor, Y. Shen, H. Jin, L. Getoor, and X. E. Wang, “Esc: Exploration with soft commonsense constraints for zero-shot object navigation,” 2023

[17] C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual language maps for robot navigation,” 2023

[18] D. Shah, B. Osinski, B. Ichter, and S. Levine, “Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action,” 2022

[19] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” 2022

[20] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” 2022

[21] Z. Gong, R. Li, T. Hu, R. Qiu, L. Kong, L. Zhang, G. Zhao, Y. Ding, and J. Liang, “Stairway to success: An online floor-aware zero-shot object-goal navigation framework via llm-driven coarse-to-fine exploration,” IEEE Robotics and Automation Letters, vol. 11, no. 3, pp. 2943–2950, 2026

[22] Z. Ji, H. Lin, and Y. Gao, “Dynavlm: Zero-shot vision-language navigation system with dynamic viewpoints and self-refining graph memory,” 2025

[23] R. Liu, X. Xu, S. Yuan, and L. Xie, “Handle object navigation as weighted traveling repairman problem,” 2025

[24] V. N. Devarakonda, R. G. Goswami, A. U. Kaypak, N. Patel, R. Khorrambakht, P. Krishnamurthy, and F. Khorrami, “Orionnav: Online planning for robot autonomy with context-aware llm and open-vocabulary semantic scene graphs,” 2024

[25] E. Latif, “3p-llm: Probabilistic path planning using large language model for autonomous robot navigation,” 2024

[26] X. Huang, S. Zhao, Y. Wang, X. Lu, W. Zhang, R. Qu, W. Li, Y. Wang, and C. Wen, “Msgnav: Unleashing the power of multi-modal 3d scene graph for zero-shot embodied navigation,” arXiv preprint arXiv:2511.10376, 2025

[27] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “Semanticfusion: Dense 3d semantic mapping with convolutional neural networks,” 2016

[28] M. Rünz, M. Buffier, and L. Agapito, “Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects,” 2018

[29] J. McCormac, R. Clark, M. Bloesch, A. J. Davison, and S. Leutenegger, “Fusion++: Volumetric object-level slam,” 2018

[30] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji, “Panopticfusion: Online volumetric semantic mapping at the level of stuff and things,” 2019

[31] J. Qian, V. Chatrath, J. Yang, J. Servos, A. P. Schoellig, and S. L. Waslander, “Pocd: Probabilistic object-level change detection and volumetric mapping in semi-static scenes,” 2022

[32] J. Qian, V. Chatrath, J. Servos, A. Mavrinac, W. Burgard, S. L. Waslander, and A. P. Schoellig, “Pov-slam: Probabilistic object-aware variational slam in semi-static environments,” 2023

[33] Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, et al., “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 5021–5028, IEEE, 2024

[34] Y. Deng, B. Yao, Y. Tang, Y. Yang, and Y. Yue, “Openvox: Real-time instance-level open-vocabulary probabilistic voxel representation,” 2025

[35] F. L. Busch, T. Homberger, J. Ortega-Peimbert, Q. Yang, and O. Andersson, “One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation,” arXiv preprint arXiv:2409.11764, 2024

[36] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023

[37] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” arXiv preprint arXiv:2401.17270, 2024

[38] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, 2021

[39] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” arXiv preprint arXiv:2301.12597, 2023

[40] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

[41] C. Zhang, D. Han, Y. Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong, “Faster segment anything: Towards lightweight sam for mobile applications,” arXiv preprint arXiv:2306.14289, 2023

[42] Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” in NeurIPS, 2023

[43] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied question answering,” 2017

[44] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, W. Ai, B. Martinez, H. Yin, M. Lingelbach, M. Hwang, A. Hiranaka, S. Garlanka, A. Aydin, S. Lee, J. Sun, M. Anvari, M. Sharma, D. Bansal, S. Hunter, K.-Y. Kim, A. Lou, C. R. Matthews, I. Villa-Renteria, J. H. Tang, C. Tang, F. Xia, Y. Li, S. Savarese, H. Gweon, ..., “Behavior-1k: A human-centered, embodied ai benchmark with 1,000 everyday activities and realistic simulation,” 2024

[45] A. Z. Ren, J. Clark, A. Dixit, M. Itkina, A. Majumdar, and D. Sadigh, “Explore until confident: Efficient exploration for embodied question answering,” 2024

[46] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “Alfred: A benchmark for interpreting grounded instructions for everyday tasks,” 2020

[47] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra, “Habitat 2.0: Training home assistants to rearrange their habitat,” 2022
