CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation

Alvin Jinsung Choi; Dasol Hong; Hyun Myung; Taeyun Kim

arxiv: 2605.19206 · v1 · pith:BQQAFE7Onew · submitted 2026-05-19 · 💻 cs.RO

Pith reviewed 2026-05-20 06:21 UTC · model grok-4.3

keywords: zero-shot object-goal navigation · contextual cues · semantic mapping · large language model · adaptive exploration · robot navigation · unified value map

The pith

A language model estimates how strongly each target object ties to room types so the robot can favor room cues or nearby-object cues accordingly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a navigation approach that queries an offline large language model to gauge the strength of association between a given target object and common room types. When the association is strong, the system leans on room-type information to direct exploration; when weak, it shifts emphasis to cues from co-located objects. These two sources are combined into one value map whose entries are scaled by the estimated ambiguity of the target, producing an exploration strategy that changes its focus depending on the object being sought. A sympathetic reader would care because uniform use of either room or object context alone often wastes steps in large unknown spaces, whereas adaptive selection could shorten paths and raise success rates for everyday household items.

Core claim

The central claim is that an offline large language model can estimate how strongly a target object is associated with room types, and that this estimate lets the agent prioritize room cues for objects that belong in predictable locations and object cues for those with weaker room ties. Both sources are fused into a single semantic value map whose weighting adapts to the target's ambiguity, which in turn guides more efficient exploration in zero-shot object-goal navigation.

What carries the argument

The unified semantic value map that merges room-type associations and object co-occurrence information with weights set by the target's estimated room-association strength from the language model.

If this is right

For objects with strong room associations the agent directs search first toward the predicted room types.
For objects with weak room associations the agent instead searches near semantically related objects.
The adaptive weighting reduces steps spent in irrelevant areas compared with non-adaptive cue use.
Multi-viewpoint verification applied on top of the map further improves detection reliability during navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptive cue selection could be tested in other robotic tasks that rely on varying contextual priors such as semantic search or task planning.
Allowing the language model to be queried during operation rather than only offline might let the weights adjust to observed mismatches in a given environment.
Incorporating additional map layers such as typical lighting conditions or object affordances could refine the value estimates beyond room and object cues alone.

Load-bearing premise

The offline language model supplies reliable, generalizable commonsense links between objects and rooms that hold in the actual visual environments the robot encounters.

What would settle it

Deploy the system in multiple test scenes where object placements systematically contradict the language model's room associations and check whether success rate and success-weighted path length fall below those of fixed room-only or object-only baselines.
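The two metrics named in this test can be computed as in the minimal sketch below. It follows the standard definition of success weighted by path length from the embodied-navigation evaluation literature; the `Episode` record and its field names are illustrative assumptions, not taken from the paper, and the sketch assumes every episode has a positive shortest-path length.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical per-episode record; field names are illustrative only.
    success: bool         # True if the agent stopped close enough to the target
    shortest_path: float  # shortest-path length from start to goal (assumed > 0)
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes):
    """Fraction of episodes that ended in success (SR)."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes):
    """Success weighted by path length (SPL).

    Each successful episode contributes shortest / max(travelled, shortest);
    failed episodes contribute zero.
    """
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Toy data: one inefficient success, one optimal success, one failure.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),
    Episode(success=True, shortest_path=4.0, agent_path=4.0),
    Episode(success=False, shortest_path=6.0, agent_path=12.0),
]
print(success_rate(episodes), spl(episodes))
```

On this toy data two of three episodes succeed, and SPL is 0.5 because the first success travelled twice the shortest path. Running the same computation for the adaptive system and for fixed room-only and object-only baselines on the contradictory scenes would give the comparison described above.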

Figures

Figures reproduced from arXiv: 2605.19206 by Alvin Jinsung Choi, Dasol Hong, Hyun Myung, Taeyun Kim.

**Figure 2.** Overview of CLUE. (a) Construction of a unified semantic value map by adaptively balancing contextual cues according to the target’s characteristics.

**Figure 3.** Illustration of the agent’s current observation and its corresponding

**Figure 4.** Configuration for real-world experiments. (a) A customized UGV platform based on a Clearpath Jackal, equipped with an Intel NUC, a Velodyne

**Figure 5.** Resource usage analysis: comparison of computation time (ms) and

Abstract

Zero-shot object-goal navigation (ZSON) is a challenging problem in robotics that requires a comprehensive understanding of both language and visual observations. Contextual cues from rooms and objects are critical, but their relative importance depends on the target: some objects are strongly tied to specific room types, while others are better predicted by nearby co-located objects. Existing methods overlook this distinction, leading to inefficient and inaccurate exploration. We present CLUE, a novel navigation framework that adaptively balances the use of contextual rooms and objects by leveraging commonsense knowledge extracted from an offline large language model (LLM). By estimating a target's association with room types using LLM, the agent prioritizes room cues for predictable objects and object cues for those with weak room associations. Our framework constructs a unified semantic value map that integrates both types of contextual information, adaptively weighted by the target's ambiguity to guide exploration. Combined with multi-viewpoint verification and an exploration strategy informed by contextual cues, CLUE achieves robust and efficient navigation. Extensive experiments in simulation and real-world deployments show that our method consistently outperforms state-of-the-art baselines in both success rate (SR) and success weighted by path length (SPL), demonstrating its effectiveness and practicality for real-world navigation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLUE gives a clean way to let an LLM decide when to trust room cues over object cues in zero-shot navigation and folds both into one map, with ablations that back the adaptive step.

The main thing to know is that this paper shows how to use an offline LLM to score how strongly a target object links to room types, then adaptively weight room versus object co-occurrence signals inside a single semantic value map. That weighting changes per target based on estimated ambiguity, which is a step past the uniform cue treatment in earlier ZSON work. They lay out the exact prompt, the conversion from LLM output to scalar weights, and the map update rule, plus ablations that isolate the adaptive term and report SR and SPL gains on HM3D and MP3D. Real-robot runs are included as well, which adds some practical weight. The unified map keeps the planner simple while still blending the two context sources.

The stress-test note confirms the manuscript supplies those implementation details and that the reported lifts track the mechanism rather than generic exploration. That makes the central claim more credible than the abstract alone suggested. One soft spot is that performance still depends on the LLM giving accurate room associations that hold up under the noise of real visual detection. The paper uses multi-view verification to reduce some risk, but it does not appear to include deliberate stress tests with bad LLM outputs or unusually cluttered scenes. That is a minor rather than load-bearing concern.

The work is aimed at people building language-guided domestic robots who need exploration that works without task-specific training data. A reader focused on practical ZSON methods will find usable implementation choices here. I would send it to peer review; the evidence is sufficient to justify the claims even if the advance stays incremental.

Referee Report

0 major / 2 minor

Summary. The paper introduces CLUE, a zero-shot object-goal navigation framework that extracts commonsense room-object associations from an offline LLM to adaptively weight room-type versus object-co-occurrence cues inside a unified semantic value map. The map, modulated by per-target ambiguity, guides exploration together with multi-viewpoint verification. The authors claim consistent gains in success rate (SR) and success weighted by path length (SPL) over baselines on HM3D and MP3D splits as well as real-world tests, supported by ablations that isolate the adaptive-weighting component.

Significance. If the reported gains are reproducible and attributable to the adaptive mechanism, the work would meaningfully advance ZSON by showing how LLM-derived priors can be integrated into semantic mapping without task-specific fine-tuning. The explicit prompting template, scalar-weight mapping, and update rule in Sections 3.2–3.3, together with the ablations isolating the weighting term, strengthen reproducibility and potential impact.

minor comments (2)

Abstract: the claim of consistent outperformance in SR and SPL is stated without any numerical values, error bars, or baseline identifiers, which reduces the abstract's utility for readers seeking a quick assessment of effect size.
Section 3.2: the 'unified semantic value map' is introduced as an invented construct; a compact equation or pseudocode definition early in the section would clarify how the two cue types are combined before the adaptive weighting is applied.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were provided in the report, so we have no points to address point-by-point at this stage. We will incorporate any minor suggestions during revision to strengthen the manuscript.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation relies on an external offline LLM to supply independent commonsense object-room associations that modulate the adaptive weighting between room-type and object-co-occurrence terms inside the unified semantic value map. This step does not reduce to a self-referential definition or fitted prediction because the LLM outputs are generated outside the navigation task and the weighting rule is a fixed design choice applied to those outputs rather than optimized against the same performance metrics. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing elements, and the reported gains on HM3D/MP3D are presented as empirical outcomes of the external knowledge source rather than tautological re-expressions of fitted inputs. The construction remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the untested premise that LLM room-object associations are accurate enough to drive exploration decisions and that the ambiguity-based weighting rule improves net performance. No free parameters are explicitly named, but the weighting function itself acts as an implicit tunable component.

axioms (1)

domain assumption: LLM commonsense knowledge about object-room associations is sufficiently accurate and transferable to visual navigation environments.
Invoked when the agent estimates target association with room types to decide cue priority.

invented entities (1)

unified semantic value map (no independent evidence)
purpose: Integrates room and object contextual cues with adaptive weighting to guide exploration.
New map representation introduced to combine the two cue types; no independent evidence outside the claimed experiments is provided.

pith-pipeline@v0.9.0 · 5766 in / 1377 out tokens · 36534 ms · 2026-05-20T06:21:08.517156+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

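IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $v_{\mathrm{sem}} = v_{\mathrm{target}} + \omega_{\mathrm{room}} \cdot v_{\mathrm{room}} + \omega_{\mathrm{object}} \cdot v_{\mathrm{object}}$, with $\omega_{\mathrm{room}} = 1 - H(O_{\mathrm{target}})$ and $\omega_{\mathrm{object}} = H(O_{\mathrm{target}})$, where $H$ is the normalized entropy of the LLM room probabilities.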
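The weighting in the passage above can be illustrated with a minimal sketch. It assumes the LLM output has already been turned into one probability per room type and that "normalized entropy" means Shannon entropy divided by the log of the number of room types; the function names and the per-cell formulation are illustrative assumptions, not the authors' implementation.

```python
import math


def normalized_entropy(room_probs):
    """Shannon entropy of the room distribution, scaled to [0, 1].

    Assumption: normalization divides by log(number of room types).
    """
    total = sum(room_probs)
    if len(room_probs) < 2 or total <= 0:
        raise ValueError("need at least two room types with positive mass")
    p = [q / total for q in room_probs]
    h = -sum(q * math.log(q) for q in p if q > 0)
    # Clamp to guard against tiny floating-point overshoot.
    return min(1.0, max(0.0, h / math.log(len(p))))


def cue_weights(room_probs):
    """Return (w_room, w_object) = (1 - H, H) as in the quoted passage."""
    h = normalized_entropy(room_probs)
    return 1.0 - h, h


def fuse_cell(v_target, v_room, v_object, room_probs):
    """Fused value for one map cell: v_target + w_room*v_room + w_object*v_object."""
    w_room, w_object = cue_weights(room_probs)
    return v_target + w_room * v_room + w_object * v_object


# A target tied to one room type leans on room cues; a target spread
# evenly over room types leans on object cues.
print(cue_weights([0.97, 0.01, 0.01, 0.01]))
print(cue_weights([0.25, 0.25, 0.25, 0.25]))
```

Because the two weights sum to one, a peaked room distribution pushes nearly all contextual weight onto the room term, while a flat distribution shifts it to the object term.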
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Gaussian contextual object score $v_{\mathrm{object}}(x, y) = A \exp\!\left(-\frac{(x - \bar{x})^2 + (y - \bar{y})^2}{2\sigma^2}\right)$, with $A$ and $\sigma$ derived from the LLM correlation.
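The Gaussian score in the passage above can be evaluated over a small grid as in the sketch below. How the amplitude and spread are derived from the LLM correlation is not stated on this page, so both are passed in as plain parameters; the function names and grid layout are illustrative assumptions.

```python
import math


def object_score(x, y, cx, cy, amplitude, sigma):
    """Gaussian score at (x, y) around a contextual object centred at (cx, cy).

    amplitude and sigma stand in for the A and sigma of the passage; their
    derivation from the LLM correlation is not reproduced here.
    """
    d2 = (x - cx) ** 2 + (y - cy) ** 2
    return amplitude * math.exp(-d2 / (2.0 * sigma ** 2))


def object_score_grid(width, height, cx, cy, amplitude, sigma):
    """Score every cell of a width x height grid; rows index y, columns index x."""
    return [
        [object_score(x, y, cx, cy, amplitude, sigma) for x in range(width)]
        for y in range(height)
    ]


# A 5 x 5 grid with a contextual object detected at cell (2, 2).
grid = object_score_grid(5, 5, cx=2, cy=2, amplitude=0.8, sigma=1.5)
for row in grid:
    print(" ".join(f"{v:.2f}" for v in row))
```

The score peaks at the detected object's position with value equal to the amplitude and decays smoothly with distance, so cells near a strongly related object receive more value than distant ones.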

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
