
arxiv: 2606.22219 · v1 · pith:DN73NGTCnew · submitted 2026-06-20 · ⚛️ physics.soc-ph

Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation

Pith reviewed 2026-06-26 10:34 UTC · model grok-4.3

classification ⚛️ physics.soc-ph
keywords LLM spatial navigation · multi-scale benchmark · maze navigation · cross-scale aggregation · sequential reasoning · error localization · fine meso macro scales

The pith

LLMs handle individual spatial skills at fine, meso, and macro scales but cannot aggregate them into long sequential navigation plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests LLMs on maze navigation to locate where spatial reasoning breaks down. It decomposes the task into three levels drawn from human spatial cognition: Fine for local passability, Meso for junction topology, and Macro for global direction. Models answer isolated probes at each level with 30-75 percent accuracy even in large mazes, yet end-to-end navigation success drops to near zero by the 10x10 size. Error analysis shows that failures concentrate at Meso and Fine decisions when the scales must be combined over many steps. The work therefore isolates cross-scale aggregation over sequential plans as the core barrier.

Core claim

The central claim is that the barrier to LLM spatial navigation is cross-scale aggregation of individually available competences over a long sequential plan, not any single perceptual deficit. End-to-end one-shot navigation collapses to near zero by 10x10 mazes for GPT-4o, DeepSeek-V3, and Llama-3.3-70B, while the same models respond to isolated Fine, Meso, and Macro probes at 30-75 percent accuracy far beyond that size. A multi-hot first-error analysis localizes failures to Meso junction choices (59 percent) and Fine perception (39 percent), with global direction almost never at fault (1 percent). Hierarchical delegation of per-step execution to a deterministic walker lifts performance at m

What carries the argument

The multi-scale diagnostic benchmark that decomposes maze navigation into Fine (local passability), Meso (junction topology), and Macro (global goal direction) modules, with separate tests for input formats and hierarchical route planning.
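
To make the three levels concrete, here is a minimal sketch of how answer keys for Fine, Meso, and Macro probes could be computed on a grid maze. It is an illustration, not the paper's implementation: the maze encoding (a set of open `(row, col)` cells), the function names, and the degree-based cell types (`dead_end`, `corridor`, `junction`) are all assumptions made for this example.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical encoding, not the paper's
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}


def fine_passable(open_cells: Set[Cell], cell: Cell, move: str) -> bool:
    """Fine level: can the agent step from `cell` in direction `move`?"""
    dr, dc = MOVES[move]
    return (cell[0] + dr, cell[1] + dc) in open_cells


def meso_cell_type(open_cells: Set[Cell], cell: Cell) -> str:
    """Meso level: classify a cell by how many open neighbours it has."""
    degree = sum(fine_passable(open_cells, cell, m) for m in MOVES)
    if degree <= 1:
        return "dead_end"
    if degree == 2:
        return "corridor"
    return "junction"


def macro_heading(cell: Cell, goal: Cell) -> str:
    """Macro level: coarse compass direction from `cell` to `goal`."""
    dr, dc = goal[0] - cell[0], goal[1] - cell[1]
    ns = "N" if dr < 0 else "S" if dr > 0 else ""
    ew = "W" if dc < 0 else "E" if dc > 0 else ""
    return (ns + ew) or "HERE"


# A small T-shaped maze used only for illustration.
T_MAZE = {(0, 0), (0, 1), (0, 2), (1, 1), (2, 1)}
print(fine_passable(T_MAZE, (0, 0), "E"))   # local passability probe
print(meso_cell_type(T_MAZE, (0, 1)))       # junction topology probe
print(macro_heading((0, 0), (2, 1)))        # global heading probe
```

Each function answers one question in isolation; end-to-end navigation requires chaining all three correctly at every step, which is where the paper locates the failure.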

If this is right

  • Structured coordinate text input outperforms rendered images for all models.
  • End-to-end navigation success falls to near zero by 10x10 mazes across the three tested LLMs.
  • Failures localize to Meso (59 percent) and Fine (39 percent) scales, not Macro direction (1 percent).
  • Delegating execution to a deterministic walker and prompting only at junctions raises GPT-4o success by up to 92 points at mid sizes.
  • The scaling wall re-emerges by 30x30 even under hierarchical planning.
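
The junction-level delegation described in the list above can be sketched as follows. This is a hedged illustration under stated assumptions: the maze is a set of open `(row, col)` cells, `hierarchical_walk` and `greedy_chooser` are hypothetical names, and the chooser is a simple Manhattan-distance stand-in for the LLM query, whose actual prompt and interface are not given in this page.

```python
from typing import Callable, List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical encoding
STEPS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E


def neighbors(open_cells: Set[Cell], cell: Cell) -> List[Cell]:
    r, c = cell
    return [(r + dr, c + dc) for dr, dc in STEPS if (r + dr, c + dc) in open_cells]


def hierarchical_walk(
    open_cells: Set[Cell],
    start: Cell,
    goal: Cell,
    choose: Callable[[Cell, List[Cell], Cell], Cell],
    max_steps: int = 10_000,
) -> List[Cell]:
    """Walk deterministically; call `choose` only where the path branches."""
    path: List[Cell] = [start]
    prev: Optional[Cell] = None
    cur = start
    for _ in range(max_steps):
        if cur == goal:
            break
        options = [n for n in neighbors(open_cells, cur) if n != prev]
        if not options:
            if prev is None:
                break          # isolated start cell: nowhere to go
            nxt = prev         # dead end: turn around
        elif len(options) == 1:
            nxt = options[0]   # corridor: no decision needed
        else:
            nxt = choose(cur, options, goal)  # branch point: delegate
            if nxt not in options:
                break          # treat an illegal answer as a failed run
        prev, cur = cur, nxt
        path.append(cur)
    return path


def greedy_chooser(cur: Cell, options: List[Cell], goal: Cell) -> Cell:
    """Stand-in for the LLM: pick the option closest to the goal."""
    return min(options, key=lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1]))


T_MAZE = {(0, 0), (0, 1), (0, 2), (1, 1), (2, 1)}
route = hierarchical_walk(T_MAZE, (0, 0), (2, 1), greedy_chooser)
print(route, "reached goal:", route[-1] == (2, 1))
```

The point of the design is that the number of model queries scales with the number of branch points rather than with path length. In mazes with cycles this sketch can loop until `max_steps`, so a visited-set or query budget would be needed in practice.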

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation difficulty may limit LLM performance on other long-horizon sequential tasks that require repeated integration of local and global information.
  • Explicit training or prompting regimes that force repeated cross-scale checks could be tested directly with this benchmark.
  • The released set of 1,050 topology-annotated mazes can serve as a reusable diagnostic for new models without requiring new data collection.

Load-bearing premise

That performance on isolated single-level probes measures the same underlying competences that would be available during full end-to-end sequential navigation.

What would settle it

A controlled test in which models receive explicit correct information from the other two scales at every decision point yet still fail at the same rates as in the original end-to-end setting.

Figures

Figures reproduced from arXiv: 2606.22219 by Liqiu Meng, Peng Luo, Yuhan Jiang.

Figure 1: From spatial input to navigation: where LLMs dif…
Figure 2: Benchmark overview. Left: 1,050 mazes over seven effective sizes (3×3 to 30×30) and three difficulty tiers, with start (blue) and goal (red) marked. ① Input acquisition: each maze is rendered as Words, Coordinates, an ASCII Map, or a rendered Image. ② Multi-scale representation: one-shot navigation is decomposed into Fine (cell passability), Meso (junction topology), and Macro (global heading), each probed…
Figure 3: Structured textual input outperforms visual and grid input across every model and size. (a) SR pooled over sizes…
Figure 4: Scaling failure is driven by aggregation, not local perception. (a) One-shot SR vs. size for three models; all collapse…
Figure 5: The cost of each rung of coupling is dwarfed by…
Figure 6: First-step errors of failed one-shot navigation are…
Figure 7: Junction-level delegation lifts navigation but does not remove the scaling barrier. (a) GPT-4o and (b) DeepSeek-V3…
Figure 8: Per-question-subtype accuracy across maze sizes (isolated single-level probes, averaged over the three models). Facets:…
Figure 9: Per-level diagnostic metrics measured inside one-shot navigation, by maze effective size (medium difficulty, pooled over the three models; the figure form of…
Figure 10: A reasoning model (o3-mini) reproduces the main finding (…
Figure 11: Worked examples of the exact prompts and answer keys at every benchmark stage for one small maze (start…

Original abstract

Large language models (LLMs) are increasingly deployed as planners and assistants in tasks with inherent spatial structure, such as navigation and route planning, yet they remain brittle in sequential spatial reasoning. We ask not merely whether LLMs fail at navigation but where in the spatial-cognition pipeline they get lost. We introduce a multi-scale diagnostic benchmark that decomposes maze navigation into three cognitive levels drawn from human spatial cognition: Fine (local passability), Meso (junction topology), and Macro (global goal direction). We evaluate three instruction-tuned chat LLMs (GPT-4o, DeepSeek-V3, Llama-3.3-70B) on 1,050 topology-annotated mazes spanning seven sizes (3x3 to 30x30) and three difficulty tiers. The benchmark is organized as three modules. (i) Input acquisition: among four input formats, structured coordinate text is the most navigable, far surpassing rendered images. (ii) Multi-scale representation: end-to-end one-shot navigation collapses to near zero by 10x10 for every model, yet the same models respond to isolated single-level probes (Fine, Meso, Macro) at 30-75% far beyond that size. A multi-hot first-error analysis localizes failures to Meso junction choices (59%) and Fine perception (39%), with global direction almost never at fault (1%). The barrier is therefore the cross-scale aggregation of individually available competences over a long sequential plan, not any single perceptual deficit. (iii) Hierarchical route planning: delegating per-step execution to a deterministic walker and querying the LLM only at junctions, with an explicit cell-type prompt, lifts GPT-4o success by up to 92 points at mid sizes, but the same scaling wall re-emerges by 30x30. We release the benchmark, mazes, and code as a reusable diagnostic instrument for spatial reasoning in LLMs, available at https://yuhanjiang415.github.io/lost-in-aggregation/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a multi-scale diagnostic benchmark for LLM spatial navigation that decomposes maze tasks into Fine (local passability), Meso (junction topology), and Macro (global goal direction) levels. Evaluations of GPT-4o, DeepSeek-V3, and Llama-3.3-70B on 1,050 annotated mazes (3x3 to 30x30) show end-to-end one-shot navigation collapsing to near zero by 10x10, while isolated single-level probes achieve 30-75% accuracy at larger sizes. A multi-hot first-error analysis of failing trajectories attributes errors to Meso (59%) and Fine (39%), with Macro at 1%. Hierarchical planning (LLM queried only at junctions) improves mid-size performance by up to 92 points but hits the same wall at 30x30. The central claim is that failures arise from cross-scale aggregation of available competences rather than any single perceptual deficit. The benchmark, mazes, and code are released publicly.

Significance. If substantiated, the work supplies a reusable diagnostic instrument that separates perceptual from aggregation failures in LLM sequential reasoning, with direct relevance to planning applications. The public release of materials supports reproducibility and community use. The empirical decomposition into human-inspired cognitive levels offers a concrete framework for future LLM spatial-cognition studies.

major comments (2)
  1. [multi-scale representation module] The central claim that the barrier is cross-scale aggregation (rather than missing individual competences) depends on isolated Fine/Meso/Macro probes measuring the same underlying abilities deployed mid-trajectory in sequential navigation. The multi-hot first-error analysis (59% Meso, 39% Fine) is performed only on failing end-to-end trajectories and does not test whether the model, given the identical partial history, can correctly answer the corresponding probe. This assumption is load-bearing for the localization result and the aggregation conclusion.
  2. [experimental results and first-error analysis] The reported first-error breakdown (59% Meso, 39% Fine, 1% Macro) and probe success rates lack error bars, confidence intervals, or statistical tests, and the abstract provides no implementation details on how first-error localization was performed. These omissions affect the reliability of the claim that global direction is almost never at fault.
minor comments (2)
  1. [input acquisition] The four input formats in the input-acquisition module are described but not illustrated; a supplementary table or figure with concrete examples of each format would aid reproducibility.
  2. [hierarchical route planning] The hierarchical condition description would benefit from explicit pseudocode showing the exact cell-type prompt and junction-query interface.
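
As an illustration of the kind of concrete example minor comment 1 asks for, the sketch below renders one small maze in two text formats. The exact Coordinates and ASCII Map layouts used by the benchmark are not given here, so the symbols (`S`, `G`, `.`, `#`), the wording, and the function names are assumptions for this example only.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical encoding


def to_coordinates(open_cells: Set[Cell], start: Cell, goal: Cell) -> str:
    """Structured coordinate text: list every open cell explicitly."""
    cells = ", ".join(f"({r},{c})" for r, c in sorted(open_cells))
    return (
        f"Open cells: {cells}\n"
        f"Start: ({start[0]},{start[1]})\n"
        f"Goal: ({goal[0]},{goal[1]})"
    )


def to_ascii(open_cells: Set[Cell], start: Cell, goal: Cell, rows: int, cols: int) -> str:
    """ASCII map: one character per cell, walls as '#', open cells as '.'."""
    lines = []
    for r in range(rows):
        row = ""
        for c in range(cols):
            if (r, c) == start:
                row += "S"
            elif (r, c) == goal:
                row += "G"
            elif (r, c) in open_cells:
                row += "."
            else:
                row += "#"
        lines.append(row)
    return "\n".join(lines)


T_MAZE = {(0, 0), (0, 1), (0, 2), (1, 1), (2, 1)}
print(to_coordinates(T_MAZE, (0, 0), (2, 1)))
print(to_ascii(T_MAZE, (0, 0), (2, 1), rows=3, cols=3))
```

The coordinate form states adjacency-relevant facts directly, while the ASCII form requires the reader to recover row and column positions from character offsets.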

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment below, indicating planned revisions where appropriate. Our responses focus on clarifying methodological assumptions and improving statistical reporting.

  1. Referee: [multi-scale representation module] The central claim that the barrier is cross-scale aggregation (rather than missing individual competences) depends on isolated Fine/Meso/Macro probes measuring the same underlying abilities deployed mid-trajectory in sequential navigation. The multi-hot first-error analysis (59% Meso, 39% Fine) is performed only on failing end-to-end trajectories and does not test whether the model, given the identical partial history, can correctly answer the corresponding probe. This assumption is load-bearing for the localization result and the aggregation conclusion.

    Authors: We appreciate this methodological observation. The isolated probes are designed to measure the availability of each scale-specific competence independently of sequential integration demands, while the first-error analysis attributes failures in the integrated task based on the first deviation in the trajectory. We agree that conditioning probes on the exact partial history from failing trajectories would provide stronger validation of the localization. We will revise the manuscript to explicitly discuss this assumption as a limitation of the current design and its implications for interpreting the aggregation claim, while noting that full conditional probing is planned for follow-up work.

    revision: partial

  2. Referee: [experimental results and first-error analysis] The reported first-error breakdown (59% Meso, 39% Fine, 1% Macro) and probe success rates lack error bars, confidence intervals, or statistical tests, and the abstract provides no implementation details on how first-error localization was performed. These omissions affect the reliability of the claim that global direction is almost never at fault.

    Authors: We agree that the lack of error bars, confidence intervals, and statistical tests reduces the robustness of the reported breakdowns. We will add these elements (including standard errors and appropriate tests such as chi-squared for category proportions) to all figures and tables in the revised manuscript. We will also expand the methods section with full implementation details on the multi-hot first-error localization procedure, including labeling criteria and trajectory processing steps. The abstract will be updated to reference these details where space allows, with the main text providing the complete description.

    revision: yes
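
One way the requested uncertainty could be reported is a per-category binomial interval. The sketch below computes a Wilson score interval for a single proportion. The counts are hypothetical, since the number of failing trajectories is not stated here, and `wilson_interval` is an illustrative helper, not code from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts only: 100 failed runs with 59 Meso, 39 Fine, 1 Macro labels.
n_failed = 100
for level, count in [("Meso", 59), ("Fine", 39), ("Macro", 1)]:
    lo, hi = wilson_interval(count, n_failed)
    print(f"{level}: {count / n_failed:.2f} (95% CI {lo:.3f} to {hi:.3f})")
```

Because the labels are multi-hot, a single failure may carry more than one level, so the categories need not be mutually exclusive. Per-category intervals like these avoid that assumption, whereas a chi-squared goodness-of-fit test over the three categories would require exclusive labels.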

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or fitted parameters


The paper presents an empirical diagnostic benchmark for LLM navigation across scales, reporting observed success rates on isolated probes versus end-to-end tasks. No equations, derivations, parameter fitting, or self-citation chains appear in the provided text. Central claims rest on direct experimental measurements (e.g., 30-75% probe success vs. near-zero end-to-end), which are falsifiable against external benchmarks and do not reduce to any input by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmark paper with no mathematical derivations or new physical entities; relies on standard assumptions about LLM prompting and the validity of the human-inspired cognitive decomposition.

axioms (1)
  • domain assumption: The Fine/Meso/Macro decomposition drawn from human spatial cognition accurately partitions the navigation task for LLMs.
    Invoked to structure the three benchmark modules and interpret error localization.

pith-pipeline@v0.9.1-grok · 5902 in / 1188 out tokens · 36243 ms · 2026-06-26T10:34:58.634150+00:00 · methodology


Appendix

This appendix collects four supplementary figures; full details are in each caption. Figure 8 break...