Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Chih-Ting Liao; Chunlei Meng; Chunrui Liu; Xin Cao; Xi Xiao; Yitong Qiao; Zhangquan Chen; Zhikai Pan

arxiv: 2605.28277 · v1 · pith:5IO2DY2Qnew · submitted 2026-05-27 · 💻 cs.AI

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao

Pith reviewed 2026-06-29 11:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLMs · spatial reasoning · world models · multilingual benchmark · viewpoint reasoning · reasoning cliff · text-only evaluation

The pith

Once atomic spatial accuracy exceeds 40%, no evaluated LLM retains even half of that accuracy on tasks that require viewpoint reasoning from text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can build consistent internal spatial world models from pure text descriptions by creating MentalMap, a benchmark spanning eight languages and a six-level hierarchy of tasks. It finds that every model hits a sharp performance cliff at level L3, where viewpoint changes must be tracked, even when simpler atomic facts are handled correctly. The drop occurs regardless of model size, language, or prompting method. The same pattern appears in human subjects given identical text-only inputs, which the authors take to suggest that the limit stems from text-only working memory constraints rather than from current model architectures. This reframes text-based spatial reasoning as a problem that requires handling multiple reference frames and memory demands at the same time.

Core claim

No evaluated LLM retains even half of its L0 performance on viewpoint reasoning tasks once baseline atomic accuracy exceeds 40 percent. The L3 cliff appears uniformly across thirteen models, eight typologically diverse languages, multiple scales, and prompting strategies. Human evaluators under the same pure-text protocol reproduce the identical failure pattern, which the authors attribute to inherent constraints of text-only working memory rather than any model-specific limitation.
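The criterion in this claim can be written down as a small check. The sketch below is illustrative only: the `Cell` record, the function names, and the example accuracies are hypothetical and not taken from the paper, and it assumes that "retaining half" means an L3 accuracy of at least 0.5 times the L0 accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One (model, language) pair; a hypothetical record, not the paper's schema."""
    model: str
    language: str
    l0: float  # atomic-fact accuracy in [0, 1]
    l3: float  # viewpoint-reasoning accuracy in [0, 1]


def violates_cliff(cell: Cell, l0_floor: float = 0.40, keep: float = 0.5) -> bool:
    """True if the cell would be a counterexample to the stated claim:
    L0 is above the floor and L3 is at least `keep` times L0."""
    return cell.l0 > l0_floor and cell.l3 >= keep * cell.l0


# Placeholder numbers for illustration, not results from the paper.
cells = [
    Cell("model-a", "en", 0.80, 0.30),
    Cell("model-a", "ja", 0.60, 0.10),
    Cell("model-b", "en", 0.35, 0.30),  # below the 40% floor, so outside the claim
]
counterexamples = [c for c in cells if violates_cliff(c)]
```

Under this reading, the universality claim amounts to `counterexamples` being empty across every evaluated (model, language) pair.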

What carries the argument

The MentalMap benchmark, built from 100 ProcTHOR scenes and organized into a six-level hierarchy (L0 atomic facts through L5 generative world-graph construction) plus four diagnostic axes for frame of reference, reading-direction bias, reasoning effort, and hallucination.
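For the generative end of the hierarchy, the figures report separate node and edge F1 scores for generated world graphs. A minimal sketch of such a comparison follows. It assumes exact set matching over object names and (subject, relation, object) triples, which is a simplification chosen here and may differ from the paper's evaluator; the scene contents are invented for illustration.

```python
def set_f1(predicted, gold) -> float:
    """F1 between two collections treated as sets (exact-match assumption)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: treated as perfect agreement
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and generated graphs for one scene.
gold_nodes = {"cup", "counter", "sink"}
pred_nodes = {"cup", "counter", "fridge"}  # "fridge" is a hallucinated object
gold_edges = {("cup", "on", "counter"), ("sink", "in", "counter")}
pred_edges = {("cup", "on", "counter"), ("fridge", "next_to", "counter")}

node_f1 = set_f1(pred_nodes, gold_nodes)
edge_f1 = set_f1(pred_edges, gold_edges)
```

Scoring nodes and edges separately makes it possible to distinguish a model that names the right objects but relates them wrongly from one that invents objects outright.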

If this is right

Viewpoint reasoning forms a distinct bottleneck separate from atomic spatial fact retrieval.
The performance cliff cannot be overcome by increasing model scale or changing prompting strategies.
Structured-output failures and hallucination rates vary by model family while the L3 cliff remains constant.
Pure-text spatial reasoning must be treated as a multi-axis problem involving reference frames and memory load.
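To make the reference-frame demand concrete, the sketch below converts an allocentric layout (fixed map coordinates) into an egocentric relation for an observer with a given heading. The coordinate convention (x grows east, y grows north), the four-way heading set, and the function name are assumptions made for illustration; the paper's task format may differ.

```python
# Unit heading vectors; x grows east, y grows north (assumed convention).
HEADINGS = {"north": (0, 1), "east": (1, 0), "south": (0, -1), "west": (-1, 0)}


def egocentric_relation(observer, heading, target) -> str:
    """Describe where `target` lies relative to an observer facing `heading`."""
    hx, hy = HEADINGS[heading]
    dx, dy = target[0] - observer[0], target[1] - observer[1]
    forward = dx * hx + dy * hy    # projection onto the heading
    rightward = dx * hy - dy * hx  # projection onto the heading rotated 90 degrees clockwise
    parts = []
    if forward > 0:
        parts.append("in front")
    elif forward < 0:
        parts.append("behind")
    if rightward > 0:
        parts.append("right")
    elif rightward < 0:
        parts.append("left")
    return " and ".join(parts) or "same position"


cup = (2, 0)
facing_north = egocentric_relation((0, 0), "north", cup)
facing_south = egocentric_relation((0, 0), "south", cup)
```

The same object flips from one side to the other when the observer turns around, so an answer cannot be read off a stored atomic fact; it has to be recomputed from a maintained layout.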

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Multimodal inputs that supply visual structure could bypass the observed text-only memory limit.
Analogous cliffs may exist in other domains that require maintaining consistent internal state across transformations, such as causal or temporal reasoning.
Testing scratchpad or external memory augmentation on the same tasks would directly measure whether the bottleneck is removable.

Load-bearing premise

The six-level hierarchy and thirty-nine task families isolate genuine progressive spatial world-modeling skills without confounds from language structure or task design.

What would settle it

Provide models with an external coordinate grid or diagram of each scene and check whether viewpoint-reasoning accuracy then rises above half the atomic baseline. Improvement would support the text-only working-memory account, because the external aid removes the memory load; sustained failure would count against it.
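One way such an external grid could be supplied in text form is sketched below. This is an assumption about how the test might be set up, not a procedure from the paper: the function name, the letter-per-object encoding, and the example objects are hypothetical, and the sketch assumes at most 26 objects with integer coordinates inside the grid.

```python
def render_grid(objects: dict, width: int, height: int) -> str:
    """Render named objects at integer (x, y) positions as an ASCII grid plus legend.

    The origin (0, 0) is the bottom-left cell, so the first printed row has the largest y.
    """
    grid = [["." for _ in range(width)] for _ in range(height)]
    legend = []
    for index, (name, (x, y)) in enumerate(sorted(objects.items())):
        symbol = chr(ord("A") + index)  # A, B, C, ... in alphabetical order of name
        grid[height - 1 - y][x] = symbol
        legend.append(f"{symbol} = {name} at (x={x}, y={y})")
    rows = [" ".join(row) for row in grid]
    return "\n".join(rows + legend)


scene_text = render_grid({"cup": (0, 0), "sofa": (2, 1)}, width=3, height=2)
```

The rendered string can be prepended to the original prompt, so that the only difference between conditions is the presence of the external layout.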

Figures

Figures reproduced from arXiv: 2605.28277 by Chih-Ting Liao, Chunlei Meng, Chunrui Liu, Xin Cao, Xi Xiao, Yitong Qiao, Zhangquan Chen, Zhikai Pan.

**Figure 1.** MENTALMAP overview. (A) Scene-to-text construction. ProcTHOR scenes are decomposed into traceable world-state artifacts (objects, receptacles, containment, support, spatial relations), then converted into single- and multi-action temporal cases that instantiate benchmark items with canonical ground-truth, output-format, normaliser, and evaluator keys. (B) Six-level capability staircase (L0–L5), from surfac…

**Figure 2.** Universal L3 viewpoint cliff (F1). Each point is one (model, language) pair. All points lie below L3 = L0/2; most lie below L3 = L0/4. … ja 48.9%, Gemma en 36.7%) but near-zero for several frontier closed-source models. The frame-transformation family is most consistently solvable by frontier closed-source models (GPT-4o 21–52%), while most open-weight models score 0% in seven of eight languages. Aggregate L3 thus …

**Figure 3.** Reasoning-prompt effect is level-stratified (F3).

**Figure 4.** Multilingual fingerprints (F6). Heatmap of the nine principal LLMs × eight languages (Qwen2.5-32B scale control and Qwen2.5-VL ablation excluded from this view); cells encode within-model z-score (color) and absolute pass rate (number, mean of L3–L5). The two dendrograms are hierarchical-clustering trees: closer branches share more similar performance profiles. The top tree clusters languages (en/de/es vs.…

**Figure 5.** Full L5 node F1 vs. edge F1 scatter across all (model, sub-task, language) cells. Diagonal marks node = edge. Marker shape distinguishes closed-source (•) from open-weight models.

**Figure 6.** PCA embedding of (model, language) profiles. Recoverable plot labels: axes PC1 (68% variance) and PC2 (13% variance); color encodes script (Latin: en/de/es; CJK: zh/ja/ko; Arabic: ar; Thai: th); marker shape encodes model group (closed-source; Llama; Qwen / Gemma; Mistral / Falcon3).
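The within-model z-score used for the heatmap colors in Figure 4 can be sketched as follows. The use of the population standard deviation and the example pass rates are assumptions made here; the caption does not say which variant the authors use.

```python
from statistics import mean, pstdev


def within_model_z(pass_rates: dict) -> dict:
    """Standardize one model's per-language pass rates against that model's own mean.

    Uses the population standard deviation (an assumption); returns zeros if
    the model scores identically in every language.
    """
    values = list(pass_rates.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {lang: 0.0 for lang in pass_rates}
    return {lang: (rate - mu) / sigma for lang, rate in pass_rates.items()}


# Placeholder pass rates for one hypothetical model, not values from the paper.
z = within_model_z({"en": 0.30, "ja": 0.10})
```

Standardizing within each model removes overall ability differences, so the colors show which languages are relatively strong or weak for a given model rather than which model is best.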

Original abstract

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a new multilingual benchmark that documents a consistent L3 performance cliff on viewpoint reasoning once basic spatial accuracy clears 40%, with the same pattern in humans.

The main thing to know is that this work sets up MentalMap with a six-level hierarchy and four diagnostic axes, then shows LLMs drop sharply on L3 viewpoint tasks across eight languages and thirteen models, while humans under the same text-only protocol show the same limit.

It does a clean job on the empirical side by using 100 ProcTHOR scenes, 39 task families, and a structured-text control. Running the same protocol on humans is useful for grounding the claim that the bottleneck is text-only working memory rather than LLM-specific. The cross-lingual consistency and the quantified threshold add concrete data points that prior single-language or single-level tests did not supply.

The soft spot is the interpretation of the cliff itself. Viewpoint tasks embed longer relation chains and more tokens than L0 atomic facts, so the drop could track standard transformer limits on prompt complexity instead of a missing internal spatial graph. The abstract notes persistence across prompting strategies, but without an ablation that holds chain length or token count fixed while varying only the spatial integration demand, the world-model claim stays under-supported. The choice of the 40% L0 threshold and half-performance cutoff also needs explicit justification and error bars in the full text.

The hierarchy assumes the levels isolate progressive capabilities without language-structure confounds; the multilingual design helps, but the paper should show that the L3 tasks do not simply scale inference depth.

This is for groups working on spatial planning, robotics interfaces, or multimodal extensions. Readers who want a ready diagnostic suite and human baselines will get direct value. It has enough structure and scope to merit a serious referee, mainly to check the missing ablations and statistics.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the MentalMap benchmark, a multilingual diagnostic for spatial reasoning in LLMs featuring a six-level hierarchy (L0-L5) from atomic facts to world-graph construction, based on 100 ProcTHOR scenes across eight languages. Evaluating thirteen LLMs, it reports a universal L3 cliff in viewpoint reasoning: no model retains half its L0 performance once atomic accuracy exceeds 40%, and the cliff persists across languages, scales, and prompts. Human evaluations replicate the pattern, which the authors attribute to text-only working memory constraints. Diagnostic axes cover frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination.

Significance. If the L3 cliff is shown to reflect absent world-model construction rather than task-length confounds, the result would be significant for LLM reasoning research by reframing pure-text spatial capabilities as limited by working memory and motivating multimodal or scratchpad methods. The multilingual scope, structured-text control, 39 task families, and human replication under identical protocol are clear strengths that add diagnostic value beyond monolingual English benchmarks.

major comments (2)

[Abstract] The central claim that 'no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%' supplies no information on statistical tests, error bars, data exclusion rules, or how the 40% threshold and half-performance criterion were chosen, leaving the universality of the L3 cliff difficult to evaluate.
[Abstract / hierarchy description] L3 viewpoint-reasoning tasks embed longer chains of relations and more tokens than L0 atomic facts. The persistence across prompting strategies is noted but does not substitute for an explicit ablation that holds chain length and prompt token count fixed while varying only the spatial integration demand; without it the cliff risks being explained by standard transformer working-memory limits rather than missing world-model construction.

minor comments (2)

Clarify the exact distribution of the 1,950 evaluation cells across the 39 task families, languages, and the four diagnostic axes.
The abstract could list the specific prompting strategies tested so that the 'persists across prompting strategies' claim can be directly replicated.
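A first-pass check on the length confound raised in the major comments could stratify existing results by prompt length before comparing levels. The sketch below is an observational approximation, not the controlled ablation the referee requests: the item tuple layout, the bin width, and the example items are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_length_bin(items, bin_width: int = 50) -> dict:
    """items: iterable of (level, prompt_tokens, correct) tuples (hypothetical layout).

    Returns accuracy keyed by (length_bin, level).
    """
    tallies = defaultdict(lambda: [0, 0])  # (bin, level) -> [correct, total]
    for level, prompt_tokens, correct in items:
        key = (prompt_tokens // bin_width, level)
        tallies[key][0] += int(correct)
        tallies[key][1] += 1
    return {key: hits / total for key, (hits, total) in tallies.items()}


def matched_gaps(items, low: str = "L0", high: str = "L3", bin_width: int = 50) -> dict:
    """Accuracy gap (low minus high) within each length bin that contains both levels."""
    acc = accuracy_by_length_bin(items, bin_width)
    low_bins = {b for b, level in acc if level == low}
    high_bins = {b for b, level in acc if level == high}
    return {b: acc[(b, low)] - acc[(b, high)] for b in sorted(low_bins & high_bins)}


# Placeholder items for illustration only.
items = [
    ("L0", 120, True), ("L0", 130, True),
    ("L3", 125, False), ("L3", 140, True),
    ("L3", 400, False),  # no L0 item of comparable length, so this bin is dropped
]
gaps = matched_gaps(items)
```

If the L0 to L3 gap stays large inside length-matched bins, prompt length alone is a less likely explanation; if it shrinks, the confound deserves the dedicated ablation.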

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results on the L3 cliff. We respond to each major comment below.

Referee: [Abstract] The central claim that 'no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%' supplies no information on statistical tests, error bars, data exclusion rules, or how the 40% threshold and half-performance criterion were chosen, leaving the universality of the L3 cliff difficult to evaluate.

Authors: We agree the abstract would be strengthened by these details. Error bars appear in all main figures, statistical tests are reported in Section 4.3 and Appendix B, and data exclusion (primarily invalid parses) follows the protocol in Section 3.2. The 40% threshold marks the observed inflection where retention falls below half of L0 across models; the half-performance criterion is a direct relative measure. We will revise the abstract to reference these elements concisely.

Revision: yes

Referee: [Abstract / hierarchy description] L3 viewpoint-reasoning tasks embed longer chains of relations and more tokens than L0 atomic facts. The persistence across prompting strategies is noted but does not substitute for an explicit ablation that holds chain length and prompt token count fixed while varying only the spatial integration demand; without it the cliff risks being explained by standard transformer working-memory limits rather than missing world-model construction.

Authors: This concern is well-taken. While the cliff persists across prompting variants that alter length and structure, and human evaluations under the same pure-text protocol show the identical pattern, we did not include an ablation that holds chain length and token count exactly fixed while isolating spatial integration. We will add a dedicated discussion of this potential confound in the revised manuscript, framing the result as consistent with text-only working-memory limits and outlining how future synthetic-task experiments could further isolate the factors.

Revision: partial
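The error bars at issue in the first exchange could, for example, be obtained with a percentile bootstrap over the retention ratio L3/L0. The sketch below is one possible approach and not the authors' stated procedure: it assumes per-item 0/1 outcomes are available, resamples the two levels independently, and uses placeholder data.

```python
import random


def bootstrap_retention_ci(l0_outcomes, l3_outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for (L3 accuracy) / (L0 accuracy).

    Outcomes are lists of 0/1 item results; the levels are resampled independently,
    which is an assumption (items are not paired across levels here).
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_resamples):
        l0 = [rng.choice(l0_outcomes) for _ in l0_outcomes]
        l3 = [rng.choice(l3_outcomes) for _ in l3_outcomes]
        l0_acc = sum(l0) / len(l0)
        if l0_acc == 0:
            continue  # ratio undefined for this resample
        ratios.append((sum(l3) / len(l3)) / l0_acc)
    if not ratios:
        raise ValueError("L0 accuracy was zero in every resample")
    ratios.sort()
    lower = ratios[int((alpha / 2) * len(ratios))]
    upper = ratios[min(len(ratios) - 1, int((1 - alpha / 2) * len(ratios)))]
    return lower, upper


# Placeholder outcomes: 80% correct at L0, 20% correct at L3 (illustrative only).
l0_items = [1] * 80 + [0] * 20
l3_items = [1] * 20 + [0] * 80
low, high = bootstrap_retention_ci(l0_items, l3_items)
```

A cell would support the cliff claim with some confidence when the upper end of its interval stays below 0.5, which is a stronger statement than the point estimate alone.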

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or self-referential reductions

The paper introduces MentalMap as a new multilingual benchmark with a six-level hierarchy and reports empirical performance results across 13 LLMs on 39 task families. All central claims (universal L3 cliff once L0 exceeds 40%, persistence across languages and prompts, human-model alignment) are direct observational outcomes from evaluation cells rather than predictions derived from equations, fitted parameters, or self-citations. No load-bearing step reduces by construction to its inputs; the hierarchy functions as a descriptive taxonomy for task design, not a self-defining loop. The work is self-contained against external benchmarks and contains no mathematical derivations that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark and hierarchy but the abstract mentions no fitted parameters, no ad-hoc axioms beyond standard evaluation practices, and no new postulated entities.

[1] [1]

Pranjal Aggarwal and Swarnadeep Saha. 2025. Op- timalThinkingBench: Evaluating over and under- thinking in LLMs.arXiv preprint arXiv:2508.13141. Meta

work page arXiv 2025

[2] [2]

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Cyril Blakeney, Guilherme Penedo, Lewis Pel- letier, Leandro von Werra, and Thomas Wolf. 2025. Smollm2: When smol goes big–data-centric train- ing of a small language model.arXiv preprint arXiv:2502.02737

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, and Erhong Yang. 2026. From human cognition to neural activations: Probing the computa- tional primitives of spatial reasoning in LLMs.arXiv preprint arXiv:2603.26323. Beijing Language and Culture University. 9

work page arXiv 2026

[4] [4]

Neil Burgess. 2006. Spatial memory: How egocen- tric and allocentric combine.Trends in Cognitive Sciences, 10(12):551–557

2006

[5] [5]

Omar Choukrani, Idriss Malek, Daniil Orel, Zhuo- han Xie, Zangir Iklassov, Martin Takáˇc, and Salem Lahlou. 2025. LLM-BabyBench: Understanding and evaluating grounded planning and reasoning in LLMs.arXiv preprint arXiv:2505.12135. MBZUAI

work page arXiv 2025

[6] [6]

Nelson Cowan. 2001. The magical number 4 in short- term memory: A reconsideration of mental storage capacity.Behavioral and Brain Sciences, 24(1):87– 114

2001

[7] [7]

DeepSeek-AI. 2025. DeepSeek-R1: Incentiviz- ing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Jordi Salvador, Kiana Ehsani, Winson Han, Eric Kolve, Ali Farhadi, Aniruddha Kembhavi, and Roozbeh Mottaghi. 2022. ProcTHOR: Large-scale embodied AI using procedural generation. InAd- vances in Neural Information Processing Systems (NeurIPS). Outstanding Paper Award

2022

[9] [9]

Hafsteinn Einarsson. 2025. MazeEval: A benchmark for testing sequential decision-making in language models.arXiv preprint arXiv:2507.20395

work page arXiv 2025

[10]

Mike Farmer, Abhinav Kochar, and Yugyung Lee. The α-law of observable belief revision in large language model inference. arXiv preprint arXiv:2603.19262. University of Missouri–Kansas City.

[12]

Gemma Team. 2025. Gemma 3 technical report. https://goo.gle/Gemma3Report

[13]

Gemma Team, Google DeepMind. 2024. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118.

[14]

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[15]

Google. 2025. Gemini 2.5 Pro.

[16]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[17]

Zhongbin Guo, Zhen Yang, Yushan Li, and 1 others. Can LLMs see without pixels? benchmarking spatial intelligence from textual descriptions. arXiv preprint arXiv:2601.03590. Beijing Institute of Technology.

[19]

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[20]

Mengkang Hu, Tianxing Chen, Yifan Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Lu, and Ping Luo. 2025. TEXT2WORLD: Benchmarking large language models for symbolic world model generation. In Findings of the Association for Computational Linguistics: ACL 2025.

[21]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, and 1 others. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

[22]

Peiyao Jiang, Zequn Qin, and Xi Li. 2026. SpatialText: A pure-text cognitive benchmark for spatial understanding in large language models. arXiv preprint arXiv:2603.03002. Zhejiang University.

[23]

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3D environment for visual AI. Preprint, arXiv:1712.05474.

[25]

Maria Kozhevnikov and Mary Hegarty. 2001. A dissociation between object manipulation spatial ability and spatial orientation ability. Memory & Cognition, 29(5):745–756.

[26]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP).

[27]

Stephen C. Levinson. 2003. Space in Language and Cognition: Explorations in Cognitive Diversity. Language Culture and Cognition. Cambridge University Press.

[28]

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, and Jiajun Wu. 2024. Embodied agent interface: Benchmarking LLMs for embodied decision making. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmark...

[29]

Weijiang Li and 1 others. 2026. Do LLMs build spatial world models? evidence from grid-world maze tasks. arXiv preprint arXiv:2604.10690.

[30]

Yuxi Li, Shuyuan Niu, Sze Chai Wong, Mo Yu, Sanjeev J. Mehta, and Yiding Wu. 2024. Do large language models build internal world representations? probing through the lens of state abstraction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[31]

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, and Xin Cao. 2026. Spamem: Benchmarking dynamic spatial reasoning via perception-memory integration in embodied environments. arXiv preprint arXiv:2604.22409.

[32]

Asifa Majid, Melissa Bowerman, Sotaro Kita, Daniel B. M. Haun, and Stephen C. Levinson. 2004. Can language restructure cognition? the case for space. Trends in Cognitive Sciences, 8(3):108–114.

[33]

Nicolas Martorell. 2025. From text to space: Mapping abstract spatial models in LLMs during a grid-world navigation task. arXiv preprint arXiv:2502.16690.

[34]

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.

[35]

OpenAI. 2024. GPT-4o. https://openai.com/index/hello-gpt-4o/

[36]

Shubham Parashar, Blake Olson, Sambhav Khurana, Eric Li, Hongyi Ling, James Caverlee, and Shuiwang Ji. 2025. Inference-time computations for LLM reasoning and planning: A benchmark and insights. arXiv preprint arXiv:2502.12521. Texas A&M University.

[37]

Eric Pederson, Eve Danziger, David Wilkins, Stephen Levinson, Sotaro Kita, and Gunter Senft. Semantic typology and spatial conceptualization. Language, 74(3):557–589.

[39]

Tanawan Premsri and Parisa Kordjamshidi. 2025. FoREST: Frame of reference evaluation in spatial reasoning tasks. arXiv preprint arXiv:2502.17775.

[40]

Qwen Team. 2024. Qwen2.5: A party of foundation models.

[41]

Qwen Team. 2025. Qwen2.5-VL: Vision-language models. arXiv:2502.13923.

[42]

Qwen Team. 2025. Qwen3.5-flash. API release.

[43]

Md Imbesat Rizvi, Xiaodan Zhu, and Iryna Gurevych. 2024. SpaRC and SpaRP: Spatial reasoning characterization and path generation for understanding spatial reasoning capability of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4750–4767. Association for...

[44]

Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, and Peter Wonka. 2026. FloorplanQA: A benchmark for spatial reasoning in LLMs using structured representations. arXiv preprint arXiv:2507.07644. KAUST; v3 last revised 30 Jan 2026.

[45]

Roger N. Shepard and Jacqueline Metzler. 1971. Mental rotation of three-dimensional objects. Science, 171(3972):701–703.

[46]

Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. StepGame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11321–11329.

[48]

Leonard Talmy. 2000. Toward a Cognitive Semantics, Volume I: Concept Structuring Systems. MIT Press.

[49]

Manveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, and Jimmy Lin. Benchmarking LLM faithfulness in RAG with evolving leaderboards. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 799–811. Association for Computational Linguistics.

[51]

Thora Tenbrink. 2011. Reference frames of space and time in language. Journal of Pragmatics, 43(3):704–722.

[52]

TII Falcon Team. 2025. Falcon 3: Frontier open-weight language models from TII. Technology Innovation Institute Technical Report.

[53]

Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. 2024. Evaluating the world model implicit in a generative model. Advances in Neural Information Processing Systems.

[54]

Jean-Baptiste Van der Henst and Walter Schaeken. The influence of language on spatial reasoning: Reading habits modulate the formulation of conclusions and the integration of premises. Frontiers in Psychology. PMCID: PMC8165199.

[56]

Vectara. 2025. Hughes hallucination evaluation model (HHEM) leaderboard. https://github.com/vectara/hallucination-leaderboard

[57]

Ruoyao Wang, Graham Todd, Ziang Xiao, Xingdi Yuan, Marc-Alexandre Côté, Peter Clark, and Peter Jansen. 2024. Can language models serve as text-based world simulators? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.

[58]

Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2025. Thoughts are all over the place: On the underthinking of o1-like LLMs. arXiv preprint arXiv:2501.18585. Tencent AI Lab; NeurIPS 2025.

[59]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837.

[60]

Dongil Yang, Minjin Kim, Sunghwan Kim, Beongwoo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, and Jinyoung Yeo. 2025. LLM meets scene graph: Can large language models understand and generate scene graphs? a benchmark and empirical study. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...

[61]

Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, and Ziqiao Ma. Do vision-language models represent space and how? evaluating spatial frame of reference under ambiguities. In The Thirteenth International Conference on Learning Representations (ICLR). Oral Presentation.

[63]

Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. 2024. Aya model: An instruction finetuned open-access multilingual language model. In Proc...

The cup is to my left on the kitchen counter

and the operationalization of Premsri and Kordjamshidi [35]. Table 8 gives an English example for each frame; the same scene is rendered in the eight evaluation languages.

A.7 Hallucination Evaluator Details

The L5 hallucination evaluator computes four decomposed sub-metrics: (i) node F1: token-level F1 over the set of objects in the generated graph ...