
arxiv: 2605.28277 · v1 · pith:5IO2DY2Qnew · submitted 2026-05-27 · 💻 cs.AI

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Pith reviewed 2026-06-29 11:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLMs · spatial reasoning · world models · multilingual benchmark · viewpoint reasoning · reasoning cliff · text-only evaluation

The pith

LLMs fail to retain even half their atomic spatial accuracy once tasks require viewpoint reasoning from text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can build consistent internal spatial world models from pure text descriptions by creating MentalMap, a benchmark spanning eight languages and a six-level hierarchy of tasks. It finds that every model hits a sharp performance cliff at level L3, where viewpoint changes must be tracked, even when simpler atomic facts are handled correctly. The drop occurs regardless of model size, language, or prompting method. The same pattern appears in human subjects given identical text-only inputs, which the authors take to suggest that the limit stems from text-only working-memory constraints rather than model architecture. This reframes text-based spatial reasoning as requiring simultaneous handling of multiple reference frames and memory demands.

Core claim

No evaluated LLM retains even half of its L0 performance on viewpoint reasoning tasks once baseline atomic accuracy exceeds 40 percent. The L3 cliff appears uniformly across thirteen models, eight typologically diverse languages, multiple scales, and prompting strategies. Human evaluators under the same pure-text protocol reproduce the identical failure pattern, which the authors attribute to inherent constraints of text-only working memory rather than any model-specific limitation.
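The criterion can be read as a retention ratio, L3 accuracy divided by L0 accuracy, checked only for cells whose L0 accuracy exceeds 40%. A minimal Python sketch of that reading, assuming accuracies are fractions in [0, 1]; the function names and the sample numbers are hypothetical illustrations, not values or code from the paper:

```python
from typing import Optional

def retention(l0_acc: float, l3_acc: float) -> Optional[float]:
    """Return L3/L0, or None when L0 is zero and the ratio is undefined."""
    if l0_acc <= 0.0:
        return None
    return l3_acc / l0_acc

def shows_cliff(l0_acc: float, l3_acc: float,
                l0_floor: float = 0.40,
                max_retention: float = 0.5) -> Optional[bool]:
    """True if L0 exceeds the floor and L3 retains less than half of L0.

    Returns None when L0 is at or below the floor, where the claim
    makes no statement about the cell.
    """
    if l0_acc <= l0_floor:
        return None
    return l3_acc / l0_acc < max_retention

# Hypothetical (model, language) cells: (L0 accuracy, L3 accuracy).
cells = {
    ("model_a", "en"): (0.80, 0.30),
    ("model_a", "th"): (0.35, 0.20),  # below the 40% floor, so not covered
    ("model_b", "en"): (0.60, 0.10),
}
verdicts = {key: shows_cliff(l0, l3) for key, (l0, l3) in cells.items()}
```

Under this reading, a single cell with `shows_cliff(...) == False` would be a counterexample to the universality claim, while `None` cells are simply outside its scope.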

What carries the argument

The MentalMap benchmark, built from 100 ProcTHOR scenes and organized into a six-level hierarchy (L0 atomic facts through L5 generative world-graph construction) plus four diagnostic axes for frame of reference, reading-direction bias, reasoning effort, and hallucination.

If this is right

  • Viewpoint reasoning forms a distinct bottleneck separate from atomic spatial fact retrieval.
  • The performance cliff cannot be overcome by increasing model scale or changing prompting strategies.
  • Structured-output failures and hallucination rates vary by model family while the L3 cliff remains constant.
  • Pure-text spatial reasoning must be treated as a multi-axis problem involving reference frames and memory load.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multimodal inputs that supply visual structure could bypass the observed text-only memory limit.
  • Analogous cliffs may exist in other domains that require maintaining consistent internal state across transformations, such as causal or temporal reasoning.
  • Testing scratchpad or external memory augmentation on the same tasks would directly measure whether the bottleneck is removable.

Load-bearing premise

The six-level hierarchy and thirty-nine task families isolate genuine progressive spatial world-modeling skills without confounds from language structure or task design.

What would settle it

Provide models with an external coordinate grid or diagram of each scene and check whether viewpoint-reasoning accuracy then rises above half the atomic baseline. Improvement would support the text-only working-memory account, since offloading the scene removes the memory load; sustained failure would point to a limit that text-only memory does not explain.

Figures

Figures reproduced from arXiv: 2605.28277 by Chih-Ting Liao, Chunlei Meng, Chunrui Liu, Xin Cao, Xi Xiao, Yitong Qiao, Zhangquan Chen, Zhikai Pan.

Figure 1: MENTALMAP overview. (A) Scene-to-text construction. ProcTHOR scenes are decomposed into traceable world-state artifacts (objects, receptacles, containment, support, spatial relations), then converted into single- and multi-action temporal cases that instantiate benchmark items with canonical ground-truth, output-format, normaliser, and evaluator keys. (B) Six-level capability staircase (L0–L5), from surfac…
Figure 2: Universal L3 viewpoint cliff (F1). Each point is one (model, language) pair. All points lie below L3=L0/2; most below L3=L0/4.
… ja 48.9%, Gemma en 36.7%) but near-zero for several frontier closed-source models. The frame-transformation family is most consistently solvable by frontier closed-source models (GPT-4o 21–52%) while most open-weight models score 0% in seven of eight languages. Aggregate L3 thus …
Figure 3: Reasoning-prompt effect is level-stratified (F3).
Figure 4: Multilingual fingerprints (F6). Heatmap of the nine principal LLMs × eight languages (Qwen2.5-32B scale control and Qwen2.5-VL ablation excluded from this view); cells encode within-model z-score (color) and absolute pass rate (number, mean of L3–L5). The two dendrograms are hierarchical-clustering trees: closer branches share more similar performance profiles. The top tree clusters languages (en/de/es vs.…
Figure 5: Full L5 node F1 vs. edge F1 scatter across all (model, sub-task, language) cells. Diagonal marks node = edge. Marker shape distinguishes closed-source (•) from open-weight models.
[Plot labels extracted at this point: PC1 (68% variance); PC2 (13% variance); script (color): Latin (en/de/es), CJK (zh/ja/ko), Arabic (ar), Thai (th); model (shape): closed-source, Llama, Qwen / Gemma, Mistral / Falcon3]
Figure 6: PCA embedding of (model, language) profiles
Original abstract

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the MentalMap benchmark, a multilingual diagnostic for spatial reasoning in LLMs featuring a six-level hierarchy (L0-L5) from atomic facts to world-graph construction, based on 100 ProcTHOR scenes across eight languages. Evaluating thirteen LLMs, it reports a universal L3 cliff in viewpoint reasoning: no model retains half its L0 performance once atomic accuracy exceeds 40%, persisting across languages, scales, and prompts. Human evaluations replicate the pattern, attributing it to text-only working memory constraints. Diagnostic axes cover frame of reference, reading bias, effort allocation, and hallucination.

Significance. If the L3 cliff is shown to reflect absent world-model construction rather than task-length confounds, the result would be significant for LLM reasoning research by reframing pure-text spatial capabilities as limited by working memory and motivating multimodal or scratchpad methods. The multilingual scope, structured-text control, 39 task families, and human replication under identical protocol are clear strengths that add diagnostic value beyond monolingual English benchmarks.

major comments (2)
  1. [Abstract] The central claim that 'no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%' supplies no information on statistical tests, error bars, data exclusion rules, or how the 40% threshold and half-performance criterion were chosen, leaving the universality of the L3 cliff difficult to evaluate.
  2. [Abstract / hierarchy description] L3 viewpoint-reasoning tasks embed longer chains of relations and more tokens than L0 atomic facts. The persistence across prompting strategies is noted but does not substitute for an explicit ablation that holds chain length and prompt token count fixed while varying only the spatial integration demand; without it the cliff risks being explained by standard transformer working-memory limits rather than missing world-model construction.
minor comments (2)
  1. Clarify the exact distribution of the 1,950 evaluation cells across the 39 task families, languages, and the four diagnostic axes.
  2. The abstract could list the specific prompting strategies tested so that the 'persists across prompting strategies' claim can be directly replicated.
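
Major comment 2 asks for a comparison that holds prompt length fixed. One way to approximate this from existing per-item results, short of the requested ablation, is to bin items by prompt token count and compare L0 and L3 accuracy only within bins that contain both levels. A rough Python sketch, assuming each item record carries a level label, a token count, and a correctness flag; the record layout, the bin width, and the sample data are hypothetical and not taken from the paper:

```python
from collections import defaultdict

def length_matched_accuracy(items, bin_width=50):
    """Report per-level accuracy inside token-count bins shared by L0 and L3.

    items: iterable of (level, n_tokens, correct) tuples.
    Returns {bin_start: {level: accuracy}} for bins holding both levels.
    """
    # bin_start -> level -> [number correct, number of items]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for level, n_tokens, correct in items:
        bin_start = (n_tokens // bin_width) * bin_width
        tally = counts[bin_start][level]
        tally[0] += int(correct)
        tally[1] += 1

    matched = {}
    for bin_start, by_level in counts.items():
        # Keep only bins where both levels appear, so lengths are comparable.
        if "L0" in by_level and "L3" in by_level:
            matched[bin_start] = {lvl: c / t for lvl, (c, t) in by_level.items()}
    return matched

# Hypothetical item records.
items = [
    ("L0", 120, True), ("L0", 130, True),
    ("L3", 140, False), ("L3", 110, True),
    ("L0", 40, True),    # bin 0 has no L3 items, so it is dropped
    ("L3", 310, False),  # bin 300 has no L0 items, so it is dropped
]
result = length_matched_accuracy(items)
```

If the L0-to-L3 gap persists inside matched bins, prompt length alone becomes a less likely explanation. This controls token count only; relation-chain length would need its own matching variable.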

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results on the L3 cliff. We respond to each major comment below.

  1. Referee: [Abstract] The central claim that 'no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%' supplies no information on statistical tests, error bars, data exclusion rules, or how the 40% threshold and half-performance criterion were chosen, leaving the universality of the L3 cliff difficult to evaluate.

    Authors: We agree the abstract would be strengthened by these details. Error bars appear in all main figures, statistical tests are reported in Section 4.3 and Appendix B, and data exclusion (primarily invalid parses) follows the protocol in Section 3.2. The 40% threshold marks the observed inflection where retention falls below half of L0 across models; the half-performance criterion is a direct relative measure. We will revise the abstract to reference these elements concisely. Revision: yes.

  2. Referee: [Abstract / hierarchy description] L3 viewpoint-reasoning tasks embed longer chains of relations and more tokens than L0 atomic facts. The persistence across prompting strategies is noted but does not substitute for an explicit ablation that holds chain length and prompt token count fixed while varying only the spatial integration demand; without it the cliff risks being explained by standard transformer working-memory limits rather than missing world-model construction.

    Authors: This concern is well-taken. While the cliff persists across prompting variants that alter length and structure, and human evaluations under the same pure-text protocol show the identical pattern, we did not include an ablation that holds chain length and token count exactly fixed while isolating spatial integration. We will add a dedicated discussion of this potential confound in the revised manuscript, framing the result as consistent with text-only working-memory limits and outlining how future synthetic-task experiments could further isolate the factors. Revision: partial.
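
The first exchange turns on how uncertain the "half of L0" criterion is for any one cell. One generic way to attach an interval to the L3/L0 retention ratio is a percentile bootstrap over per-item outcomes. The sketch below is an illustration under stated assumptions, not the procedure of the paper's Section 4.3: it assumes L0 and L3 items are independent samples, represents each item as 1 (correct) or 0 (incorrect), and uses hypothetical function names and data.

```python
import random

def bootstrap_retention_ci(l0_correct, l3_correct,
                           n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for (L3 accuracy) / (L0 accuracy).

    l0_correct, l3_correct: lists of 0/1 outcomes, resampled independently.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    ratios = []
    for _ in range(n_boot):
        l0_sample = rng.choices(l0_correct, k=len(l0_correct))
        l3_sample = rng.choices(l3_correct, k=len(l3_correct))
        l0_acc = sum(l0_sample) / len(l0_sample)
        if l0_acc == 0:
            continue  # ratio undefined for this resample
        ratios.append((sum(l3_sample) / len(l3_sample)) / l0_acc)
    ratios.sort()
    lower = ratios[int((alpha / 2) * len(ratios))]
    upper = ratios[min(len(ratios) - 1, int((1 - alpha / 2) * len(ratios)))]
    return lower, upper

# Hypothetical cell: 80/100 correct at L0, 25/100 correct at L3.
l0_outcomes = [1] * 80 + [0] * 20
l3_outcomes = [1] * 25 + [0] * 75
lower, upper = bootstrap_retention_ci(l0_outcomes, l3_outcomes)
```

A cell would support the cliff claim with some margin only if the upper end of the interval stays below 0.5; an interval that straddles 0.5 leaves that cell undecided.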

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or self-referential reductions


The paper introduces MentalMap as a new multilingual benchmark with a six-level hierarchy and reports empirical performance results across 13 LLMs on 39 task families. All central claims (universal L3 cliff once L0 exceeds 40%, persistence across languages and prompts, human-model alignment) are direct observational outcomes from evaluation cells rather than predictions derived from equations, fitted parameters, or self-citations. No load-bearing step reduces by construction to its inputs; the hierarchy functions as a descriptive taxonomy for task design, not a self-defining loop. The work is self-contained against external benchmarks and contains no mathematical derivations that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark and hierarchy but the abstract mentions no fitted parameters, no ad-hoc axioms beyond standard evaluation practices, and no new postulated entities.

pith-pipeline@v0.9.1-grok · 5780 in / 1178 out tokens · 39601 ms · 2026-06-29T11:59:14.120770+00:00 · methodology

