
arxiv: 2606.31285 · v1 · pith:WM7NE7PMnew · submitted 2026-06-30 · 💻 cs.AI

Spatial Reasoning via Modality Switching Between Language and Symbolic Representation

Pith reviewed 2026-07-01 05:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords spatial reasoning · modality switching · large language models · grid representation · trustworthiness signals · complexity signals · multi-hop reasoning

The pith

Switching LLMs from language to grid representations improves spatial reasoning by up to 42%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models reason better on spatial stories when they switch from pure text to drawing grids or layouts. It introduces a metric using trustworthiness and complexity signals to predict when this switch helps. If the metric works, models can choose the right representation automatically instead of always staying in language. This matters because human reasoning often uses diagrams for hard spatial problems, and LLMs might gain similar flexibility. The experiments show gains of up to 42 percent when the switch is applied.

Core claim

Grounding multi-hop textual-spatial stories into geometry-aware modalities such as grids improves reasoning over natural language inference alone, and a switching metric based on trustworthiness and complexity signals can estimate when this grounding is likely to help.
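
To make "grounding a story into a grid" concrete, the following is a minimal sketch of one possible grounding, not the authors' pipeline. The relation vocabulary, the unit offsets in `OFFSETS`, and the helper names `ground_story` and `relation_between` are all assumptions introduced for illustration. Each fact is a triple read as "subject is <relation> of reference".

```python
# Hypothetical sketch: place objects on an integer grid from relation triples.
# The relation names and unit offsets are illustrative assumptions only.
OFFSETS = {
    "left": (-1, 0),
    "right": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
    "upper_left": (-1, 1),
    "upper_right": (1, 1),
    "lower_left": (-1, -1),
    "lower_right": (1, -1),
}


def ground_story(facts):
    """Return {object: (x, y)} for facts of the form (subject, relation, reference).

    The reference of the first fact is pinned at the origin. Facts are
    revisited until no further object can be placed.
    """
    coords = {}
    pending = list(facts)
    if pending:
        coords[pending[0][2]] = (0, 0)
    progress = True
    while pending and progress:
        progress = False
        for fact in list(pending):
            subj, rel, ref = fact
            dx, dy = OFFSETS[rel]
            if ref in coords and subj not in coords:
                x, y = coords[ref]
                coords[subj] = (x + dx, y + dy)
            elif subj in coords and ref not in coords:
                x, y = coords[subj]
                coords[ref] = (x - dx, y - dy)
            elif subj not in coords and ref not in coords:
                continue  # neither end is anchored yet; retry on a later pass
            # If both ends are already placed, the fact is dropped unchecked.
            pending.remove(fact)
            progress = True
    return coords


def relation_between(coords, a, b):
    """Read the direction of object a relative to object b off the grid."""
    ax, ay = coords[a]
    bx, by = coords[b]
    horiz = "left" if ax < bx else "right" if ax > bx else ""
    vert = "upper" if ay > by else "lower" if ay < by else ""
    if horiz and vert:
        return f"{vert}_{horiz}"
    if horiz:
        return horiz
    if vert:
        return "above" if vert == "upper" else "below"
    return "overlap"


# A three-hop toy story (invented for this example).
story = [("A", "left", "B"), ("B", "above", "C"), ("D", "lower_right", "C")]
grid = ground_story(story)
print(relation_between(grid, "A", "D"))  # upper_left
```

Once the objects have coordinates, a multi-hop question reduces to a single comparison, which is the intuition behind preferring a grid for long relation chains. The sketch ignores distances, containment, and contradictory facts.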

What carries the argument

The switching metric, which combines trustworthiness and complexity signals to decide between language and grid modalities.

Load-bearing premise

The switching metric built from trustworthiness and complexity signals accurately predicts when switching to a grid will improve performance over language-only reasoning.
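
As a rough illustration of what such a policy could look like, here is a sketch under stated assumptions. The two threshold values are the ones quoted in the Figure 4 caption (τt=0.95, τc=0.50). Everything else is hypothetical: the abstract does not say how the signals are computed, in which direction each threshold applies, or how the two tests are combined, so the "low trust OR high complexity" rule and the names `SwitchPolicy` and `choose` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchPolicy:
    """Hypothetical threshold rule for choosing a reasoning modality."""

    tau_t: float = 0.95  # trustworthiness threshold quoted in the Figure 4 caption
    tau_c: float = 0.50  # complexity threshold quoted in the Figure 4 caption

    def choose(self, trust: float, complexity: float) -> str:
        # ASSUMPTION: switch to the grid when the language-only reasoning looks
        # untrustworthy OR the story looks complex. The source does not
        # confirm this direction or this way of combining the two signals.
        if trust < self.tau_t or complexity >= self.tau_c:
            return "grid"
        return "language"


policy = SwitchPolicy()
for trust, complexity in [(0.99, 0.20), (0.80, 0.20), (0.99, 0.70)]:
    print(trust, complexity, policy.choose(trust, complexity))
```

Keeping the thresholds as fields makes it easy to sweep them, which is what a validation of the premise above would need.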

What would settle it

Run the switching metric on a held-out set of spatial stories, apply the predicted modality to each story, and check whether accuracy rises over language-only reasoning by a margin comparable to the reported gains, which reach up to 42 percent.
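
A minimal sketch of such a check follows, assuming per-story records that store whether each modality answered correctly and whether the metric chose to switch. The record fields (`text_correct`, `grid_correct`, `switch`), the function names, and the four toy records are all invented for illustration and are not results from the paper.

```python
import random


def accuracy(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def compare_policies(records, seed=0):
    """Score the metric-driven policy against simple reference policies."""
    rng = random.Random(seed)
    switch_rate = accuracy(r["switch"] for r in records)

    def outcome(r, use_grid):
        return r["grid_correct"] if use_grid else r["text_correct"]

    return {
        "always_language": accuracy(r["text_correct"] for r in records),
        "always_grid": accuracy(r["grid_correct"] for r in records),
        "metric_switch": accuracy(outcome(r, r["switch"]) for r in records),
        # Random policy that switches as often as the metric does.
        "random_switch": accuracy(
            outcome(r, rng.random() < switch_rate) for r in records
        ),
        # Upper bound: always pick a modality that answers correctly.
        "oracle": accuracy(r["text_correct"] or r["grid_correct"] for r in records),
    }


# Toy records, hypothetical values only.
records = [
    {"text_correct": True, "grid_correct": True, "switch": False},
    {"text_correct": False, "grid_correct": True, "switch": True},
    {"text_correct": True, "grid_correct": False, "switch": False},
    {"text_correct": False, "grid_correct": False, "switch": True},
]
results = compare_policies(records)
print(results["always_language"], results["metric_switch"], results["oracle"])
```

The claim is only supported if `metric_switch` beats both fixed policies and the rate-matched random policy on stories that were not used to set the thresholds.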

Figures

Figures reproduced from arXiv: 2606.31285 by Parisa Kordjamshidi, Shreya Rajpal, Tanawan Premsri.

Figure 1: Natural language vs. grid-based reasoning in a multi-hop spatial setting.
Figure 2: Mapping the same story and question from Figure…
Figure 3: Overview of the proposed pipeline and switching mechanism, using the same story and question input as…
Figure 4: Trustworthiness and complexity predict when switching helps. Accuracy of Qwen3-32B on STEPGAME (N=1000) across binned trustworthiness scores (left) and complexity scores (right). Small n on the x-axis denotes the number of items in each bin. Red lines show the switching thresholds, τt=0.95 and τc=0.50, and peach regions mark where the policy switches to grid-based reasoning.
Figure 5: Success case where text-only reasoning fails on a 5-hop diagonal chain, while both full and pruned grids…
Figure 6: Failure case where both grid views are incorrect despite correct relation extraction. The full grid misreads…
Figure 7: A successful topological reasoning case where both full and pruned grids preserve nested containment.
Figure 8: An example of containment partially recovered with extra direction added.
Figure 9: An example of full grid misses containment due to extra objects.
Figure 10: An example from ResQ dataset correctly answered by text-only reasoning and by both grid views.
Figure 11: An example from ResQ dataset incorrectly answered by text-only and full grid reasoning and correctly…
Figure 12: Worked switching example on an 8-hop StepGame instance. Faithfulness is high (…
Original abstract

Human reasoning is inherently multimodal: when problems become difficult, we rarely think in words alone. We often externalize our reasoning by sketching diagrams or drawing grids to understand the underlying conceptual structure and avoid mistakes. Building on this premise, our research investigates: (a) whether grounding multi-hop textual-spatial stories into geometry-aware modalities, such as layouts or grids, improves reasoning compared to natural language-based inference; and (b) whether a model can decide when to rely on natural language reasoning and when to switch to a structured modality. We address these questions by introducing a switching metric based on trustworthiness and complexity signals, which estimates when grounding a spatial story into structure is likely to improve performance. This takes a first step toward principled modality selection in Large Language Model (LLM) reasoning. Across our settings, switching from natural language-based reasoning to a grid-based representation improves LLM performance by up to 42%, highlighting the importance of modality choice in shaping reasoning outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that grounding multi-hop textual-spatial stories into grid-based symbolic representations improves LLM performance over language-only reasoning, and introduces a switching metric derived from trustworthiness and complexity signals to decide when to switch modalities, reporting gains of up to 42% across settings.

Significance. If the switching metric is shown to reliably predict gains and the empirical results are robust, the work could contribute to better understanding of modality choice in LLM reasoning for spatial tasks.

major comments (2)
  1. [Switching metric definition and evaluation] The headline claim of up to 42% improvement via switching depends on the metric correctly identifying beneficial cases, but no correlation, precision-recall, or ablation against random/always-language baselines is reported to validate this predictive link (see abstract and any experiments section).
  2. [Empirical evaluation] No experimental details are supplied on dataset sizes, baselines, statistical tests, error analysis, or how the 42% figure was computed, preventing assessment of whether the result supports the central claim (abstract).
minor comments (1)
  1. The abstract refers to 'our settings' and 'multi-hop textual-spatial stories' without defining the tasks or stories used.
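
The precision-recall check requested in the first major comment could be sketched as follows. This is one possible formulation, not something the paper reports: a switch is counted as "helpful" when the grid answers correctly and language-only reasoning does not. The record fields and the toy records are hypothetical.

```python
def switch_precision_recall(records):
    """Precision and recall of switch decisions against 'switching helps' labels.

    A story counts as a positive when the grid is correct and the
    language-only answer is wrong (an assumed definition of 'helps').
    """
    tp = fp = fn = 0
    for r in records:
        helps = r["grid_correct"] and not r["text_correct"]
        if r["switch"] and helps:
            tp += 1  # switched, and switching fixed the answer
        elif r["switch"]:
            fp += 1  # switched without any benefit
        elif helps:
            fn += 1  # stayed in language although the grid would have helped
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy records, hypothetical values only.
toy = [
    {"text_correct": False, "grid_correct": True, "switch": True},
    {"text_correct": True, "grid_correct": True, "switch": True},
    {"text_correct": False, "grid_correct": True, "switch": False},
    {"text_correct": True, "grid_correct": False, "switch": False},
]
print(switch_precision_recall(toy))  # (0.5, 0.5)
```

Low precision would mean the metric triggers grid construction where it brings no benefit; low recall would mean that gains available from the grid are left unused.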

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in validation of the switching metric and in experimental reporting. We will revise the manuscript to address both points by adding the requested analyses and details.

point-by-point responses
  1. Referee: [Switching metric definition and evaluation] The headline claim of up to 42% improvement via switching depends on the metric correctly identifying beneficial cases, but no correlation, precision-recall, or ablation against random/always-language baselines is reported to validate this predictive link (see abstract and any experiments section).

    Authors: We agree that the predictive link between the trustworthiness-and-complexity switching metric and observed gains requires explicit validation. The current manuscript defines the metric and reports aggregate gains but does not include correlation coefficients, precision-recall for the switch decisions, or ablations versus random or always-language baselines. In revision we will add these evaluations in the experiments section to demonstrate that the metric reliably identifies cases where modality switching is beneficial.
    revision: yes

  2. Referee: [Empirical evaluation] No experimental details are supplied on dataset sizes, baselines, statistical tests, error analysis, or how the 42% figure was computed, preventing assessment of whether the result supports the central claim (abstract).

    Authors: We acknowledge that the abstract and experiments section currently omit these details. We will expand the experimental reporting to specify dataset sizes, all baselines, statistical significance tests, error analysis, and the exact computation of the reported gains (including per-setting breakdowns that yield the maximum of 42%).
    revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claim rests on external experimental outcomes

full rationale

The abstract introduces a switching metric constructed from trustworthiness and complexity signals to estimate when grid grounding will help, then reports an empirical performance gain of up to 42% when switching is applied. No equations, definitions, or self-citations are shown that make the metric or the gain reduce to its own inputs by construction. The 42% figure is presented as a measured experimental result rather than a fitted or renamed quantity, and the derivation chain remains open to external validation on held-out stories.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level description of the switching metric itself.

invented entities (1)
  • switching metric · no independent evidence
    purpose: Estimate when to switch from language to grid representation
    Introduced as the core new component for deciding modality use

pith-pipeline@v0.9.1-grok · 5698 in / 995 out tokens · 24847 ms · 2026-07-01T05:39:36.211334+00:00 · methodology

