arxiv: 2605.18380 · v1 · pith:4JYVJR2Onew · submitted 2026-05-18 · 💻 cs.AI

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

Pith reviewed 2026-05-20 11:14 UTC · model grok-4.3

keywords qualitative spatial reasoning · temporal reasoning · language model evaluation · benchmark · Region Connection Calculus · Allen's Interval Algebra · Point Algebra · conceptual neighbourhoods

The pith

Current language models exceed random guessing on qualitative spatial and temporal reasoning but cannot solve all problems consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents QSTRBench, a benchmark designed to evaluate how well large language models can perform qualitative spatial and temporal reasoning using established calculi. The benchmark includes questions on composing relations, finding converse relations, and identifying conceptual neighborhoods for calculi ranging from simple Point Algebra to complex Region Connection Calculus variants. Testing on frontier models shows they generally do better than chance but vary widely in success depending on the specific calculus, with none achieving full accuracy. The authors release the benchmark publicly along with the new conceptual neighborhood for RCC-22 to encourage further research into improving these reasoning capabilities in AI systems.
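
To make the three question types concrete, here is a minimal Python sketch for Point Algebra, the simplest calculus in the benchmark. The tables are the standard ones for the relations <, = and > between points on a line; the function and variable names are illustrative assumptions and are not taken from the QSTRBench release.

```python
from itertools import product

# Point Algebra (PA) base relations between two points on a line.
BASE = ("<", "=", ">")

# Converse: the relation obtained by swapping the two arguments.
CONVERSE = {"<": ">", "=": "=", ">": "<"}

# Composition table: COMPOSE[(r1, r2)] is the set of relations that may hold
# between x and z, given r1(x, y) and r2(y, z).
COMPOSE = {
    ("<", "<"): {"<"},
    ("<", "="): {"<"},
    ("<", ">"): {"<", "=", ">"},
    ("=", "<"): {"<"},
    ("=", "="): {"="},
    ("=", ">"): {">"},
    (">", "<"): {"<", "=", ">"},
    (">", "="): {">"},
    (">", ">"): {">"},
}

# Conceptual neighbourhood: "<" and ">" are each neighbours of "=",
# but not of each other (a point must pass through "=" to get from one to the other).
NEIGHBOURS = {"<": {"="}, "=": {"<", ">"}, ">": {"="}}


def compose(r1, r2):
    """Look up the composition of two PA base relations."""
    return COMPOSE[(r1, r2)]


def brute_force_compose(r1, r2, domain=range(3)):
    """Recompute a table entry by enumerating small integer triples (x, y, z)."""
    def rel(a, b):
        return "<" if a < b else ">" if a > b else "="

    return {
        rel(x, z)
        for x, y, z in product(domain, repeat=3)
        if rel(x, y) == r1 and rel(y, z) == r2
    }
```

A composition question such as "if >(x,y) and >(y,z), what relations are possible between x and z?" is then answered by `compose(">", ">")`, and `brute_force_compose` gives an independent check of each table entry.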

Core claim

The paper establishes QSTRBench as a comprehensive evaluation tool for LLMs on QSTR tasks involving compositional reasoning via composition tables, converse relations, and conceptual neighbourhoods across multiple calculi including PA, Allen's Interval Algebra, INDU, RCC-5, RCC-8, RCC-22, and others. It reports that all tested contemporary frontier models perform above random guessing levels but fail to answer every question correctly, with performance differing markedly by calculus type: easiest for PA and hardest for RCC-22. The work also introduces the RCC-22 conceptual neighbourhood for the first time and provides an extended version of the benchmark that varies question formats.

What carries the argument

QSTRBench, the benchmark consisting of questions on composition, converse, and conceptual neighbourhood reasoning for qualitative spatial and temporal calculi.
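
A converse question for RCC-8 reduces to a table lookup. The sketch below uses the eight base relation names listed in the paper's Figure 1 caption; the helper name `converse_of` is an assumption made for illustration and does not come from the benchmark code.

```python
# RCC-8 base relations and their converses. DC, EC, PO and EQ are symmetric;
# TPP/TPPi and NTPP/NTPPi are converse pairs.
RCC8_CONVERSE = {
    "DC": "DC",
    "EC": "EC",
    "PO": "PO",
    "EQ": "EQ",
    "TPP": "TPPi",
    "TPPi": "TPP",
    "NTPP": "NTPPi",
    "NTPPi": "NTPP",
}


def converse_of(relations):
    """Return the converse of a set (disjunction) of RCC-8 base relations."""
    return {RCC8_CONVERSE[r] for r in relations}
```

Applying `converse_of` twice returns the original set, which is a quick sanity check on any converse table.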

If this is right

  • Models can handle simpler calculi like PA better than complex ones like RCC-22.
  • No current model achieves consistent correctness across all question types and calculi.
  • Variations in question presentation affect how the benchmark tests reasoning.
  • Open release of the benchmark enables community-wide assessment of LLM capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLMs may be using statistical associations from training data rather than performing genuine logical reasoning on these calculi.
  • Extending the benchmark to include visual or multimodal inputs could reveal whether spatial reasoning improves with additional modalities.
  • The difficulty ordering of calculi might guide the development of specialized training data for improving model performance on harder cases.

Load-bearing premise

The benchmark questions test genuine qualitative reasoning rather than being answerable through patterns learned from training data or prompt engineering tricks.

What would settle it

A language model that correctly answers every question in the benchmark across all calculi and presentation variations would falsify the claim that no current models can consistently solve them.

Figures

Figures reproduced from arXiv: 2605.18380 by Anthony G. Cohn, Robert E. Blackwell.

Figure 1: The eight base relations of RCC-8 illustrated in 2D [9]: DC (Disconnected), EC (Externally Connected), PO (Partially Overlapping), TPP (Tangential Proper Part), NTPP (Nontangential Proper Part) and EQ (Equals); TPPi and NTPPi are the converses of TPP and NTPP respectively since they are asymmetric. The arrows denote relations which are conceptual neighbours.
    The ability of LLMs to reason about RCC-8 was fi…
Figure 2: Accuracy of the LLMs tested on our QSTRBench benchmark (converse, CT, CN, and combined questions) using strict evaluation (answers must be precisely correct). The red dotted line is the guess rate. Green bars are open weights models. Experimental repeats are indicated by n=3, but these are not affordable for all models.
Figure 3: Accuracy by model by calculus.
    … reasoning effort outperforms all our other GPT model experiments, so perhaps OpenAI were still refining the question complexity classification algorithm in the GPT-5.1 release. Although setting high reasoning effort in GPT-5.2 improves performance across most calculi, it makes RCC-5 performance slightly worse (0.96 to 0.95, …
Figure 4: Accuracy by calculus (solid line), broken down by Composition Table, Conceptual Neighbourhood and Converse questions (bars), for all LLMs tested. Dotted lines show the guess rate. Error bars show the prediction interval. The number of base relations for a calculus is shown in brackets beside the label.
    There has been an extraordinary improvement in model accuracy over time, e.g. OpenAI models improved fro…
Figure 5: The five relations of the RCC-5 calculus, depicted with its conceptual neighbourhood: DR (Discrete), PO (Partially Overlapping), PP (Proper Part) and EQ (Equals). PPi is the converse of PP.
    … is only =), but GPT-3.5 Turbo (10/15) answered with all relations <, =, and >, suggesting that it has no real intuition about PA. Similarly, Kimi K2 (14/15) answers the question If >(x,y) and >(y,z) then what are the p…
Figure 6: Depiction of the nine relations of the Cardinal Direction Calculus (CDC). For example N(x,y) means that x lies along the line that extends due north of y. NE(x,y) means that x lies to the east of the line that extends due north of y, and to the north of the line that extends due east of y, and so on.
    … unsurprising that EQ needs less reasoning effort, but why PPi needs the most reasoning effort remains unclear …
Figure 7: Distribution of output tokens by correct answer direction for kimi-k2 on CDC - eponymous relations on the left, nonce word relations on the right. Note that intercardinal directions use more tokens than cardinal directions in the eponymous case. The effect is somewhat true for nonce relations on the right except that South is an anomaly.
    … (CDC), each with nuanced semantics, including [35] and [67]. The CDC …
Figure 8: Accuracy of RCC-8 answers by description style for o1 (3 repeats of n=80 questions per bar). The error bar is the prediction interval.
    … generalise and draw upon both its training data and the information given in the prompt. When the RCC-8 relation names are swapped, accuracy is still 0.80 suggesting that o1 can use information given in the prompt to override training data, either by explicitly reasoning, o…
Figure 9: GPT-5.2 (with high reasoning effort) answers to the INDU questions. Each cell shows the number of correct answers across three repeats. The CT answers are referenced by R1 and R2. The CN and converse answers for R1 are shown as additional columns to the left of the table.
Figure 10: The eight regions, numbered 0 to 7, in the STAR calculus. ID is the identity relation (not shown explicitly but is the central point).
    None of the LLMs tested gets all INDU questions correct ( …
Figure 11: The nine intersection model (9IM, figure courtesy of Egenhofer [43]). The relations are Disjoint (D - similar to DC), Contains (CT - similar to NTPPi), Inside (I - similar to NTPP), Equal (E - similar to EQ), Meets (M - similar to EC), Covers (CV - similar to TPPi), Covered By (CB - similar to TPP), and Overlap (O - similar to PO).
Figure 12: The RCC-22 EC relations from [9]. RCC-22 also includes EQ, PO, TPP, TPPi, NTPP and NTPPi as described for RCC-8. OUTSIDE_OUTSIDEi_DC (OOD), P-INSIDE_OUTSIDEi_DC (POD), INSIDE_OUTSIDEi_DC (IOD), INSIDE-P_INSIDEi_DC (IPD), P-INSIDE_P-INSIDEi_DC (PPD), OUTSIDE-P_INSIDEi_DC (OPD), OUTSIDE_INSIDEi_DC (OID), and P-INSIDE_INSIDEi_DC (PID) are similar to the above except that the regions are disconnected. Note th…
Figure 13
Figure 14: Comparison of GPT-5.2 with high reasoning effort answers to the RCC-8 questions (left) and 9IM question (right). Each cell shows the number of correct answers across three repeats. The CT answers are referenced by R1 and R2. The CN and converse answers for R1 are shown as additional columns to the left of each table.
    … answers across the three repeats for 9IM but only one for RCC8. For the CN there are 10 i…
Figure 15: Accuracy by description style for o1 answers to PA, IA, INDU, RCC-5, RCC-8 and RCC-22 questions. The error bar is the prediction interval.
Figure 16: Mean Jaccard agreement between answers given by different description styles for o1 answers to PA, IA, INDU, RCC-5, RCC-8 and RCC-22 combined.
    … and nonce words are compared, i.e. different mistakes are made across these differing prompting styles. 4.12. RCC Narrowing to coarser calculi: If we take the RCC-22 answers, we can “collapse” these to RCC-8 – e.g. by converting all the DC relations to plain DC and …
Figure 17: Median output token counts per question by calculus for GPT-5.2 with high reasoning effort (log scale).
    – We also publish an open-source QSTRBenchExtended benchmark comprising 14372 questions and answers designed to probe LLM reasoning capabilities more deeply.
    – In both these data sets we vary description and question styles for each canonical question in order to test model robustness, reliability and …
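
The Figure 16 caption refers to mean Jaccard agreement between the answers produced under different description styles. How the paper aggregates this is not shown on this page, so the following is only a sketch of the usual set-based definition. The dictionary layout (question id mapped to a set of relation names) and the convention that two empty answers agree fully are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two answer sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty answers count as full agreement.
        return 1.0
    return len(a & b) / len(a | b)


def mean_jaccard(answers_a, answers_b):
    """Average Jaccard similarity over the question ids present in both runs.

    answers_a, answers_b: dict mapping a question id to a set of relation names
    (a hypothetical layout chosen for this sketch).
    """
    shared = answers_a.keys() & answers_b.keys()
    if not shared:
        raise ValueError("the two runs have no questions in common")
    return sum(jaccard(answers_a[q], answers_b[q]) for q in shared) / len(shared)
```

A value of 1.0 means the two styles produced identical answer sets on every shared question; lower values indicate that the model makes different mistakes under different presentations.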
Original abstract

We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for QSTR calculi, Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation including prefix/infix, words/symbols/nonce terms and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward, and RCC-22 the most difficult. We release the benchmark, and our results under an open licence to facilitate further assessment of qualitative spatio/temporal reasoning in LLMs.
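
The cardinal direction calculus named in the abstract is described in the paper's Figure 6 caption in terms of the lines running due north and due east of the reference point. A minimal sketch of that projection-based reading follows. It assumes that points are given as (east, north) coordinate pairs, and it uses the label `EQ` for the ninth relation (the two points coincide); that label is an assumption, since the paper's name for it is not shown on this page.

```python
def cdc_relation(x, y):
    """Return the cardinal direction of point x relative to point y.

    Points are (east, north) coordinate pairs (an assumed convention).
    N, S, E, W hold when x lies exactly on the corresponding axis through y;
    NE, NW, SE, SW hold in the open quadrants between the axes.
    """
    d_east = x[0] - y[0]
    d_north = x[1] - y[1]
    north_south = "N" if d_north > 0 else "S" if d_north < 0 else ""
    east_west = "E" if d_east > 0 else "W" if d_east < 0 else ""
    # "EQ" is a hypothetical label for the case where the points coincide.
    return (north_south + east_west) or "EQ"
```

Swapping the arguments yields the converse relation, so `cdc_relation(y, x)` returns `SW` whenever `cdc_relation(x, y)` returns `NE`.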

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces QSTRBench, an open benchmark for assessing LLMs on qualitative spatial and temporal reasoning tasks. It covers compositional reasoning via composition tables, converse relations, and conceptual neighborhoods across calculi including PA, Allen's Interval Algebra, INDU, RCC-5/8/22, the nine-intersection model, cardinal directions, and STAR (with the RCC-22 CN published for the first time). Questions are systematically varied by format (prefix/infix, words/symbols/nonce terms, schematic descriptions). Results for frontier models show above-chance performance that is never perfect and varies sharply by calculus (PA easiest, RCC-22 hardest). The full question set and results are released.

Significance. If the benchmark design holds, the work is significant for the AI reasoning community: it supplies a reproducible, open testbed that directly targets a core capability gap in current LLMs. The systematic inclusion of nonce-term and format variations, together with the release of the complete question set, provides concrete protection against training-data leakage and statistical shortcuts. The observed performance gradient across calculi aligns with the differing sizes and complexities of their composition tables and relation sets, offering a falsifiable baseline for future model improvements.

major comments (1)
  1. §3 (Benchmark Construction): the claim that the nonce-term and format variations isolate genuine qualitative reasoning would be strengthened by an explicit ablation showing that accuracy differences persist when controlling for surface-form familiarity; without this, the central interpretation that models are tested on reasoning rather than pattern matching remains partially open.
minor comments (3)
  1. Table 1 and §4.2: the exact scoring rubric (exact match vs. partial credit for converse or neighborhood questions) should be stated in a single, numbered paragraph so that future replications can match the reported numbers without ambiguity.
  2. §5 (Results): the paper reports that all models exceed chance but none reach ceiling; adding per-calculus chance baselines (derived from the size of the relation set) as an additional column would make the “above guessing” claim immediately verifiable from the table.
  3. The abstract states that the RCC-22 CN is published here for the first time; a short appendix or footnote giving the explicit neighborhood table would be useful for readers who wish to verify the new contribution without consulting external sources.
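
The minor comments on the scoring rubric and on per-calculus chance baselines can be illustrated with a short sketch. The paper's actual rubric and guess-rate definition are not reproduced on this page, so everything below is an assumption: answers are taken to be comma-separated relation names, "strict" is read as exact set equality, and two alternative chance baselines are shown for a calculus with n base relations.

```python
def parse_relations(answer):
    """Split a comma-separated answer such as 'DC, EC' into a set of names."""
    return frozenset(tok.strip() for tok in answer.split(",") if tok.strip())


def strict_correct(predicted, expected):
    """Exact-match scoring: no partial credit for a subset or superset."""
    return parse_relations(predicted) == parse_relations(expected)


def strict_accuracy(pairs):
    """Fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no answers to score")
    return sum(strict_correct(p, e) for p, e in pairs) / len(pairs)


def guess_rate_single(n):
    """Chance of a correct guess when the answer is one of n base relations."""
    return 1 / n


def guess_rate_subset(n):
    """Chance of a correct guess when the answer is any non-empty subset
    of n base relations, each subset being equally likely."""
    return 1 / (2 ** n - 1)
```

Under these assumptions an RCC-8 converse question has a 1/8 chance baseline, while a composition question whose answer may be any non-empty subset of the eight relations has a baseline of 1/255. The gap between the two shows why the rubric and the baseline need to be stated together.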

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review, positive assessment of the benchmark's significance, and recommendation for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): the claim that the nonce-term and format variations isolate genuine qualitative reasoning would be strengthened by an explicit ablation showing that accuracy differences persist when controlling for surface-form familiarity; without this, the central interpretation that models are tested on reasoning rather than pattern matching remains partially open.

    Authors: We appreciate this suggestion and agree that an explicit ablation would provide additional support for interpreting the results as evidence of qualitative reasoning. The current design already incorporates nonce terms, format variations, and schematic descriptions precisely to reduce reliance on surface-form familiarity and training-data leakage, and the sharp performance gradient across calculi (e.g., PA versus RCC-22) is consistent with differences in relation-set size and composition-table complexity rather than lexical familiarity alone. Nevertheless, to strengthen the central claim, we will add a targeted ablation in the revised manuscript that directly compares accuracy on familiar-term versus nonce-term versions of the same underlying questions while holding format and reasoning task fixed. This analysis will be performed on the released question set and reported in §3 and the results section.
    Revision: yes
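
The ablation described in the simulated rebuttal compares familiar-term and nonce-term versions of the same underlying questions. One way such a paired comparison might be tabulated is sketched below. The data layout (question id mapped to a boolean "answered correctly") and the function name are hypothetical, and the sketch makes no claim about how the authors would run the analysis.

```python
def paired_comparison(familiar, nonce):
    """Compare per-question correctness under two presentation styles.

    familiar, nonce: dict mapping a question id to True/False (correct or not).
    Only question ids present in both dicts are compared.
    """
    shared = sorted(familiar.keys() & nonce.keys())
    if not shared:
        raise ValueError("the two conditions have no questions in common")
    n = len(shared)
    return {
        "n": n,
        "familiar_accuracy": sum(familiar[q] for q in shared) / n,
        "nonce_accuracy": sum(nonce[q] for q in shared) / n,
        # Discordant pairs: questions answered correctly in only one style.
        "only_familiar_correct": sum(familiar[q] and not nonce[q] for q in shared),
        "only_nonce_correct": sum(nonce[q] and not familiar[q] for q in shared),
    }
```

The two discordant counts are the informative part: if nonce terms cost accuracy, `only_familiar_correct` should clearly exceed `only_nonce_correct` on the same question set.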

Circularity Check

0 steps flagged

No significant circularity identified

This is an empirical benchmark paper that introduces QSTRBench and reports LLM performance on questions derived from established qualitative spatial/temporal calculi (PA, Allen's IA, RCC variants, etc.). No mathematical derivations, self-referential predictions, or fitted inputs called predictions appear in the work. Results are direct empirical measurements against the released question set and standard composition tables; systematic variations in presentation (prefix/infix, words/symbols, nonce terms) are used to target statistical shortcuts rather than relying on any internal self-definition or self-citation chain. The central claims remain externally verifiable against the calculi definitions and the open benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark is constructed from established QSTR calculi in prior literature with no new free parameters or invented entities; one new conceptual neighbourhood is defined for RCC-22.

axioms (1)
  • domain assumption: Standard definitions, composition tables, and conceptual neighbourhoods for QSTR calculi including RCC-8, Allen's Interval Algebra, and Point Algebra as established in prior literature.
    The benchmark directly applies these pre-existing calculi without re-deriving their properties.

pith-pipeline@v0.9.0 · 5718 in / 1238 out tokens · 38507 ms · 2026-05-20T11:14:57.210013+00:00 · methodology

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 6 internal anchors

  1. [1]

    A. G. Cohn, J. Renz, Qualitative spatial representation and reasoning, in: F. v. Harmelen, V. Lifschitz, B. Porter (Eds.), Handbook of Knowledge Representation, 1, Elsevier, 2007, pp. 551–596

  2. [2]

    J. Chen, A. G. Cohn, D. Liu, S. Wang, J. Ouyang, Q. Yu, A survey of qualitative spatial representations, The Knowledge Engineering Review 30 (2015) 106–136

  3. [3]

    A. G. Cohn, S. M. Hazarika, Qualitative spatial representation and reasoning: An overview, Fundamenta Informaticae 46 (2001) 1–29.
    Footnote 32: https://github.com/RobBlackwell/QSTRBench, accessed May 2026.

  4. [4]

    A Survey of Qualitative Spatial and Temporal Calculi -- Algebraic and Computational Properties

    F. Dylla, J. H. Lee, T. Mossakowski, T. Schneider, A. V. Delden, J. V. D. Ven, D. Wolter, A survey of qualitative spatial and temporal calculi: Algebraic and computational properties, ACM Comput. Surv. 50 (2017). URL: https://doi.org/10.1145/3038927. doi: 10.1145/3038927, available at https://arxiv.org/pdf/1606.00133

  5. [5]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics...

  6. [6]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901

  7. [7]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. ...

  8. [8]

    T. Kwa, B. West, J. Becker, A. Deng, K. Garcia, M. Hasin, S. Jawhar, M. Kinniment, N. Rush, S. V. Arx, R. Bloom, T. Broadley, H. Du, B. Goodrich, N. Jurkovic, L. H. Miles, S. Nix, T. Lin, N. Parikh, D. Rein, L. J. K. Sato, H. Wijk, D. M. Ziegler, E. Barnes, L. Chan, Measuring AI ability to complete long tasks, 2025. URL: https://arxiv.org/abs/2503.14499....

  9. [9]

    A. G. Cohn, B. Bennett, J. Gooday, N. M. Gotts, Qualitative spatial representation and reasoning with the region connection calculus, Geoinformatica 1 (1997) 275–316

  10. [10]

    C. Freksa, Temporal reasoning based on semi-intervals, Artificial intelligence 54 (1992) 199–227

  11. [11]

    D. Randell, A. G. Cohn, Modelling topological and metrical properties in physical processes, in: Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, 1989, pp. 357–368

  12. [12]

    A. G. Cohn, An evaluation of ChatGPT-4’s Qualitative Spatial Reasoning Capabilities in RCC-8, arXiv preprint arXiv:2309.15577, Working notes of QR-23 (2023)

  13. [13]

    A. G. Cohn, R. E. Blackwell, Can large language models reason about the Region Connection Calculus?, 2024. URL: https://arxiv.org/abs/2411.19589. arXiv:2411.19589

  14. [14]

    E.-O. Gardelakos, V. Kyriakopoulos, D.-A. Pantazi, O.-M. Kapopoulos, M. Tsourma, M. Koubarakis, Can large reasoning models reason about spatial relations?, in: Proceedings of the 8th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, 2025, pp. 81–91

  15. [15]

    P. Bellodi, P. Casavecchia, A. Paparella, G. Sciavicco, I. E. Stan, Assessing the (in)ability of LLMs to reason in interval temporal logic, in: 32nd International Symposium on Temporal Representation and Reasoning (TIME 2025), Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2025, pp. 4–1.

  16. [16]

    B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan, J. Yim, J. Palowitch, S. Seo, J. Halcrow, B. Perozzi, Test of time: A benchmark for evaluating LLMs on temporal reasoning, arXiv preprint arXiv:2406.09170 (2024)

  17. [17]

    O. Topsakal, E. Colby, H. Jackson, Evaluating the performance of large language models (LLMs) through grid-based game competitions: An extensible benchmark and leaderboard on the path to artificial general intelligence (AGI), The Journal of Cognitive Systems 9 (2025) 8–19

  18. [18]

    Y. Yamada, Y. Bao, A. K. Lampinen, J. Kasai, I. Yildirim, Evaluating spatial understanding of large language models, 2024. arXiv:2310.14540

  19. [19]

    X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, R. Krishna, Blink: Multimodal large language models can see but not perceive, in: European Conference on Computer Vision, Springer, 2024, pp. 148–166

  20. [20]

    H. Yin, Z. Lin, X. Liu, B. Sun, K. Li, Do multimodal language models really understand direction? a benchmark for compass direction reasoning, in: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–5

  21. [21]

    A. G. Cohn, R. E. Blackwell, Evaluating the ability of large language models to reason about cardinal directions, revisited, 2025. URL: https://arxiv.org/abs/2507.12059. arXiv:2507.12059, accepted at the 38th International Workshop on Qualitative Reasoning (QR 2025), co-located with IJCAI

  22. [22]

    S. Xie, S.-L. Hsu, Q. Zhang, Y. Gao, C. Shahabi, I. Sabek, Evaluating intrinsic geospatial topological reasoning in LLMs, in: Proceedings of the 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence, 2025, pp. 43–48

  23. [23]

    A. G. Cohn, J. Hernandez-Orallo, Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs, arXiv preprint arXiv:2304.11164 (2023)

  24. [24]

    F. Li, D. C. Hogg, A. G. Cohn, Advancing spatial reasoning in large language models: An in-depth evaluation and enhancement using the stepgame benchmark, Proceedings of the AAAI Conference on Artificial Intelligence 38 (2024) 18500–18507. URL: https://ojs.aaai.org/index.php/AAAI/article/view/29811. doi: 10.1609/aaai.v38i17.29811

  25. [25]

    Z. Shi, Q. Zhang, A. Lipani, StepGame: A new benchmark for robust multi-hop spatial reasoning in texts, in: Proc. AAAI, volume 36, 2022, pp. 11321–11329

  26. [26]

    DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

    L. McPheat, N. Kaur, R. Blackwell, A. Russo, A. G. Cohn, P. Madhyastha, DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning, 2025. URL: https://arxiv.org/abs/2511.02627. arXiv:2511.02627

  27. [27]

    R. Mirzaee, H. Rajaby Faghihi, Q. Ning, P. Kordjamshidi, SPARTQA: A textual question answering benchmark for spatial reasoning, in: Proc. NAACL, 2021, pp. 4582–4598

  28. [28]

    Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

    J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. Van Merriënboer, A. Joulin, T. Mikolov, Towards AI-complete question answering: A set of prerequisite toy tasks, arXiv preprint arXiv:1502.05698 (2015)

  29. [29]

    Z. Cai, B. Chang, W. Han, Human-in-the-loop through chain-of-thought, arXiv preprint arXiv:2306.07932 (2023)

  30. [30]

    A. Isli, A. G. Cohn, A new approach to cyclic ordering of 2d orientations using ternary relation algebras, Artificial Intelligence 122 (2000) 137–

  31. [31]

    Available at https://www.sciencedirect.com/science/article/pii/S0004370200000448/pdf?md5=555f1a9e6f8a6567d9f08f607b7dc7a2&pid=1-s2.0-S0004370200000448-main.pdf

  32. [32]

    C. Freksa, Using orientation information for qualitative spatial reasoning, in: Theories and Methods of Spatio-Temporal Reasoning in Geographic Space: International Conference GIS-From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning Pisa, Italy, September 21–23, 1992 Proceedings, Springer, 2005, pp. 162–178. Available at https...

  33. [33]

    Z. Gantner, M. Westphal, S. Wölfl, GQR - a fast reasoner for binary qualitative constraint calculi, in: AAAI Workshop on Spatial and Temporal Reasoning, AAAI Chicago (IL), 2008, p. 6. Available at https://cdn.aaai.org/Workshops/2008/WS-08-11/WS08-11-004.pdf

  34. [34]

    D. Wolter, SparQ – A Spatial Reasoning Toolbox, in: AAAI Spring Symposium: Benchmarking of Qualitative Spatial and Temporal Reasoning Systems, 2009, p. 53. Available at https://cdn.aaai.org/Symposia/Spring/2009/SS-09-02/SS09-02-012.pdf

  35. [35]

    M. J. Egenhofer, D. M. Mark, J. Herring, The 9-intersection: Formalism and its use for natural-language spatial predicates (94-1), Technical Report 94-1, National Center for Geographic Information and Analysis,

  36. [36]

    URL: https://escholarship.org/content/qt5nj6647c/qt5nj6647c.pdf

  37. [37]

    A. U. Frank, Qualitative spatial reasoning: Cardinal directions as an example, International Journal of Geographical Information Science 10 (1996) 269–290. Available at https://www.frank.gerastree.at/PublicationList/resources/docs/docsH/ijgis-frank.pdf

  38. [38]

    J. Renz, D. Mitra, et al., Qualitative direction calculi with arbitrary granularity, in: PRICAI, volume 3157, 2004, pp. 65–74

  39. [39]

    A. G. Cohn, R. E. Blackwell, Evaluating the Ability of Large Language Models to Reason About Cardinal Directions, in: B. Adams, A. L. Griffin, S. Scheider, G. McKenzie (Eds.), 16th International Conference on Spatial Information Theory (COSIT 2024), volume 315 of Leibniz International Proceedings in Informatics (LIPIcs) , Schloss Dagstuhl – Leibniz-Zentrum ...

  40. [40]

    J. F. Allen, Maintaining knowledge about temporal intervals, Communications of the ACM 26 (1983) 832–843. Available at https://dl.acm.org/doi/pdf/10.1145/182.358434

  41. [41]

    D. Randell, Z. Cui, A. G. Cohn, A spatial logic based on regions and connection, in: 3rd International Conference on Knowledge Representation and Reasoning, 1992, volume 92, 1992, pp. 165–176.

  42. [42]

    M. B. Vilain, H. A. Kautz, Constraint propagation algorithms for temporal reasoning, in: AAAI, volume 86, 1986, pp. 377–382. Available at https://cdn.aaai.org/AAAI/1986/AAAI86-063.pdf

  43. [43]

    B. Bennett, Spatial reasoning with propositional logics, in: Principles of Knowledge Representation and Reasoning, Elsevier, 1994, pp. 51–62. Available at https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=4c45519c2db0dac5ceaa76e1b53b1ca3c0bfce00

  44. [44]

    P. Jonsson, T. Drakengren, A complete classification of tractability in RCC-5, Journal of Artificial Intelligence Research 6 (1997) 211–221. Available at https://www.jair.org/index.php/jair/article/download/10187/24187/

  45. [45]

    M. J. Egenhofer, Deriving the composition of binary topological relations, Journal of Visual Languages & Computing 5 (1994) 133–149. Available at https://www.academia.edu/download/47964251/Deriving_the_Composition_of_Binary_Topol20160810-4913-vrxew6.pdf

  46. [46]

    Z. Cui, A. G. Cohn, D. A. Randell, Qualitative and topological relationships in spatial databases, in: D. Abel, B. Chin Ooi (Eds.), Advances in Spatial Databases, Springer Berlin Heidelberg, Berlin, Heidelberg, 1993, pp. 296–315

  47. [47]

    A. K. Pujari, G. Vijaya Kumari, A. Sattar, INDU: An interval & duration network, in: Australasian Joint Conference on Artificial Intelligence, Springer, 1999, pp. 291–303. Available at https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=11328a30997060552f8971c599bde8ea6d581d21

  48. [48]

    C. Schlieder, Reasoning about ordering, in: International conference on spatial information theory, Springer, 1995, pp. 341–349

  49. [49]

    R. Moratz, D. Lücke, T. Mossakowski, A condensed semantics for qualitative spatial reasoning about oriented straight line segments, Artificial Intelligence 175 (2011) 2099–2127

  50. [50]

    G. F. Ligozat, Qualitative triangulation for spatial reasoning, in: European Conference on Spatial Information Theory, Springer, 1993, pp. 54–68. Available at https://link.springer.com/chapter/10.1007/3-540-57207-4_5

  51. [51]

    R. Moratz, F. Dylla, L. Frommberger, A relative orientation algebra with adjustable granularity, in: Proceedings of the workshop on agents in real-time and dynamic environments (IJCAI 05), volume 21, 2005, p. 22

  52. [52]

    E. Clementini, P. Di Felice, D. Hernández, Qualitative representation of positional information, Artificial Intelligence 95 (1997) 317–356. Available at https://www.sciencedirect.com/science/article/pii/S0004370297000465/pdf?md5=be67a5e4a7057f94a25879a9f7c5b076&pid=1-s2.0-S0004370297000465-main.pdf&_valck=1

  53. [53]

    D. Hernández, E. Clementini, P. Di Felice, Qualitative distances, in: Spatial Information Theory A Theoretical Basis for GIS: International Conference COSIT’95 Semmering, Austria, September 21–23, 1995 Proceedings 2, Springer, 1995, pp. 45–57

  54. [54]

    R. Moratz, Extending binary qualitative direction calculi with a granular distance concept: Hidden feature attachment, arXiv preprint arXiv:1012.5960 (2010)

  55. [55]

    H. W. Guesgen, Spatial reasoning based on Allen’s temporal logic, ICSI (1989)

  56. [56]

    P. Balbiani, J.-F. Condotta, L. F. Del Cerro, Tractability results in the block algebra, Journal of Logic and Computation 12 (2002) 885–909. Available at https://academic.oup.com/logcom/article-pdf/12/5/885/3852916/120885.pdf

  57. [57]

    C. Köhler, The occlusion calculus, in: Cognitive vision workshop, Citeseer, 2002, pp. 420–450. Available at https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=21f52b4007e25b30267b532d22a74995ba8dcc48

  58. [58]

    N. Van de Weghe, B. Kuijpers, P. Bogaert, P. De Maeyer, A qualitative trajectory calculus and the composition of its relations, in: International Conference on GeoSpatial Semantics, Springer, 2005, pp. 60–76

  59. [59]

    M. Ragni, A. Scivos, Dependency calculus: Reasoning in a general point relation algebra, in: Annual Conference on Artificial Intelligence, Springer, 2005, pp. 49–63

  60. [60]

    M. Broxvall, P. Jonsson, Point algebras for temporal reasoning: Algorithms and complexity, Artificial Intelligence 149 (2003) 179–220

  61. [61]

    F. Dylla, An agent control perspective on qualitative spatial reasoning: Towards more intuitive spatial agent development. vol. 320, 2008

  62. [62]

    D. Crystal, The Cambridge encyclopedia of the English language, Cambridge University Press, 2018

  63. [63]

    E. Chang, M. Paltenghi, Y. Li, P.-J. Lin, C. Zhao, P. Huber, Z. Liu, R. Rabatin, Y. Shi, V. Chandra, Scaling parameter-constrained language models with quality data, arXiv preprint arXiv:2410.03083 (2024)

  64. [65]

    L. Li, L. Sleem, G. Nichil, R. State, et al., Exploring the impact of temperature on large language models: Hot or cold?, Procedia Computer Science 264 (2025) 242–251

  65. [66]

    R. Burnell, W. Schellaert, J. Burden, T. D. Ullman, F. Martinez-Plumed, J. B. Tenenbaum, D. Rutar, L. G. Cheke, J. Sohl-Dickstein, M. Mitchell, D. Kiela, M. Shanahan, E. M. Voorhees, A. G. Cohn, J. Z. Leibo, J. Hernandez-Orallo, Rethink reporting of evaluation results in AI, Science 380 (2023) 136–138

  66. [67]

    R. E. Blackwell, J. Barry, A. G. Cohn, Towards reproducible LLM evaluation: Quantifying uncertainty in LLM benchmark scores, arXiv preprint arXiv:2410.03492 (2024)

  67. [68]

    A. Cohn, J. Gooday, B. Bennett, A comparison of structures in spatial and temporal logics, in: Philosophy and the Cognitive Sciences, R. Casati, G. White (eds.), Holder-Pichler-Temp, 1994

  68. [69]

    G. É. Ligozat, Reasoning about cardinal directions, Journal of Visual Languages & Computing 9 (1998) 23–44

  69. [70]

    M. Ragni, B. Tseden, M. Knauff, Cross-cultural similarities in topological reasoning, in: Spatial Information Theory: 8th International Conference, COSIT 2007, Springer, 2007, pp. 32–46. Available at http://geosensor.net/cositprivate/65.pdf

  70. [71]

    M. J. Egenhofer, J. Sharma, D. M. Mark, et al., A critical comparison of the 4-intersection and 9-intersection models for spatial relations: formal analysis, in: Autocarto-Conference, ASPRS American Society for Photogrammetry, 1993, pp. 1–1

  71. [72]

    K. Leyton-Brown, Y. Shoham, Understanding understanding: A pragmatic framework motivated by large language models, arXiv preprint arXiv:2406.10937 (2024)

  72. [73]

    P. Belcak, G. Heinrich, S. Diao, Y. Fu, X. Dong, S. Muralidharan, Y. C. Lin, P. Molchanov, Small Language Models are the Future of Agentic AI. URL: https://arxiv.org/abs/2506.02153. arXiv:2506.02153

  74. [75]

    Y. Zheng, Y. Chen, B. Qian, X. Shi, Y. Shu, J. Chen, A review on edge large language models: Design, execution, and applications, ACM Computing Surveys 57 (2025) 1–35

  75. [76]

    R. E. Blackwell, A. G. Cohn, RCC-8 as a benchmark for diagrammatic reasoning in multimodal foundation models, in: Proc. COSIT, 2026, to appear

  76. [77]

    F. Li, D. Hogg, A. Cohn, Reframing spatial reasoning evaluation in language models: A real-world simulation benchmark for qualitative reasoning, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence, 2024, pp. 6342–6349

  77. [78]

    T. Drakengren, P. Jonsson, A complete classification of tractability in RCC-5, Journal of Artificial Intelligence Research 6 (1997) 211–221

  78. [79]

    J. Renz, B. Nebel, On the complexity of qualitative spatial reasoning: A maximal tractable fragment of the region connection calculus, Artificial Intelligence 108 (1999) 69–123

  79. [80]

    A. Galton, Qualitative spatial change, Oxford University Press, 2000.

Appendix A. Example prompts for RCC-8

Appendix A.1. Text symbol prefix

You are a helpful assistant who answers questions about qualitative spatial and temporal calculi. The Region Connection Calculus (RCC-8) is a qualitative spatial calculus for representing and reasoning about spat...