QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fail to achieve consistent correctness.
arXiv preprint arXiv:2306.07932 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.
citing papers explorer
-
QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fail to achieve consistent correctness.
-
Large Language Models are not Fair Evaluators
LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.