CLUTRR: A diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177
3 Pith papers cite this work.
citing papers explorer
- DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
  DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.
- Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
  Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
  ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.