CLUTRR: A diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, William L. Hamilton · 2019 · arXiv 1908.06177

3 Pith papers cite this work.

representative citing papers

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04 · unverdicted · novelty 7.0

DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

cs.CL · 2023-02-08 · accept · novelty 6.0

ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction but scores only 63.41 percent on reasoning categories and generates extrinsic hallucinations from its training data.

citing papers explorer

Showing 3 of 3 citing papers.

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04 · unverdicted · none · ref 20

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · none · ref 23

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

cs.CL · 2023-02-08 · accept · none · ref 23
