arXiv preprint arXiv:2002.05867 , year=

Peter Clark, Oyvind Tafjord, Kyle Richardson · 2020 · arXiv 2002.05867

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04 · unverdicted · novelty 7.0

DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.

HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

cs.CL · 2025-07-21 · unverdicted · novelty 4.0

LLM accuracy on reasoning tasks differs significantly by question type, with step-by-step reasoning accuracy often uncorrelated to final answer selection.

citing papers explorer

Showing 3 of 3 citing papers.

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning cs.AI · 2025-11-04 · unverdicted · none · ref 6
DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory cs.AI · 2026-05-07 · unverdicted · none · ref 40
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked? cs.CL · 2025-07-21 · unverdicted · none · ref 5
LLM accuracy on reasoning tasks differs significantly by question type, with step-by-step reasoning accuracy often uncorrelated to final answer selection.

arXiv preprint arXiv:2002.05867 , year=

fields

years

verdicts

representative citing papers

citing papers explorer