HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.
Proofwriter: Generating implications, proofs, and abductive statements over natural language
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.AI 3years
2026 3representative citing papers
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
QMFOL generates monadic first-order logic tasks with controllable complexity via pattern-based structures and round-trip prover verification, then evaluates six LRMs showing performance drops as logical depth and width increase.
citing papers explorer
-
HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.
-
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
-
QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation
QMFOL generates monadic first-order logic tasks with controllable complexity via pattern-based structures and round-trip prover verification, then evaluates six LRMs showing performance drops as logical depth and width increase.