arXiv preprint arXiv:2506.22005
2 Pith papers cite this work, both from 2026.
Representative citing papers:
- DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models
DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polynomial-time checks.
- Mapping Mathematical Hardness: Machine-Assisted Conjecture Discovery and the Quantification of Non-Triviality
Introduces a Mahalanobis-distance benchmark in conjecture embedding space to quantify non-triviality of AI-generated mathematical conjectures and flag potential errors.