Pith Number

pith:FERP7UN3

pith:2026:FERP7UN3IEHUJRI2Y3JU3QG6UR

not attested · not anchored · not stored · refs resolved

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner

CommonWhy introduces 15,000 why questions that test whether LLMs can combine specific entity facts with causal commonsense inference

arxiv:2605.12918 v1 · 2026-05-13 · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{FERP7UN3IEHUJRI2Y3JU3QG6UR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.

C2 weakest assumption

The questions in CommonWhy require genuine integration of entity facts with causal commonsense reasoning rather than being solvable through superficial patterns learned during training.

C3 one-line summary

CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.

References

70 extracted · 70 resolved · 3 Pith anchors

[1] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic eval 2005

[2] CommAI: Evaluating the first steps towards a useful general AI 2017 · arXiv:1701.08954

[3] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural langua 2013

[4] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language 2013

[5] A is B” fail to learn “B is A 2024

Receipt and verification

First computed	2026-05-18T03:09:10.302646Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

2922ffd1bb410f44c51ac6d34dc0dea4423380b83f7a65ca7895e1f8f0b93256

Aliases

arxiv: 2605.12918 · arxiv_version: 2605.12918v1 · doi: 10.48550/arxiv.2605.12918 · pith_short_12: FERP7UN3IEHU · pith_short_16: FERP7UN3IEHUJRI2 · pith_short_8: FERP7UN3
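
The short aliases above are plain prefixes of the 26-character identifier. On this record, the identifier also matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That second relationship is an observation from the values shown on this page, not a rule stated here, so the sketch below treats it as an assumption. The function names `pith_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "2922ffd1bb410f44c51ac6d34dc0dea4423380b83f7a65ca7895e1f8f0b93256"
PITH_NUMBER = "FERP7UN3IEHUJRI2Y3JU3QG6UR"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: Base32-encode the raw digest and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith: str) -> dict:
    """Build the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}


if __name__ == "__main__":
    # Compare the assumed derivation against the identifier on this page.
    print(pith_from_hash(CANONICAL_HASH) == PITH_NUMBER)
    print(short_aliases(PITH_NUMBER))
```

If the derivation holds for other records as well, a client could check that an identifier and a hash belong together before contacting the resolver.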

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/FERP7UN3IEHUJRI2Y3JU3QG6UR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2922ffd1bb410f44c51ac6d34dc0dea4423380b83f7a65ca7895e1f8f0b93256

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "9592c8faa0e8b886c12b3b87214f6e419c5f622aa759dae31d6810543695b969",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T02:47:21Z",
    "title_canon_sha256": "3abb3e7e165af69dcb2ed63b40b440f667de36b49ab7f513f7642a9b5477c766"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12918",
    "kind": "arxiv",
    "version": 1
  }
}
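
The hash check can also be run offline against the record shown above. This is a minimal sketch that mirrors the canonicalization in the curl command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). The function name `canonical_sha256` is hypothetical, and it is an assumption that the record as displayed here is exactly what the resolver returns.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "9592c8faa0e8b886c12b3b87214f6e419c5f622aa759dae31d6810543695b969",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T02:47:21Z",
    "title_canon_sha256": "3abb3e7e165af69dcb2ed63b40b440f667de36b49ab7f513f7642a9b5477c766"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12918",
    "kind": "arxiv",
    "version": 1
  }
}
"""

# Canonical hash listed under "Canonical hash" on this page.
EXPECTED = "2922ffd1bb410f44c51ac6d34dc0dea4423380b83f7a65ca7895e1f8f0b93256"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(json.loads(RECORD_JSON))
    print(digest)
    print("matches expected:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the input.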