Pith Number

pith:WG6IAFIC

pith:2026:WG6IAFICGEJCZKH7N3ZW2HV47H

not attested · not anchored · not stored · refs resolved

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

Konstantine Arkoudas, Serafim Batzoglou

Frontier LLMs perform well on foundational proof tasks but fail at those requiring global combinatorial reasoning or low-level proof synthesis.

arxiv:2605.12524 v1 · 2026-04-07 · cs.LO · cs.AI

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{WG6IAFICGEJCZKH7N3ZW2HV47H}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

✓ Portable graph bundle: live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
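
The merge algorithm itself is not described on this page, so the following is only a minimal sketch of what "deterministic" could mean here: any two mirrors holding the same set of events arrive at the same state, regardless of the order in which the events were received. The `Event` fields, the `(timestamp, event_id)` ordering, and the last-writer-wins rule are all assumptions made for illustration, not the actual Pith schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical fields; the real signed-event schema is not shown on this page.
    event_id: str
    timestamp: str  # ISO 8601 UTC, so string order matches chronological order
    field: str
    value: str


def merge_state(events):
    """Fold events into a state dict using a fixed total order."""
    state = {}
    # Drop duplicates, then sort by (timestamp, event_id) so every mirror
    # applies the same events in the same order.
    for ev in sorted(set(events), key=lambda e: (e.timestamp, e.event_id)):
        state[ev.field] = ev.value  # assumed rule: the latest event wins
    return state


events = [
    Event("e1", "2026-05-18T03:10:02Z", "citations", "open"),
    Event("e2", "2026-05-19T00:00:00Z", "citations", "closed"),
    Event("e3", "2026-05-18T04:00:00Z", "replications", "open"),
]

# Arrival order and duplicate delivery do not change the result.
state_a = merge_state(events)
state_b = merge_state(list(reversed(events)) + events)
```

Sorting on a key that is unique per event is what removes the dependence on arrival order. A real implementation would also verify each event's signature before applying it.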

Claims

C1: strongest claim

Frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved.

C2: weakest assumption

That success or failure on these minimal-formalism proof tasks provides a meaningful signal of general reasoning competence independent of domain knowledge, solver delegation, or long-context artifacts.

C3: one-line summary

ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

References

143 extracted · 143 resolved · 7 Pith anchors

Formal links

1 machine-checked theorem link

Receipt and verification

First computed	2026-05-18T03:10:02.821690Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

b1bc80150231122ca8ff6ef36d1ebcf9f54b3e75056f59a6b56fc9cbcc40a85d

Aliases

arxiv: 2605.12524 · arxiv_version: 2605.12524v1 · doi: 10.48550/arxiv.2605.12524 · pith_short_12: WG6IAFICGEJC · pith_short_16: WG6IAFICGEJCZKH7 · pith_short_8: WG6IAFIC
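
The identifiers on this page appear to be related in a simple way. The 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each `pith_short_N` alias is the first N characters of that number. This is an observation from the values shown here, not a documented rule, and the function names below are hypothetical. A minimal sketch:

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "b1bc80150231122ca8ff6ef36d1ebcf9f54b3e75056f59a6b56fc9cbcc40a85d"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.

    The 26-character length is inferred from this record, not from a spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Derive the short aliases as plain prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


pith = pith_number_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith)
print(pith)     # WG6IAFICGEJCZKH7N3ZW2HV47H
print(aliases)
```

If the relationship holds in general, a short alias can be checked offline against a known canonical hash without calling the resolver.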

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/WG6IAFICGEJCZKH7N3ZW2HV47H \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b1bc80150231122ca8ff6ef36d1ebcf9f54b3e75056f59a6b56fc9cbcc40a85d

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "a9ebcdcf0dfeb1ed6e7425eadb8b206ec358c2bc2a8025fc1768b7a3799cb160",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LO",
    "submitted_at": "2026-04-07T01:19:41Z",
    "title_canon_sha256": "2a6ed2a57efe72d9274f01e470584a276bc3a7e762df172814a99a3ce8ff98ca"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12524",
    "kind": "arxiv",
    "version": 1
  }
}
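
The hashing step in the verification command can also be run offline against a local copy of the record. The sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping) to the record as displayed above. The function name `canonical_hash` is hypothetical. The served `canonical_record` may differ from what is displayed here, so compare the digest with the published canonical hash rather than assuming they match.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 over a key-sorted, whitespace-free JSON serialization."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no spaces after commas or colons
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The canonical record exactly as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "a9ebcdcf0dfeb1ed6e7425eadb8b206ec358c2bc2a8025fc1768b7a3799cb160",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LO",
        "submitted_at": "2026-04-07T01:19:41Z",
        "title_canon_sha256": "2a6ed2a57efe72d9274f01e470584a276bc3a7e762df172814a99a3ce8ff98ca",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12524", "kind": "arxiv", "version": 1},
}

digest = canonical_hash(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, two mirrors that store the same record with different key order or whitespace still compute the same digest.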