pith. sign in

arxiv: 2607.01223 · v1 · pith:VTRXPWIZnew · submitted 2026-07-01 · 💻 cs.AI · cs.CL· cs.LG· cs.LO· cs.SE

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

Pith reviewed 2026-07-02 12:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LGcs.LOcs.SE
keywords verificationinformal reasoningAI safetyproof tracesstate transitionsadversarial testingLLM judgeshidden premises
0
0 comments X

The pith

Theoria verifies informal AI answers by rewriting them into sequences of auditable state transitions each backed by explicit justifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Theoria as an architecture that converts a candidate solution into a chain of typed state transitions, where each change must be licensed by a citation, computation, or problem fact. This enforces completeness of change so that any unlicensed mutation appears as a hidden premise rather than passing undetected. On 185 expert problems it certifies 105 at 91.4 percent strict precision while producing human-readable traces open to independent challenge at every step. Structured application of the method detects 94.7 percent of adversarial poisoned proofs, outperforming holistic judges especially on hidden premises and fabricated citations. The result supplies an intermediate verification layer between opaque scalar scores and full formal proofs.

Core claim

Theoria rewrites candidate solutions into sequences of typed state transitions each licensed by an explicit justification, enforcing the completeness-of-change invariant so every difference between consecutive states is accounted for. On HLE-Verified Gold it certifies 105 of 185 problems at 91.4 percent strict precision with Wilson interval [84.5 percent, 95.4 percent] and produces human-readable traces. On 95 adversarial poisoned proofs across 15 domains, structured judges reach 94.7 percent detection versus 83.2 percent for holistic judging, the 11.5-point gap concentrated in hidden premises and fabricated citations as predicted by the formal analysis.

What carries the argument

Rewrite-acceptability verification that decomposes informal reasoning into typed state transitions and enforces completeness of change so every mutation requires an explicit justification.

If this is right

  • Every certification yields a human-readable trace in which each step can be independently challenged.
  • Structured judges detect hidden premises at 90.6 percent versus 62.5 percent for holistic judges.
  • Fabricated citations are caught at 100 percent versus 90 percent.
  • The method and holistic judging succeed on largely disjoint sets of problems.
  • Certified precision reaches 97.1 percent on the GPQA Diamond subset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same state-transition discipline could be applied to longer multi-step tasks where coherence failures accumulate.
  • Traces might support targeted human editing by isolating the exact transition that introduced an error.
  • Combining the rewrite layer with existing formal proof assistants could raise coverage on problems that partially admit formalization.
  • The approach suggests a route to auditing reasoning chains that remain too informal for current formal tools.

Load-bearing premise

The completeness-of-change invariant can be reliably enforced during the rewriting process without missing or misclassifying hidden premises in informal reasoning.

What would settle it

Manual inspection of a sample of Theoria-certified traces that reveals a substantial fraction containing undetected hidden premises or fabricated citations.

read the original abstract

When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p= 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n= 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Theoria, a verification architecture that rewrites AI candidate solutions into sequences of typed state transitions with explicit justifications (citations, computations, or problem-given facts), enforcing a 'completeness of change' invariant to ensure every difference between consecutive proof states is accounted for, thereby surfacing hidden premises. It reports certifying 105 out of 185 HLE-Verified Gold problems at 91.4% strict precision (Wilson CI [84.5%, 95.4%]), with human-readable proof traces; complementary performance to holistic LLM judges (Jaccard 0.14-0.36); and superior detection on 95 adversarial poisoned proofs (94.7% vs 83.2%, p=0.0017), with a 28 pp advantage on hidden premises.

Significance. If the rewriting process reliably enforces the invariant without missing hidden premises, the approach could provide a valuable auditable alternative to both formal proof systems and scalar LLM judges. The empirical results on expert-level benchmarks (HLE, GPQA) and the targeted advantage on predicted error classes (hidden premises, fabricated citations) in adversarial testing lend support to the method's potential. The generation of independently challengeable proof traces is a clear strength, as is the use of external benchmarks without apparent parameter fitting.

major comments (2)
  1. Abstract: Reports concrete precision numbers (91.4% on HLE-Verified Gold) and statistical comparisons (p=0.0017, 28 pp gap on hidden premises) but provides no description of the rewriting algorithm, typing mechanism, or how states are constructed, leaving the central claim of enforcing the completeness of change invariant without visible derivation support.
  2. Verification architecture description: The foundational invariant of completeness of change (every difference between consecutive proof states must be accounted for by an explicit justification) is presented as reliably enforceable in informal text, but without a formal definition of state differences, a decision procedure, or inter-annotator study, it is unclear how the reported advantages (e.g., 90.6% vs 62.5% on hidden premises) are achieved; this is load-bearing for interpreting the Wilson CIs and adversarial results.
minor comments (2)
  1. Results section: The Jaccard overlap values (0.14-0.36) between Theoria and holistic judges are reported but without specifying the exact problems or calculation method.
  2. Notation for typed state transitions could be clarified with a concrete example early in the paper to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where greater clarity would strengthen the manuscript. We address each major comment below and outline targeted revisions to improve the presentation of the core architecture while preserving the empirical contributions.

read point-by-point responses
  1. Referee: Abstract: Reports concrete precision numbers (91.4% on HLE-Verified Gold) and statistical comparisons (p=0.0017, 28 pp gap on hidden premises) but provides no description of the rewriting algorithm, typing mechanism, or how states are constructed, leaving the central claim of enforcing the completeness of change invariant without visible derivation support.

    Authors: The abstract is deliberately concise to comply with length constraints and to focus on results. The rewriting algorithm, typing of state transitions, and state construction procedure are described in detail in Section 3 of the manuscript. To improve accessibility, we will revise the abstract to include a brief clause such as 'via rewriting into sequences of typed state transitions licensed by explicit justifications' that directly references the mechanisms supporting the completeness-of-change invariant. revision: yes

  2. Referee: Verification architecture description: The foundational invariant of completeness of change (every difference between consecutive proof states must be accounted for by an explicit justification) is presented as reliably enforceable in informal text, but without a formal definition of state differences, a decision procedure, or inter-annotator study, it is unclear how the reported advantages (e.g., 90.6% vs 62.5% on hidden premises) are achieved; this is load-bearing for interpreting the Wilson CIs and adversarial results.

    Authors: The manuscript presents the invariant through the operational rewriting process and validates it empirically via the adversarial poisoned-proof experiments, which were explicitly constructed to probe the predicted error classes (hidden premises and fabricated citations). The 28 pp advantage on hidden premises is therefore tied directly to the architecture's requirement for explicit justifications. We agree that a formal definition of state differences and a decision procedure would strengthen the theoretical section; we will add a new subsection providing these. An inter-annotator study on the rewriting process was not performed; we will explicitly note this as a limitation and indicate it as future work, while retaining the adversarial results as the primary empirical support for the reported advantages. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical benchmarks

full rationale

The paper presents Theoria as an architecture enforcing the completeness-of-change invariant during rewriting of informal proofs into auditable state transitions. All reported results (105/185 certifications at 91.4% precision on HLE-Verified Gold, 97.1% on GPQA Diamond, 94.7% vs 83.2% on adversarial set) are direct empirical measurements on held-out external datasets with no fitted parameters, no predictions derived from the same data used for tuning, and no load-bearing self-citations. The invariant itself is stated as an explicit design choice rather than derived from prior results by the same authors. No equation or step reduces a claimed outcome to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review; only the completeness-of-change invariant is explicitly named as foundational. No free parameters or invented entities beyond the system name itself are described.

axioms (1)
  • domain assumption Completeness of change: every difference between consecutive proof states must be accounted for by an explicit justification (citation, computation, or problem-given fact).
    Stated as the foundational invariant in the abstract.
invented entities (1)
  • Theoria verification architecture no independent evidence
    purpose: Rewrite candidate solutions into auditable typed state transitions
    New system introduced by the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5878 in / 1465 out tokens · 29526 ms · 2026-07-02T12:10:14.130656+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 10 internal anchors

  1. [1]

    Lean project.https://lean-lang.org/. H. Xin et al. DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data.arXiv:2405.14333,

  2. [2]

    Advancing Mathematics Research with AI-Driven Formal Proof Search

    G. Tsoukalas et al. Advancing mathematics research with AI-driven formal proof search. arXiv:2605.22763,

  3. [3]

    Aristotle: IMO-level Automated Theorem Proving

    19 T. Achim et al. Aristotle: IMO-level automated theorem proving.arXiv:2510.01346,

  4. [4]

    Chen et al

    E. Chen et al. Dead ends in square-free digit walks.arXiv:2602.05095,

  5. [5]

    Let's Verify Step by Step

    (Proof fully formalized in Lean/Mathlib and produced automatically by AxiomProver.) H. Lightman et al. Let’s verify step by step.arXiv:2305.20050,

  6. [6]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    P. Wang et al. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv:2312.08935,

  7. [7]

    Yuan et al

    L. Yuan et al. Free process rewards without process labels.arXiv:2412.01981,

  8. [8]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.arXiv:2306.05685,

  9. [9]

    A Survey on LLM-as-a-Judge

    J. Gu et al. A survey on LLM-as-a-judge.arXiv:2411.15594,

  10. [10]

    Ling et al

    Z. Ling et al. Deductive verification of chain-of-thought reasoning.arXiv:2306.03872,

  11. [11]

    Zhou and L

    K. Zhou and L. Zhang. Step-wise formal verification for LLM-based mathematical problem solving (MATH-VF).arXiv:2505.20869,

  12. [12]

    J. H. Kirchner et al. Prover-verifier games improve legibility of LLM outputs.arXiv:2407.13692,

  13. [13]

    Reinforced Self-Training (ReST) for Language Modeling

    C. Gulcehre et al. Reinforced self-training (ReST) for language modeling.arXiv:2308.08998,

  14. [14]

    Zhang et al

    D. Zhang et al. ReST-MCTS*: LLM self-training via process reward guided tree search. arXiv:2406.03816,

  15. [15]

    Humanity's Last Exam

    L. Phan et al. Humanity’s last exam.arXiv:2501.14249,

  16. [16]

    Zhai et al

    W. Zhai et al. HLE-Verified: A systematic verification and structured revision of Humanity’s Last Exam.arXiv:2602.13964,

  17. [17]

    Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

    W. Yuan et al. Verify before you commit: Towards faithful reasoning in LLM agents via self-auditing (SAVeR).arXiv:2604.08401,

  18. [18]

    Fang et al

    J. Fang et al. Graph of verification: Structured verification of LLM reasoning with directed acyclic graphs.arXiv:2506.12509,

  19. [19]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    D. Rein et al. GPQA: A graduate-level Google-proof Q&A benchmark.arXiv:2311.12022,