Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

Ben Slivinski; Michael Saldivar

arxiv: 2607.01223 · v1 · pith:VTRXPWIZnew · submitted 2026-07-01 · 💻 cs.AI · cs.CL· cs.LG· cs.LO· cs.SE

Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

Ben Slivinski , Michael Saldivar This is my paper

Pith reviewed 2026-07-02 12:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LGcs.LOcs.SE

keywords verificationinformal reasoningAI safetyproof tracesstate transitionsadversarial testingLLM judgeshidden premises

0 comments

The pith

Theoria verifies informal AI answers by rewriting them into sequences of auditable state transitions each backed by explicit justifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Theoria as an architecture that converts a candidate solution into a chain of typed state transitions, where each change must be licensed by a citation, computation, or problem fact. This enforces completeness of change so that any unlicensed mutation appears as a hidden premise rather than passing undetected. On 185 expert problems it certifies 105 at 91.4 percent strict precision while producing human-readable traces open to independent challenge at every step. Structured application of the method detects 94.7 percent of adversarial poisoned proofs, outperforming holistic judges especially on hidden premises and fabricated citations. The result supplies an intermediate verification layer between opaque scalar scores and full formal proofs.

Core claim

Theoria rewrites candidate solutions into sequences of typed state transitions each licensed by an explicit justification, enforcing the completeness-of-change invariant so every difference between consecutive states is accounted for. On HLE-Verified Gold it certifies 105 of 185 problems at 91.4 percent strict precision with Wilson interval [84.5 percent, 95.4 percent] and produces human-readable traces. On 95 adversarial poisoned proofs across 15 domains, structured judges reach 94.7 percent detection versus 83.2 percent for holistic judging, the 11.5-point gap concentrated in hidden premises and fabricated citations as predicted by the formal analysis.

What carries the argument

Rewrite-acceptability verification that decomposes informal reasoning into typed state transitions and enforces completeness of change so every mutation requires an explicit justification.

If this is right

Every certification yields a human-readable trace in which each step can be independently challenged.
Structured judges detect hidden premises at 90.6 percent versus 62.5 percent for holistic judges.
Fabricated citations are caught at 100 percent versus 90 percent.
The method and holistic judging succeed on largely disjoint sets of problems.
Certified precision reaches 97.1 percent on the GPQA Diamond subset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same state-transition discipline could be applied to longer multi-step tasks where coherence failures accumulate.
Traces might support targeted human editing by isolating the exact transition that introduced an error.
Combining the rewrite layer with existing formal proof assistants could raise coverage on problems that partially admit formalization.
The approach suggests a route to auditing reasoning chains that remain too informal for current formal tools.

Load-bearing premise

The completeness-of-change invariant can be reliably enforced during the rewriting process without missing or misclassifying hidden premises in informal reasoning.

What would settle it

Manual inspection of a sample of Theoria-certified traces that reveals a substantial fraction containing undetected hidden premises or fabricated citations.

read the original abstract

When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p= 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n= 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Theoria sketches an auditable rewrite-based verifier that shows an edge on hidden-premise errors, but the abstract gives no algorithm or enforcement details for the central invariant.

read the letter

The new piece is a verification loop that rewrites a candidate solution into typed state transitions, each tied to an explicit justification, with a completeness-of-change rule meant to force hidden premises into the open. On the reported HLE gold set it certifies 105 of 185 problems at 91.4% precision and produces readable traces; on the 95 poisoned proofs it beats holistic judges by 11.5 points overall and 28 points on hidden premises, with the difference statistically flagged.

That breakdown by error class is the clearest contribution. It matches the prediction that structure should help exactly where holistic scoring is weakest, and the complementarity numbers (Jaccard 0.14-0.36) are useful to see.

The soft spot is exactly where the stress-test note lands. The abstract states the invariant but supplies no rewriting procedure, no typing rules, and no account of how state differences are detected or classified. Without those steps it is impossible to judge whether the reported precision and adversarial gap come from the claimed mechanism or from prompt engineering that happens to work on these particular tests. The Wilson intervals and p-value therefore rest on an unexamined enforcement claim.

This is aimed at groups building evaluation pipelines for science or high-stakes decisions who already know the limits of both formal assistants and scalar judges. A reader who wants a concrete middle path will find the error-class results worth looking at, but anyone needing to reproduce or extend the method will hit the missing implementation first.

I would not send it to referees in its current form; the central claim needs the algorithm and any inter-annotator checks on the rewriting step before the numbers can be taken as evidence rather than illustration.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Theoria, a verification architecture that rewrites AI candidate solutions into sequences of typed state transitions with explicit justifications (citations, computations, or problem-given facts), enforcing a 'completeness of change' invariant to ensure every difference between consecutive proof states is accounted for, thereby surfacing hidden premises. It reports certifying 105 out of 185 HLE-Verified Gold problems at 91.4% strict precision (Wilson CI [84.5%, 95.4%]), with human-readable proof traces; complementary performance to holistic LLM judges (Jaccard 0.14-0.36); and superior detection on 95 adversarial poisoned proofs (94.7% vs 83.2%, p=0.0017), with a 28 pp advantage on hidden premises.

Significance. If the rewriting process reliably enforces the invariant without missing hidden premises, the approach could provide a valuable auditable alternative to both formal proof systems and scalar LLM judges. The empirical results on expert-level benchmarks (HLE, GPQA) and the targeted advantage on predicted error classes (hidden premises, fabricated citations) in adversarial testing lend support to the method's potential. The generation of independently challengeable proof traces is a clear strength, as is the use of external benchmarks without apparent parameter fitting.

major comments (2)

Abstract: Reports concrete precision numbers (91.4% on HLE-Verified Gold) and statistical comparisons (p=0.0017, 28 pp gap on hidden premises) but provides no description of the rewriting algorithm, typing mechanism, or how states are constructed, leaving the central claim of enforcing the completeness of change invariant without visible derivation support.
Verification architecture description: The foundational invariant of completeness of change (every difference between consecutive proof states must be accounted for by an explicit justification) is presented as reliably enforceable in informal text, but without a formal definition of state differences, a decision procedure, or inter-annotator study, it is unclear how the reported advantages (e.g., 90.6% vs 62.5% on hidden premises) are achieved; this is load-bearing for interpreting the Wilson CIs and adversarial results.

minor comments (2)

Results section: The Jaccard overlap values (0.14-0.36) between Theoria and holistic judges are reported but without specifying the exact problems or calculation method.
Notation for typed state transitions could be clarified with a concrete example early in the paper to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where greater clarity would strengthen the manuscript. We address each major comment below and outline targeted revisions to improve the presentation of the core architecture while preserving the empirical contributions.

read point-by-point responses

Referee: Abstract: Reports concrete precision numbers (91.4% on HLE-Verified Gold) and statistical comparisons (p=0.0017, 28 pp gap on hidden premises) but provides no description of the rewriting algorithm, typing mechanism, or how states are constructed, leaving the central claim of enforcing the completeness of change invariant without visible derivation support.

Authors: The abstract is deliberately concise to comply with length constraints and to focus on results. The rewriting algorithm, typing of state transitions, and state construction procedure are described in detail in Section 3 of the manuscript. To improve accessibility, we will revise the abstract to include a brief clause such as 'via rewriting into sequences of typed state transitions licensed by explicit justifications' that directly references the mechanisms supporting the completeness-of-change invariant. revision: yes
Referee: Verification architecture description: The foundational invariant of completeness of change (every difference between consecutive proof states must be accounted for by an explicit justification) is presented as reliably enforceable in informal text, but without a formal definition of state differences, a decision procedure, or inter-annotator study, it is unclear how the reported advantages (e.g., 90.6% vs 62.5% on hidden premises) are achieved; this is load-bearing for interpreting the Wilson CIs and adversarial results.

Authors: The manuscript presents the invariant through the operational rewriting process and validates it empirically via the adversarial poisoned-proof experiments, which were explicitly constructed to probe the predicted error classes (hidden premises and fabricated citations). The 28 pp advantage on hidden premises is therefore tied directly to the architecture's requirement for explicit justifications. We agree that a formal definition of state differences and a decision procedure would strengthen the theoretical section; we will add a new subsection providing these. An inter-annotator study on the rewriting process was not performed; we will explicitly note this as a limitation and indicate it as future work, while retaining the adversarial results as the primary empirical support for the reported advantages. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical benchmarks

full rationale

The paper presents Theoria as an architecture enforcing the completeness-of-change invariant during rewriting of informal proofs into auditable state transitions. All reported results (105/185 certifications at 91.4% precision on HLE-Verified Gold, 97.1% on GPQA Diamond, 94.7% vs 83.2% on adversarial set) are direct empirical measurements on held-out external datasets with no fitted parameters, no predictions derived from the same data used for tuning, and no load-bearing self-citations. The invariant itself is stated as an explicit design choice rather than derived from prior results by the same authors. No equation or step reduces a claimed outcome to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review; only the completeness-of-change invariant is explicitly named as foundational. No free parameters or invented entities beyond the system name itself are described.

axioms (1)

domain assumption Completeness of change: every difference between consecutive proof states must be accounted for by an explicit justification (citation, computation, or problem-given fact).
Stated as the foundational invariant in the abstract.

invented entities (1)

Theoria verification architecture no independent evidence
purpose: Rewrite candidate solutions into auditable typed state transitions
New system introduced by the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5878 in / 1465 out tokens · 29526 ms · 2026-07-02T12:10:14.130656+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 10 internal anchors

[1]

Lean project.https://lean-lang.org/. H. Xin et al. DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data.arXiv:2405.14333,

work page arXiv
[2]

Advancing Mathematics Research with AI-Driven Formal Proof Search

G. Tsoukalas et al. Advancing mathematics research with AI-driven formal proof search. arXiv:2605.22763,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Aristotle: IMO-level Automated Theorem Proving

19 T. Achim et al. Aristotle: IMO-level automated theorem proving.arXiv:2510.01346,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Chen et al

E. Chen et al. Dead ends in square-free digit walks.arXiv:2602.05095,

work page arXiv
[5]

Let's Verify Step by Step

(Proof fully formalized in Lean/Mathlib and produced automatically by AxiomProver.) H. Lightman et al. Let’s verify step by step.arXiv:2305.20050,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

P. Wang et al. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv:2312.08935,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Yuan et al

L. Yuan et al. Free process rewards without process labels.arXiv:2412.01981,

work page arXiv
[8]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.arXiv:2306.05685,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

A Survey on LLM-as-a-Judge

J. Gu et al. A survey on LLM-as-a-judge.arXiv:2411.15594,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Ling et al

Z. Ling et al. Deductive verification of chain-of-thought reasoning.arXiv:2306.03872,

work page arXiv
[11]

Zhou and L

K. Zhou and L. Zhang. Step-wise formal verification for LLM-based mathematical problem solving (MATH-VF).arXiv:2505.20869,

work page arXiv
[12]

J. H. Kirchner et al. Prover-verifier games improve legibility of LLM outputs.arXiv:2407.13692,

work page arXiv
[13]

Reinforced Self-Training (ReST) for Language Modeling

C. Gulcehre et al. Reinforced self-training (ReST) for language modeling.arXiv:2308.08998,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Zhang et al

D. Zhang et al. ReST-MCTS*: LLM self-training via process reward guided tree search. arXiv:2406.03816,

work page arXiv
[15]

Humanity's Last Exam

L. Phan et al. Humanity’s last exam.arXiv:2501.14249,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Zhai et al

W. Zhai et al. HLE-Verified: A systematic verification and structured revision of Humanity’s Last Exam.arXiv:2602.13964,

work page arXiv
[17]

Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

W. Yuan et al. Verify before you commit: Towards faithful reasoning in LLM agents via self-auditing (SAVeR).arXiv:2604.08401,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Fang et al

J. Fang et al. Graph of verification: Structured verification of LLM reasoning with directed acyclic graphs.arXiv:2506.12509,

work page arXiv
[19]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

D. Rein et al. GPQA: A graduate-level Google-proof Q&A benchmark.arXiv:2311.12022,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Lean project.https://lean-lang.org/. H. Xin et al. DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data.arXiv:2405.14333,

work page arXiv

[2] [2]

Advancing Mathematics Research with AI-Driven Formal Proof Search

G. Tsoukalas et al. Advancing mathematics research with AI-driven formal proof search. arXiv:2605.22763,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Aristotle: IMO-level Automated Theorem Proving

19 T. Achim et al. Aristotle: IMO-level automated theorem proving.arXiv:2510.01346,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Chen et al

E. Chen et al. Dead ends in square-free digit walks.arXiv:2602.05095,

work page arXiv

[5] [5]

Let's Verify Step by Step

(Proof fully formalized in Lean/Mathlib and produced automatically by AxiomProver.) H. Lightman et al. Let’s verify step by step.arXiv:2305.20050,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

P. Wang et al. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv:2312.08935,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Yuan et al

L. Yuan et al. Free process rewards without process labels.arXiv:2412.01981,

work page arXiv

[8] [8]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

L. Zheng et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.arXiv:2306.05685,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

A Survey on LLM-as-a-Judge

J. Gu et al. A survey on LLM-as-a-judge.arXiv:2411.15594,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Ling et al

Z. Ling et al. Deductive verification of chain-of-thought reasoning.arXiv:2306.03872,

work page arXiv

[11] [11]

Zhou and L

K. Zhou and L. Zhang. Step-wise formal verification for LLM-based mathematical problem solving (MATH-VF).arXiv:2505.20869,

work page arXiv

[12] [12]

J. H. Kirchner et al. Prover-verifier games improve legibility of LLM outputs.arXiv:2407.13692,

work page arXiv

[13] [13]

Reinforced Self-Training (ReST) for Language Modeling

C. Gulcehre et al. Reinforced self-training (ReST) for language modeling.arXiv:2308.08998,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Zhang et al

D. Zhang et al. ReST-MCTS*: LLM self-training via process reward guided tree search. arXiv:2406.03816,

work page arXiv

[15] [15]

Humanity's Last Exam

L. Phan et al. Humanity’s last exam.arXiv:2501.14249,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Zhai et al

W. Zhai et al. HLE-Verified: A systematic verification and structured revision of Humanity’s Last Exam.arXiv:2602.13964,

work page arXiv

[17] [17]

Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

W. Yuan et al. Verify before you commit: Towards faithful reasoning in LLM agents via self-auditing (SAVeR).arXiv:2604.08401,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Fang et al

J. Fang et al. Graph of verification: Structured verification of LLM reasoning with directed acyclic graphs.arXiv:2506.12509,

work page arXiv

[19] [19]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

D. Rein et al. GPQA: A graduate-level Google-proof Q&A benchmark.arXiv:2311.12022,

work page internal anchor Pith review Pith/arXiv arXiv