Theorist Toolbox: Tools for Agent Based LLM-assisted economic theory Research
Pith reviewed 2026-06-26 09:58 UTC · model grok-4.3
The pith
External verification, not model capability, determines the reliability of LLM-assisted economic theory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the bottleneck in LLM-assisted economic theory is trust rather than production, and that the design choice that matters is how the work is verified. In the worked example, none of the three runs produced a strict direct-revelation VCG/Clarke mechanism. Even so, two runs converged on the same effective-resistance externality kernel, the adversarial pair caught three of its own false claims, the reviewer gate rejected a sub-goal, and the most polished output turned out to be the least verified.
What carries the argument
Three verification protocols carry the argument: a single disciplined pass, an adversarial prover-verifier pair, and a structured multi-agent project with a reviewer gate.
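The control flow of the three protocols can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `single_pass`, `adversarial_pair`, `gated_project`, `Result`, and the toy prover and checker are all hypothetical, and real runs would call language models where the stubs appear.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical types: a prover maps a task to a claim; a checker returns
# an objection string, or None when it finds nothing to object to.
Prover = Callable[[str], str]
Checker = Callable[[str], Optional[str]]


@dataclass
class Result:
    claim: str
    accepted: bool
    objections: List[str]


def single_pass(task: str, prove: Prover, self_check: Checker) -> Result:
    """Protocol 1 (sketch): one model proposes and checks its own work once."""
    claim = prove(task)
    objection = self_check(claim)
    return Result(claim, objection is None, [objection] if objection else [])


def adversarial_pair(task: str, prove: Prover, refute: Checker,
                     max_rounds: int = 3) -> Result:
    """Protocol 2 (sketch): a second model tries to refute each attempt."""
    objections: List[str] = []
    claim = prove(task)
    for _ in range(max_rounds):
        objection = refute(claim)
        if objection is None:
            return Result(claim, True, objections)
        objections.append(objection)
        # The prover revises with the objection in its context.
        claim = prove(f"{task}\nPrevious attempt: {claim}\nObjection: {objection}")
    return Result(claim, False, objections)


def gated_project(subgoals: List[str], prove: Prover, review: Checker) -> List[Result]:
    """Protocol 3 (sketch): every sub-goal must pass a reviewer gate."""
    results = []
    for goal in subgoals:
        claim = prove(goal)
        objection = review(claim)
        results.append(Result(claim, objection is None, [objection] if objection else []))
    return results


# Toy stand-ins for the models (assumptions for illustration only).
def toy_prove(task: str) -> str:
    return "2 + 2 = 4" if "Objection:" in task else "2 + 2 = 5"


def toy_check(claim: str) -> Optional[str]:
    return None if claim.endswith("= 4") else "arithmetic error"
```

With these stubs, `adversarial_pair("add", toy_prove, toy_check)` records one objection and ends on the corrected claim, while `single_pass` with a checker that never objects accepts the false claim unchanged. The sketch is only meant to show where external verification enters each protocol.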
If this is right
- The adversarial pair caught three of its own false claims in the demonstration.
- The multi-agent reviewer gate rejected a flawed sub-goal.
- Convergent discovery of the same effective-resistance externality kernel occurred in two runs.
- Polish did not track rigor: the most finished-looking output was the least verified.
- The requested strict direct-revelation VCG/Clarke mechanism was not produced, possibly because no such mechanism exists.
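For readers unfamiliar with the term, the standard graph-theoretic effective resistance between two nodes can be computed as in the sketch below. This page does not give the paper's definition of the externality kernel, so the code shows only the textbook quantity on a connected weighted graph; the function name `effective_resistance` and the edge-list format are assumptions made for illustration.

```python
from fractions import Fraction


def effective_resistance(n, edges, a, b):
    """Effective resistance between nodes a and b of a connected graph.

    edges is a list of (u, v, conductance) triples on nodes 0..n-1.
    Node b is grounded, one unit of current is injected at a, and the
    reduced Laplacian system is solved for the potential at a.
    """
    if a == b:
        return Fraction(0)
    # Build the weighted graph Laplacian with exact rational arithmetic.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v, c in edges:
        c = Fraction(c)
        lap[u][u] += c
        lap[v][v] += c
        lap[u][v] -= c
        lap[v][u] -= c
    # Drop the grounded node's row and column; append the right-hand side.
    keep = [i for i in range(n) if i != b]
    aug = [[lap[i][j] for j in keep] + [Fraction(1 if i == a else 0)] for i in keep]
    m = len(keep)
    # Gauss-Jordan elimination on the augmented matrix.
    for col in range(m):
        pivot = next(r for r in range(col, m) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # The potential at a equals the effective resistance for unit current.
    return aug[keep.index(a)][m]
```

On a three-node path with unit conductances, `effective_resistance(3, [(0, 1, 1), (1, 2, 1)], 0, 2)` gives 2 (two unit resistors in series); on a unit triangle, the resistance between any two nodes is 2/3. Exact fractions are used so that small examples can be checked by hand.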
Where Pith is reading between the lines
- Applying these protocols to other economic theory problems could test their generality beyond the grade inflation model.
- Different model combinations in the adversarial pair might yield varying error-catching rates.
- Integrating these verification steps could shorten the time from idea to initial draft in theoretical economics.
Load-bearing premise
The assumption that the specific example of designing a Groves/Pigouvian mechanism for the Gans-Kominers eigengrade model serves as a representative test case for evaluating the three verification protocols in nontrivial economic theory.
What would settle it
A test on a different nontrivial economic theory problem where the ground truth is independently known, checking if the adversarial pair catches a similar number of false claims or if the single pass performs equivalently.
Original abstract
Empirical economists often start their projects with a toolbox. Shared packages, replication archives, and circulated guides shorten the time between an idea and a rough initial draft. Theorists, on the other hand, largely start from a blank page. By 2026, large language models can a produce and check nontrivial mathematics. The can also hallucinate and write wrong claims very convincingly. The current bottleneck on machine-assisted theory is no longer production but trust: a model will claim to prove a false theorem as readily as a true one. Building on recent attempts in mathematics, I present three methods for doing economic theory with a language model. These methods differ on how the work is verified: a single disciplined pass, an adversarial prover-verifier pair (Claude Opus 4.8 proposing, OpenAI Codex refuting), and a structured multi-agent project with a reviewer gate (inspired by the Google co-mathematician architecture). I demonstrate these protocols on one open worked example: designing a Groves/Pigouvian incentive mechanism for the Gans-Kominers eigengrade model of grade inflation. None of the three runs produced a strict direct-revelation VCG/Clarke mechanism (as requested, perhaps due to the non-existence of such a mechanism). Three phenomena recur. First, convergent discovery: two runs derive the same effective-resistance externality kernel on opposite margins. Second, adversarial verification is load-bearing: the pair caught three of its own false claims and the gate rejected a sub-goal. Third, polish is not rigor: the most finished-looking output was the least verified. The methodological takeaway is that external verification, not model capability, is the design variable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents three verification protocols for LLM-assisted economic theory (single disciplined pass, adversarial prover-verifier pair, and multi-agent project with reviewer gate) and demonstrates them on one open problem: designing a Groves/Pigouvian mechanism for the Gans-Kominers eigengrade model. It reports three recurring phenomena (convergent discovery of an effective-resistance externality kernel, adversarial catching of false claims, and polish not equaling rigor) and concludes that external verification, not model capability, is the controlling design variable.
Significance. If the observations hold and generalize, the work could usefully redirect attention in LLM-assisted theory toward verification architectures rather than prompt engineering or model scale. The concrete protocols (including the adversarial pair and gate) and the choice of an open mechanism-design task are strengths that could be built upon by other researchers.
major comments (2)
- [Abstract / demonstration] Abstract and demonstration section: the central claim that 'external verification, not model capability, is the design variable' rests on three protocol runs on a single open mechanism-design task where the requested strict VCG/Clarke mechanism may not exist; without additional cases, model ablations, or non-LLM baselines, the reported phenomena (convergent discovery, error catching) cannot be shown to be general rather than idiosyncratic to this instance.
- [Abstract] Abstract: the paper states that the adversarial pair 'caught three of its own false claims' and the gate 'rejected a sub-goal,' yet supplies no details on the mathematical steps, the specific erroneous claims, or the reasoning establishing non-existence of the requested mechanism; this prevents assessment of whether the verification protocols performed as described.
minor comments (1)
- [Abstract] Abstract contains typographical errors: 'can a produce' should read 'can produce' and 'The can also hallucinate' should read 'They can also hallucinate'.
Simulated Author's Rebuttal
We thank the referee for the constructive report. The comments correctly identify limitations in scope and detail; we address them point-by-point below and will revise accordingly.
- Referee: [Abstract / demonstration] Abstract and demonstration section: the central claim that 'external verification, not model capability, is the design variable' rests on three protocol runs on a single open mechanism-design task where the requested strict VCG/Clarke mechanism may not exist; without additional cases, model ablations, or non-LLM baselines, the reported phenomena (convergent discovery, error catching) cannot be shown to be general rather than idiosyncratic to this instance.
  Authors: We agree the paper is a methods demonstration on one open problem rather than a multi-case empirical study. The central claim will be revised in the abstract and conclusion to present the phenomena as observations from this specific demonstration, with explicit language noting that generality requires further work. The possible non-existence of the requested VCG mechanism is retained as part of the test case, as it illustrates protocol behavior on an unsolved task.
  Revision: yes
- Referee: [Abstract] Abstract: the paper states that the adversarial pair 'caught three of its own false claims' and the gate 'rejected a sub-goal,' yet supplies no details on the mathematical steps, the specific erroneous claims, or the reasoning establishing non-existence of the requested mechanism; this prevents assessment of whether the verification protocols performed as described.
  Authors: The demonstration section contains the full traces of the three false claims, the adversarial refutations, the rejected sub-goal, and the reasoning on mechanism non-existence. We will revise the abstract to include one concrete example of a caught claim and will add a concise summary table of verification events to the demonstration section for easier assessment.
  Revision: yes
Circularity Check
No circularity; methodological observations drawn directly from described runs
The paper describes three verification protocols for LLM-assisted theory work and applies them to one explicit worked example (Groves/Pigouvian mechanism for the Gans-Kominers model). The takeaway that external verification is the controlling design variable is presented as a direct observation from the three protocol runs (convergent discovery, adversarial catching of errors, gate rejection). No equations, fitted parameters, or quantitative predictions exist that could reduce to the paper's own inputs by construction. No self-citations are invoked as load-bearing support for the central claim. The analysis is therefore self-contained as a report on the stated experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can produce and check nontrivial mathematics but will also produce false claims convincingly.