arxiv: 2604.19558 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

On Reasoning-Centric LLM-based Automated Theorem Proving

Yican Sun , Chengwei Shi , Hangzhou Lyu , Yingfei Xiong

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:23 UTC · model grok-4.3

classification 💻 cs.SE

keywords automated theorem provingLLM proof agentsRocqproof reflectionproof planninglemma retrievalCoqStoq benchmark

0 comments

The pith

Reasoning about plans and critiquing tactics lets an LLM prover solve 22.58 percent more theorems with the same number of model calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing LLM-based proof agents for Rocq fail to use the models' full reasoning power for high-level strategy and self-critique. It introduces two techniques that force the model to generate an overall proof plan before retrieving lemmas and to reflect on each proposed tactic to produce failure summaries when errors are likely. These steps increase the number of LLM invocations per attempt, yet the system still delivers a 22.58 percent relative gain in proved theorems on the CoqStoq benchmark when the total budget of calls is held fixed against prior agents. The result suggests that directing compute toward strategic reasoning rather than raw tactic generation improves automated proof search.

Core claim

ReCent-Prover incorporates validation with reflection, allowing the LLM to scrutinize its own tactics and synthesize failure summaries that filter out likely errors early, and retrieval with planning, which retrieves lemmas and proofs conditioned on an LLM-generated proof plan rather than local subgoal similarity. Evaluated on the CoqStoq benchmark, the approach produces a 22.58 percent relative improvement in the number of proved theorems over the previous state-of-the-art, even when the total number of LLM invocations is kept identical.

What carries the argument

Validation with reflection and retrieval with planning, the two techniques that shift LLM use from direct tactic generation toward strategic planning and self-critique.

If this is right

Proof agents gain from early filtering of flawed tactics via explicit reflection and failure summarization.
Conditioning lemma retrieval on a high-level proof plan produces more relevant premises than subgoal-only similarity.
Performance advantages remain even after equalizing the total count of LLM calls across systems.
LLMs can be guided to allocate reasoning effort to planning and critique instead of low-level tactic output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same planning-plus-reflection pattern could be tested on other proof assistants such as Lean or Isabelle to check generality.
Future agents might further improve by feeding reflection summaries back into the next planning step for iterative refinement.
Allocating a larger share of the fixed LLM budget to plan generation and critique steps may outperform simply generating more tactics.

Load-bearing premise

The measured performance gain is caused by the reflection and planning techniques themselves rather than by unstated differences in implementation, tuning, or baseline systems.

What would settle it

Re-running the CoqStoq evaluation with fully disclosed code, identical baseline implementations, and the same total LLM invocation budget would confirm whether the 22.58 percent gain holds.

Figures

Figures reproduced from arXiv: 2604.19558 by Chengwei Shi, Hangzhou Lyu, Yican Sun, Yingfei Xiong.

**Figure 1.** Figure 1: Running Example Proof Library. A proof library is a collection of (i) previously established lemmas imported at the point where the proof of dl_align_app begins, and (ii) the proofs of these imported lemmas. These lemmas can be reused to significantly simplify the proof. Moreover, existing proofs in the library may provide useful proof patterns or insights that can be adapted when constructing the formal … view at source ↗

**Figure 2.** Figure 2: Workflow of ReCent-Prover. In this section, we present the workflow of ReCent-Prover (illustrated in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Motivating Example for Retrieval with Planning [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Diagram of Retrieval with Planning. 5.2 Details of Retrieval Procedure Below, we first present the retrieval procedure for lemmas, and then briefly describe the retrieval procedure for proofs, which follows the same overall structure. Pre-processing Lemma Databases. ReCent-Prover requires a repository of lemmas and proofs. To avoid data leakage, at query time we restrict candidates to the lemmas that are … view at source ↗

read the original abstract

Automated theorem proving is fundamental to formal methods, and the recent trend is to integrate large language models (LLMs) and proof assistants to form effective proof agents. While existing proof agents show promising performance, they inadequately leverage reasoning capabilities of modern LLMs in high-level planning and self-critique. We argue that proof agents should not merely generate tactics but also reason strategically about proof plans and critically evaluate their own proposals. This paper introduces ReCent-Prover, a reasoning-centric LLM-based proof agent for Rocq that addresses two critical limitations in current systems. First, we present validation with reflection, enabling LLMs to scrutinize their generated tactics and synthesize failure summaries when reflection identifies potential errors, filtering out potentially misapplied tactics earlier. Second, we propose retrieval with planning, which conditions retrieval on LLM-generated proof plans rather than subgoal similarity, retrieving lemmas and proofs that align with the anticipated proof strategy. Both techniques increase the number of invocations of LLMs. However, when evaluated on the CoqStoq benchmark, even under the same budget of LLM invocations, ReCent-Prover achieves a 22.58% relative improvement in the number of proved theorems over the previous state-of-the-art, demonstrating that our reasoning-centric design significantly enhances automated theorem proving capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReCent-Prover adds reflection on tactics and plan-guided retrieval to LLM Coq agents, claiming a 22.58% relative gain on CoqStoq under matched LLM call budgets, but the experiments need to show the gain is not from other implementation differences.

read the letter

ReCent-Prover adds two reasoning steps to LLM-based Coq provers: one where the model reflects on and validates its own tactic suggestions, and another where it first sketches a proof plan to guide what lemmas to retrieve. The paper reports that this setup proves 22.58% more theorems than the prior best on the CoqStoq benchmark, even when the total number of LLM calls is held constant.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ReCent-Prover, an LLM-based proof agent for Rocq that adds two reasoning-centric components to existing tactic-generation pipelines: validation with reflection (LLM self-critique of proposed tactics plus failure-summary synthesis) and retrieval with planning (conditioning lemma retrieval on an LLM-generated high-level proof plan rather than subgoal similarity). The central empirical claim is that, on the CoqStoq benchmark and under an identical total budget of LLM invocations, these additions yield a 22.58% relative increase in the number of proved theorems over the prior state-of-the-art.

Significance. If the performance delta is shown to be robustly attributable to the two proposed techniques rather than implementation or baseline differences, the work would provide concrete evidence that explicit high-level planning and self-reflection improve LLM-driven automated theorem proving. This would be a useful data point for the design of future hybrid LLM-formal-systems agents.

major comments (3)

[§5 and Table 2] §5 (Evaluation) and Table 2: The manuscript reports a 22.58% relative improvement under a fixed LLM-invocation budget, yet provides no explicit protocol for how total calls are counted, allocated, or early-stopped across ReCent-Prover and the re-implemented baselines. Because both new techniques increase per-attempt LLM usage, the absence of this protocol is load-bearing for the central claim.
[§5.2] §5.2 (Ablation studies): No ablation isolating validation-with-reflection from retrieval-with-planning is presented. Without these controlled removals it is impossible to attribute the observed gain specifically to the reasoning-centric design rather than other prompt-engineering or search-heuristic choices.
[§5.1] §5.1 (Baseline re-implementation): The comparison to prior SOTA systems does not state whether those systems were re-executed inside the authors' code base with identical LLM backend, temperature, and call-budgeting rules. If the baselines were taken from published numbers only, differences in engineering details could explain the delta.

minor comments (2)

[Introduction] The paper uses both 'Coq' and 'Rocq' without an explicit note on the recent renaming; a single consistent term plus a parenthetical clarification would improve readability.
[Figure 1] Figure 1 (system overview) would benefit from an explicit call-count annotation on each arrow so readers can immediately see where the extra LLM invocations occur.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We respond to each major comment point by point below, clarifying our experimental setup and committing to revisions that enhance the transparency of our results.

read point-by-point responses

Referee: [§5 and Table 2] §5 (Evaluation) and Table 2: The manuscript reports a 22.58% relative improvement under a fixed LLM-invocation budget, yet provides no explicit protocol for how total calls are counted, allocated, or early-stopped across ReCent-Prover and the re-implemented baselines. Because both new techniques increase per-attempt LLM usage, the absence of this protocol is load-bearing for the central claim.

Authors: We agree that an explicit protocol is essential for validating the central claim. In our implementation, the total LLM invocation budget is fixed at 100 calls per theorem (as used in prior work for fair comparison). Each call to the LLM for tactic generation, reflection, planning, or retrieval counts as one invocation. The agent early-stops upon finding a proof or exhausting the budget. We will revise §5 to include a detailed description of this protocol, including how calls are allocated across components and the early-stopping logic, ensuring the comparison is under identical conditions. revision: yes
Referee: [§5.2] §5.2 (Ablation studies): No ablation isolating validation-with-reflection from retrieval-with-planning is presented. Without these controlled removals it is impossible to attribute the observed gain specifically to the reasoning-centric design rather than other prompt-engineering or search-heuristic choices.

Authors: We acknowledge the value of isolating each component. While our current ablations in §5.2 remove combinations of features, we will add new experiments that disable validation-with-reflection alone and retrieval-with-planning alone, keeping the other intact. This will allow precise attribution of the performance gains to each reasoning-centric technique. revision: yes
Referee: [§5.1] §5.1 (Baseline re-implementation): The comparison to prior SOTA systems does not state whether those systems were re-executed inside the authors' code base with identical LLM backend, temperature, and call-budgeting rules. If the baselines were taken from published numbers only, differences in engineering details could explain the delta.

Authors: All baseline results in Table 2 were obtained by re-implementing the prior SOTA systems in our codebase, using the identical LLM (the same model and version as for ReCent-Prover), temperature, and the same fixed LLM invocation budget protocol. Published numbers served only as a reference for expected performance. We will explicitly document this re-implementation process in §5.1 and provide additional details in the supplementary material to eliminate any ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark result stands independent of inputs

full rationale

The paper introduces two engineering techniques (validation with reflection; retrieval with planning) for an LLM-based proof agent and reports an empirical performance delta (22.58% relative improvement on CoqStoq under fixed LLM-call budget) against prior SOTA. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The central claim is a controlled benchmark comparison whose validity rests on experimental controls (budget matching, baseline re-implementation) rather than any reduction to the paper's own inputs by construction. Self-citations, if present in the full text, are not load-bearing for the reported result. This is the normal non-circular outcome for an empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical study of LLM agent design and introduces no mathematical free parameters, axioms, or invented entities; the two techniques are algorithmic proposals evaluated on an existing benchmark.

pith-pipeline@v0.9.0 · 5525 in / 1131 out tokens · 51896 ms · 2026-05-10T02:23:07.547562+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 14 canonical work pages · 2 internal anchors

[1]

SpringerVerlag (2004)

Bertot, Y., Casteran, P.: Interactive Theorem Proving and Program Development. SpringerVerlag (2004)

2004
[2]

Confidence in Assurance 2.0 Cases

Blaauwbroek, L., Urban, J., Geuvers, H.: The tactician: A seamless, interactive tac- tic learner and prover for coq. In: Intelligent Computer Mathematics: 13th Interna- tional Conference, CICM 2020, Bertinoro, Italy, July 26–31, 2020, Proceedings. p. 271–277. Springer-Verlag, Berlin, Heidelberg (2020). https://doi.org/10.1007/978- 3-030-53518-6_17,https:/...

work page doi:10.1007/978- 2020
[3]

In: Bjørner, N., Sofronie-Stokkermans, V

Blanchette, J.C., Böhme, S., Paulson, L.C.: Extending sledgehammer with smt solvers. In: Bjørner, N., Sofronie-Stokkermans, V. (eds.) Automated Deduction – CADE-23. pp. 116–130. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)

2011
[4]

Journal of Machine Learning Research24(240), 1–113 (2023),https://jmlr.org/papers/v24/22-1144.html

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prab- hakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., ...

2023
[5]

In: Klein, G., Gamboa, R

Cohen, C., Mörtberg, A.: A coq formalization of finitely presented modules. In: Klein, G., Gamboa, R. (eds.) Interactive Theorem Proving. pp. 193–208. Springer International Publishing, Cham (2014) On Reasoning-Centric LLM-based Automated Theorem Proving 21

2014
[6]

Czajka, z., Kaliszyk, C.: Hammer for coq: Automation for dependent type theory. J. Autom. Reason.61(1–4), 423–453 (Jun 2018). https://doi.org/10.1007/s10817- 018-9458-4,https://doi.org/10.1007/s10817-018-9458-4

work page doi:10.1007/s10817- 2018
[7]

Doczkal, C., Smolka, G.: Regular language representations in the con- structive type theory of coq. J. Autom. Reason.61(1–4), 521–553 (Jun 2018). https://doi.org/10.1007/s10817-018-9460-x,https://doi.org/10.1007/ s10817-018-9460-x

work page doi:10.1007/s10817-018-9460-x 2018
[8]

In: Chandra, S., Blincoe, K., Tonella, P

First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repair with large language models. In: Proceedings of the 31st ACM Joint Eu- ropean Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). pp. 1229–1241. ACM, San Francisco, CA, USA (2023). https://doi.org/10.1145/3611643.3616...

work page doi:10.1145/3611643.3616243 2023
[9]

Notices of the Ameri- can Mathematical Society55(11), 1382–1393 (2008),https://www.ams.org/ journals/notices/200811/tx081101382p.pdf

Gonthier, G.: Formal proof—the four-color theorem. Notices of the Ameri- can Mathematical Society55(11), 1382–1393 (2008),https://www.ams.org/ journals/notices/200811/tx081101382p.pdf

2008
[10]

arXiv preprint arXiv:2210.12283 , year=

Jiang, A.Q., Welleck, S., Zhou, J.P., Li, W., Liu, J., Jamnik, M., Lacroix, T., Wu, Y., Lample, G.: Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In: International Conference on Learning Representations (2023), https://arxiv.org/abs/2210.12283

work page arXiv 2023
[11]

Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification

Kasibatla, S.R., Agarwal, A., Brun, Y., Lerner, S., Ringer, T., First, E.: Cobblestone: A divide-and-conquer approach for automating formal verifi- cation (2025). https://doi.org/10.48550/arXiv.2410.19940,https://arxiv.org/ abs/2410.19940, arXiv v3 (Aug 2025)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.19940 2025
[12]

URL https: //doi.org/10.1145/1538788.1538814

Leroy, X.: Formal verification of a realistic compiler. Communications of the ACM52(7), 107–115 (2009). https://doi.org/10.1145/1538788.1538814,https: //xavierleroy.org/publi/compcert-CACM.pdf

work page doi:10.1145/1538788.1538814 2009
[13]

Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

Lin, Y., Tang, S., Lyu, B., Yang, Z., Chung, J.H., Zhao, H., Jiang, L., Geng, Y., Ge, J., Sun, J., Wu, J., Gesi, J., Lu, X., Acuna, D., Yang, K., Lin, H., Choi, Y., Chen, D., Arora, S., Jin, C.: Goedel-prover v2: Scaling formal theorem proving with scaffolded data synthesis and self-correction. arXiv preprint arXiv:2508.03613 (2025),https://arxiv.org/abs/...

work page arXiv 2025
[14]

In: Singh, A., Fazel, M., Hsu, D., Lacoste- Julien,S.,Berkenkamp,F.,Maharaj,T.,Wagstaff,K.,Zhu,J.(eds.)Proceedingsof the 42nd International Conference on Machine Learning

Liu, H., Sun, J., Li, Z., Yao, A.C.: Proofaug: Efficient neural theorem proving via fine-grained proof structure analysis. In: Singh, A., Fazel, M., Hsu, D., Lacoste- Julien,S.,Berkenkamp,F.,Maharaj,T.,Wagstaff,K.,Zhu,J.(eds.)Proceedingsof the 42nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 267, pp. 39568–...

2025
[15]

In: Proceedings of the 37th International Conference on Neural In- formation Processing Systems

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-refine: iterative refinement with self-feedback. In: Proceedings of the 37th International Conference on Neural In- formation Processing Systems. NIPS...

2023
[16]

Cambridge University Press, Cambridge, UK (2008)

Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK (2008)

2008
[17]

minimax.io/news/minimax-m25(2026), accessed: 2026 22 Y

MiniMax: MiniMax-M2.5: Built for real-world productivity.https://www. minimax.io/news/minimax-m25(2026), accessed: 2026 22 Y. Sun et al

2026
[18]

In: 16th International Conference on Interactive Theorem Prov- ing (ITP 2025)

Norman, C., Avigad, J.: Canonical for automated theorem proving in lean. In: 16th International Conference on Interactive Theorem Prov- ing (ITP 2025). Leibniz International Proceedings in Informatics (LIPIcs), vol. 352, pp. 14:1–14:20. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2025). https://doi.org/10.4230/LIPIcs.ITP.2025.14,https://drops.dagstuh...

work page doi:10.4230/lipics.itp.2025.14 2025
[19]

OpenAI: Gpt-4 technical report,https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv
[20]

OpenAI: text-embedding-3-large,https://platform.openai.com/docs/models/ text-embedding-3-large, openAI API embedding model documentation; model released 2024-01-25

2024
[21]

OpenAI: o4-mini: latest reasoning model in the openai o-series (2025),https: //platform.openai.com/docs/models/o4-mini

2025
[22]

In: Piskac, R., Rakamarić, Z

Qian, Y., Clune, J., Barrett, C., Avigad, J.: Lean-auto: An interface between lean 4 and automated theorem provers. In: Piskac, R., Rakamarić, Z. (eds.) Computer Aided Verification. pp. 175–196. Springer Nature Switzerland, Cham (2025)

2025
[23]

Ren, Z.Z., Shao, Z., Song, J., Xin, H., Wang, H., Zhao, W., Zhang, L., Fu, Z., Zhu, Q., Yang, D., Wu, Z.F., Gou, Z., Ma, S., Tang, H., Liu, Y., Gao, W., Guo, D., Ruan, C.: Deepseek-prover-v2: Advancing formal mathematical reasoning via re- inforcement learning for subgoal decomposition (2025),https://arxiv.org/abs/ 2504.21801

work page arXiv 2025
[24]

Saul, and Sorin Lerner

Sanchez-Stern, A., Alhessi, Y., Saul, L., Lerner, S.: Generating correctness proofs with neural networks. In: Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. p. 1–10. MAPL 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394450.3397466,https://doi.org/ 10...

work page doi:10.1145/3394450.3397466 2020
[25]

In: Proceedings of the 37th Inter- national Conference on Neural Information Processing Systems

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: lan- guage agents with verbal reinforcement learning. In: Proceedings of the 37th Inter- national Conference on Neural Information Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

2023
[26]

In: Proceedings of the 1st Conference on Language Modeling (COLM 2024)

Thakur, A., Tsoukalas, G., Wen, Y., Xin, J., Chaudhuri, S.: An in-context learn- ing agent for formal theorem-proving. In: Proceedings of the 1st Conference on Language Modeling (COLM 2024). Philadelphia, PA, USA (Oct 2024),https: //openreview.net/forum?id=V7HRrxXUhN, cOLM 2024, Oct 7–9, 2024

2024
[27]

Ferreira, Sorin Lerner, and Emily First

Thompson, K., Saavedra, N., Carrott, P., Fisher, K., Sanchez-Stern, A., Brun, Y., Ferreira, J.F., Lerner, S., First, E.: Rango: Adaptive retrieval- augmented proving for automated software verification. In: Proceedings of the 47th International Conference on Software Engineering (ICSE). Ottawa, Canada (Apr 2025). https://doi.org/10.48550/arXiv.2412.14063,...

work page doi:10.48550/arxiv.2412.14063 2025
[28]

Yang, K., Swope, A.M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented language models. In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track (2023),https://proceedings.neurips.cc/paper_files/paper/2023/file/ 4441469427094...

2023
[29]

NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023) On Reasoning-Centric LLM-based Automated Theorem Proving 23

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree ofthoughts:deliberateproblemsolvingwithlargelanguagemodels.In:Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023) On Reasoning-Centric LLM-based Automated Theorem Proving 23

2023
[30]

In: The Eleventh Inter- national Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, May 1–5, 2023

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: Re- act: Synergizing reasoning and acting in language models. In: The Eleventh Inter- national Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, May 1–5, 2023. OpenReview.net (2023),https://openreview.net/forum?id=WE_ vluYUL-X

2023
[31]

Your task is to generate a sequence of tactics to prove the given subgoal

Zhang, J., Wang, Q., Ji, X., Liu, Y., Yue, Y., Zhang, F., Zhang, D., Zhou, G., Gai, K.: Leanabell-prover: Posttraining scaling in formal reasoning (2025),https: //arxiv.org/abs/2504.06122 A Omitted Prompts A.1 Prompts for LLM-based Proof Generation [System Prompt] You are an expert in Rocq theorem proving . Your task is to generate a sequence of tactics t...

work page arXiv 2025
[32]

Focus on logical validity and provability , not on finding the optimal proof strategy
[33]

Sun et al

Be c o n s e r v a t i v e : if you ’ re not sure whether so me th in g is provable , mark it as UN CE RT AI N rather than U N P R O V A B L E 26 Y. Sun et al
[34]

Consider that the proof may require advanced t e c h n i q u e s you ’ re not aware of - don ’ t mark so me th in g U N P R O V A B L E unless you ’ re co nf id en t
[35]

Check for c o n t r a d i c t i o n s ca re fu ll y - subtle c o n t r a d i c t i o n s can make goals u n p r o v a b l e
[36]

Consider the c o n s t r u c t i v e nature of Rocq ’ s logic - some c l a s s i c a l l y true s t a t e m e n t s may not be c o n s t r u c t i v e l y provable
[38]

Provide a c t i o n a b l e s u g g e s t i o n s when marking goals as U N P R O V A B L E
[39]

di ff ic ul t

Remember that " di ff ic ul t " does not mean " u n p r o v a b l e " [ User Prompts ] ### Current Goals ... ### Relevant D e f i n i t i o n s ... A.3 Prompts for Induction Schema Evaluation [ System Prompt ] # Rocq I nd uc ti on E v a l u a t i o n Module You are an expert Rocq proof as si st an t s p e c i a l i z i n g in e v a l u a t i n g i nd uc t...
[40]

** Goal Before In du ct io n **: The original proof goal
[41]

** Goal After In du ct io n **: The re su lt in g subgoals after applying i nd uc ti on
[42]

** In du ct io n S t r a t e g i e s **: The specific tactic ( s ) used for in du ct io n strategy ( e . g . , intro + in du ct io n pattern )
[43]

Use the relevant d e f i n i t i o n s to u n d e r s t a n d the r ec ur si ve s tr uc tu re of f un ct io ns and d at at yp es involved in the goal

** Relevant D e f i n i t i o n s **: Key d e f i n i t i o n s from the codebase that are relevant to u n d e r s t a n d i n g the goal s tr uc tu re and d e t e r m i n i n g the a p p r o p r i a t e i nd uc ti on strategy You must evaluate whether the in du ct io n was pe rf or me d reasonably , p a r t i c u l a r l y checking if v ar ia bl es were ...
[44]

Look at the relevant function d e f i n i t i o n s in the goal
[45]

Find the ** toplevel match ** st at em en t ( ignore nested matches )
[46]

The variable being matched at the toplevel is the one you should induct on ** Example 1 - Simple Case :** ‘‘‘ coq Fixpoint plus ( n m : nat ) : nat := match n with (* toplevel match on n *) | O = > m | S p = > S ( plus p m ) end . ‘‘‘ - Toplevel match is on ‘n ‘ - ** Correct in du ct io n :** ‘ in du ct io n n ‘ - ** In co rr ec t in du ct io n :** ‘ in d...
[47]

Focus on whether the in du ct io n h y p o t h e s i s will be strong enough to complete the proof
[48]

Check if u n i v e r s a l l y q u a n t i f i e d va ri ab le s that appear after the i nd uc ti on variable were over - bound ( i n t r o d u c e d before i nd u ct io n when they shouldn ’ t be )
[49]

Consider the st ru ct ur e of the goal and what the i nd uc ti on h y p o t h e s i s needs to prove
[50]

Be specific in your s u g g e s t i o n s - provide actual Rocq tactic syntax
[51]

Always follow the exact output format for p a r s e a b i l i t y
[52]

Look at ‘ Relevant Definitions ‘ why checking ‘ Wrong Variable for Induction ‘
[53]

Think ca re fu ll y for a good answer
[54]

Wrap the code in a ‘‘‘ rocq code block in the s u g g e s t i o n part
[55]

[ User prompts ] ### Goal Before In du ct io n

Provide ONLY ONE better version of tactic sequence . [ User prompts ] ### Goal Before In du ct io n ... ### Goal After In du ct io n ... ### In du ct io n S t r a t e g i e s ... ### Relevant D e f i n i t i o n s ... A.4 Prompts for Building Retrieval Database for Lemmas [System Prompt] You are an expert in Rocq theorem proving . Your task is to generate...