BRIDGE: Building Representations In Domain Guided Program Synthesis

Anima Anandkumar; Carson Eisenach; Dean Foster; Dominique Perrault-Joncas; Robert Joseph George; Udaya Ghai

arxiv: 2511.21104 · v4 · submitted 2025-11-26 · 💻 cs.LG · cs.PL

BRIDGE: Building Representations In Domain Guided Program Synthesis

Robert Joseph George , Carson Eisenach , Udaya Ghai , Dominique Perrault-Joncas , Anima Anandkumar , Dean Foster This is my paper

Pith reviewed 2026-05-17 05:19 UTC · model grok-4.3

classification 💻 cs.LG cs.PL

keywords program synthesislarge language modelsformal verificationLeanpromptingmulti-domain reasoningcode generationtheorem proving

0 comments

The pith

BRIDGE prompting connects generated code to formal specifications and proofs via domain-specific reasoning steps in Lean.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often produce code that cannot be reliably verified in proof assistants like Lean because the outputs for code, specifications, theorems, and proofs drift from each other. BRIDGE addresses this by breaking synthesis into three linked domains—Code, Specification, and Theorem/Proof—and inserting targeted intermediate reasoning at each stage to keep the artifacts consistent. It typically anchors the process in the generated code and uses that as a reference for the other domains. On 178 algorithmic problems across five LLMs, this raises Lean executable correctness by up to 1.5 times compared with direct prompting while using roughly half as many samples for similar results. Fine-tuning models on the intermediate traces further improves Lean success rates by about 1.5 times over training on code alone, and the same style of prompting also raises Python pass rates when focused on specifications.

Core claim

BRIDGE is a structured prompting framework that decomposes program synthesis into three interconnected domains of Code, Specification, and Theorem/Proof, inserting domain-specific intermediate reasoning steps to maintain consistency across the artifacts; in Lean it follows a code-first workflow that treats the generated implementation as a semantic anchor for producing the specification, theorem statement, and proof attempt.

What carries the argument

Domain-specific intermediate reasoning steps that link the Code, Specification, and Theorem/Proof domains while using the generated code as the central semantic reference.

If this is right

Lean executable correctness rises by up to nearly 1.5 times relative to direct prompting on the same problems.
Sample efficiency roughly doubles at comparable output lengths.
Specification-focused prompting raises Python pass rates by as much as 17.5 percentage points.
Supervised fine-tuning on BRIDGE reasoning traces produces nearly 1.5 times higher Lean pass rates than fine-tuning on code alone.
A measurable gap remains between executable correctness and complete formal proof generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same intermediate-reasoning pattern could be applied to other proof assistants such as Coq or Isabelle without changing the core structure.
The generated reasoning traces offer a reusable source of training data that might improve models on related formal-synthesis benchmarks.
Combining BRIDGE-style prompting with search or reinforcement learning over the intermediate steps could further close the gap to full proofs.
The approach suggests that explicit cross-domain consistency checks are more important than raw model scale for verified code tasks.

Load-bearing premise

The intermediate reasoning steps produce specifications and proofs that remain semantically consistent with the generated code and do not introduce errors that break downstream Lean verification.

What would settle it

Run BRIDGE and direct prompting on the same 178 problems with a new LLM; if the Lean pass rate shows no consistent gain or if many generated proofs fail verification despite correct code execution, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2511.21104 by Anima Anandkumar, Carson Eisenach, Dean Foster, Dominique Perrault-Joncas, Robert Joseph George, Udaya Ghai.

**Figure 2.** Figure 2: Minimum key pushes LeetCode problem. two. This decomposition enables scalable verification by evaluating cross-domain consistency rather than isolated accuracy, allowing models to reason locally within each domain while maintaining global coherence across the pipeline. A summary of these domains and their challenges is shown in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Functional reasoning content from Claude 4. The BRIDGE framework allows for different paradigms – including DbC, Property-Based Testing, Test-Driven Development, and Defensive Programming – to leverage additional views on the same task [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Specification reasoning content from Claude 4. Functional representations, for instance, naturally capture recursion and compositionality; specification-oriented reasoning captures correctness constraints; and theorem-driven reasoning expresses abstract invariants. These are not different languages, but different modes of representation that help the model preserve semantic structure as it transitions from… view at source ↗

**Figure 6.** Figure 6: BRIDGE Experimental Architecture We evaluate BRIDGE on 178 algorithmic problems curated from LEETCODE, spanning arrays, strings, graphs, dynamic programming, numerical algorithms, and trees. Each problem includes natural language documentation and test suites, enabling meaningful evaluation across all three domains: Code (correct Lean4 implementations), Specifications (non-vacuous formal contracts), and … view at source ↗

**Figure 5.** Figure 5: Theorem reasoning content from Claude 4. DEEPSEEK R1, LLAMA-3.1-70B, and QWEN2.5 CODER. All are run with consistent decoding parameters (temperature 0.7, max output tokens 4096). 4.2 Reasoning Strategies and Evaluation Design We systematically vary reasoning strategies to test the semantic bridge hypothesis. For the Code domain, we compare • direct natural language→Lean4 generation • reasoning functionally… view at source ↗

**Figure 7.** Figure 7: Imperative vs Functional Reasoning. Line graphs show each model’s pass@5 performance trajectory across enhancement strategies, with functional paradigms consistently outperforming imperative approaches. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Dual improvement analysis showing that specification reasoning enhances both Python code generation (left) and Lean4 formal verification (right). All models benefit from specification-guided reasoning over direct baseline approaches. In Python code generation, specification reasoning yields clear gains. Claude 4.0 Sonnet improves from 96.1% to 97.9%, while weaker models see larger boosts: DeepSeek-R1 (57.9… view at source ↗

**Figure 9.** Figure 9: Code Embedding Analysis: Generated Python implementations cluster by strategy, with baseline (orange) showing distinct separation from other approaches [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 10.** Figure 10: Proof Generation Analysis: The left panel shows direct proof reasoning performance and reveals a significant bottleneck between Python generation (97-98% success) and Lean translation (9–20% success). 6 Conclusion and Future Work We introduced BRIDGE, a framework that reframes automated formal verification as achieving semantic consistency across three interdependent domains; code, specifications, and pro… view at source ↗

**Figure 11.** Figure 11: Temperature robustness of reasoning paradigms. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

**Figure 12.** Figure 12: Python code success rates across improvement stages. Functional paradigms consistently outperform imperative ones, and incorporating Lean error feedback boosts Python success, though the gap to final verification remains large. Functional paradigms consistently outperform imperative ones across all pipeline stages, maintaining this advantage even with error feedback. Combining Python and Lean error signal… view at source ↗

read the original abstract

Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BRIDGE gives a workable code-first prompting structure for Lean that lifts correctness and fine-tuning results, but the multi-artifact consistency claim rests on an unverified assumption.

read the letter

The main takeaway is that BRIDGE decomposes program synthesis into Code, Specification, and Theorem/Proof domains, anchors on generated code in Lean, and uses intermediate reasoning steps to link them. This produces measurable gains over plain prompting on 178 problems across five models, with up to 1.5x better executable correctness and roughly 2x better sample efficiency at similar lengths. The supervised fine-tuning results are the clearest addition: traces built this way yield nearly 1.5x higher Lean pass rates than code-only fine-tuning, and specification-oriented prompts also help Python pass rates by up to 17.5 points. The framework is straightforward to implement and directly addresses the brittleness of LLMs on verified synthesis tasks. That is useful incremental work in the LLM-assisted formal methods corner of the field. The experiments are empirical and compare against direct prompting baselines, so the basic improvements look plausible on their face. The soft spot is exactly the one the stress-test flags. The paper treats the generated code as a semantic anchor for the downstream specification and theorem, yet reports only aggregate Lean correctness and pass rates. Without an explicit check or ablation that measures whether the spec and theorem actually match the code's semantics, it is possible the gains come mostly from better code generation rather than true cross-domain consistency. If mismatches occur, the verified-synthesis framing weakens even when the code executes. The abstract does not reference such a metric, so this is a point that needs clarification in review rather than a fatal flaw. This paper is for researchers working on prompting and fine-tuning for formal verification and multi-artifact generation. Anyone already running Lean experiments with LLMs will get concrete ideas from the workflow and the fine-tuning results. It has enough empirical grounding and testable claims to deserve a serious referee. I would send it to peer review and ask specifically for details on how they validated alignment across the three domains.

Referee Report

2 major / 2 minor

Summary. The paper introduces BRIDGE, a structured prompting framework for multi-artifact program synthesis that decomposes generation into three interconnected domains (Code, Specification, and Theorem/Proof) using domain-specific intermediate reasoning steps. In a code-first workflow for Lean, the generated implementation serves as a semantic anchor for downstream artifacts. Across 178 algorithmic problems and five LLMs, it claims up to nearly 1.5x improvement in Lean executable correctness over direct prompting, roughly 2x better sample efficiency at comparable lengths, specification-oriented prompting gains of up to 17.5 percentage points in Python, and nearly 1.5x higher Lean pass rates from supervised fine-tuning on BRIDGE-style traces versus code-only fine-tuning.

Significance. If the results hold under scrutiny, the work is significant for scaling LLM-assisted verified synthesis. It provides a practical inference-time framework for handling coupled artifacts and demonstrates that the intermediate representations supply a learnable inductive bias, as evidenced by the SFT experiments. The evaluation scale (178 problems, multiple LLMs) and the distinction between executable correctness and full proof generation are useful contributions.

major comments (2)

[§5] §5 (Experiments) and associated tables: The central claim that BRIDGE produces consistent multi-artifact outputs relies on the domain-specific intermediates reliably linking code semantics to specifications and theorems. However, the reported metrics are limited to aggregate Lean executable correctness and pass rates with no explicit measure, ablation, or consistency check quantifying semantic alignment (e.g., whether generated pre/post-conditions or theorem statements match the actual behavior of the code). This leaves open the possibility that gains are driven by improved code generation alone rather than verified multi-domain consistency.
[§5.3] §5.3 (SFT experiments): The claim that BRIDGE-style reasoning traces provide a superior inductive bias for fine-tuning is load-bearing for the broader contribution, yet the comparison to code-only fine-tuning does not report controls for trace length, data quality filtering, or whether the traces were verified for internal consistency before use in SFT.

minor comments (2)

[§4] The abstract and §4 could more explicitly define what constitutes a 'domain-specific intermediate reasoning step' with a short illustrative example from one of the 178 problems.
Figure captions and axis labels in the efficiency plots would benefit from stating the exact generation length ranges used for the 'comparable generation lengths' comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments identify valuable opportunities to strengthen the presentation of our experimental results and the SFT analysis. We address each major comment below and indicate the revisions we plan to incorporate.

read point-by-point responses

Referee: [§5] §5 (Experiments) and associated tables: The central claim that BRIDGE produces consistent multi-artifact outputs relies on the domain-specific intermediates reliably linking code semantics to specifications and theorems. However, the reported metrics are limited to aggregate Lean executable correctness and pass rates with no explicit measure, ablation, or consistency check quantifying semantic alignment (e.g., whether generated pre/post-conditions or theorem statements match the actual behavior of the code). This leaves open the possibility that gains are driven by improved code generation alone rather than verified multi-domain consistency.

Authors: We agree that an explicit quantification of semantic alignment would provide stronger support for the multi-artifact consistency claim. Our evaluation centers on Lean executable correctness because it offers an objective, automated signal that the generated code satisfies test cases, and the code-first workflow is designed to use this implementation as a semantic anchor for subsequent specification and theorem generation. Nevertheless, we acknowledge that aggregate pass rates alone do not directly measure whether the generated pre/post-conditions or theorem statements are semantically faithful to the code. In the revised manuscript we will add a new subsection with an ablation that generates specifications and theorems both with and without the BRIDGE intermediate steps, then reports the fraction of generated specifications that are satisfied by the code on held-out test cases as well as a manual audit of theorem-statement alignment on a random sample of 30 problems. revision: yes
Referee: [§5.3] §5.3 (SFT experiments): The claim that BRIDGE-style reasoning traces provide a superior inductive bias for fine-tuning is load-bearing for the broader contribution, yet the comparison to code-only fine-tuning does not report controls for trace length, data quality filtering, or whether the traces were verified for internal consistency before use in SFT.

Authors: We appreciate the referee’s emphasis on rigorous controls for the SFT comparison. The BRIDGE traces used for fine-tuning were produced by the same prompting procedure and retained only those that yielded executable code; the code-only baseline was constructed from the same problems and filtered by the same execution criterion. We did not, however, explicitly match or report average trace lengths or perform additional automated consistency checks beyond execution success. In the revision we will add a table reporting average token lengths for the BRIDGE-style and code-only training sets, describe the exact filtering pipeline, and note that internal consistency was enforced via execution verification where feasible. Full manual verification of every trace was not performed due to dataset scale, but the observed performance gap remains informative given the matched problem set and execution filter. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or self-referential reductions

full rationale

The paper describes an empirical prompting framework (BRIDGE) for generating code, specs, and proofs in Lean, evaluated on 178 problems across five LLMs with reported gains in executable correctness and sample efficiency versus direct prompting and code-only fine-tuning. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All central claims rest on experimental comparisons rather than any chain that reduces to its own inputs by construction, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the assumption that LLMs can produce coherent multi-artifact outputs when guided by domain-specific prompts; no free parameters are fitted in the reported results, and the main addition is the prompting structure itself rather than new mathematical entities.

axioms (1)

domain assumption Large language models can generate plausible code and intermediate reasoning traces when given structured prompts.
Invoked throughout the description of the BRIDGE workflow and its application to Lean synthesis.

invented entities (1)

Domain-specific intermediate reasoning no independent evidence
purpose: To connect generated code to specifications and theorem/proof artifacts.
Introduced as the core mechanism of the BRIDGE framework.

pith-pipeline@v0.9.0 · 5555 in / 1393 out tokens · 53877 ms · 2026-05-17T05:19:23.661442+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs and provides a scalable way to generate and interconnect them.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Functional reasoning improves correctness of code in formal languages (LEAN4) by nearly 1.5×

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

[1]

AUSTIN, J., ODENA, A., NYE, M., BOSMA, M., MICHALEWSKI, H., DOHAN, D., JIANG, E., CAI, C., TERRY, M., LE, Q.ET AL. (2021). Program synthesis with large language models.arXiv:2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

E., DELINE, R., JACOBS, B

BARNETT, M., CHANG, B.-Y. E., DELINE, R., JACOBS, B. and LEINO, K. R. M. (2005). Boogie: A modular reusable verifier for object-oriented programs. InFormal Methods for Components and Objects. Springer, 364–387

work page 2005
[3]

and CASTÉRAN, P

BERTOT, Y. and CASTÉRAN, P. (2013).Interactive Theorem Proving and Program Development: Coq’Art: The Calculus of Inductive Constructions. Springer Science & Business Media

work page 2013
[4]

and PASKEVICH, A

BOBOT, F., FILLIÂTRE, J.-C., MARCHÉ, C. and PASKEVICH, A. (2011). Why3: Shepherd your herd of provers. In Boogie 2011: First International Workshop on Intermediate Verification Languages

work page 2011
[5]

E., KUKOVEC, J., SINGH, A., BUTTERLEY, O., BIZID, A., DOUGHERTY, Q., ZHAO, M., TAN, M

BURSUC, S., EHRENBORG, T., LIN, S., ASTEFANOAEI, L., CHIOSA, I. E., KUKOVEC, J., SINGH, A., BUTTERLEY, O., BIZID, A., DOUGHERTY, Q., ZHAO, M., TAN, M. and TEGMARK, M. (2025). A benchmark for vericoding: Formally verified program synthesis.arXiv:2509.22908

work page arXiv 2025
[6]

CHEN, M., TWOREK, J., JUN, H., YUAN, Q.,DEOLIVEIRAPINTO, H. P., KAPLAN, J., EDWARDS, H., BURDA, Y., JOSEPH, N., BROCKMAN, G., RAY, A., PURI, N., KRUEGER, G., PETROV, M., KHLAAF, H., SASTRY, G., MISHKIN, P., CHAN, B., GRAY, S., RYDER, N., PAVLOV, M., POWER, B., KAISER, L., BAVARIAN, M., WINTER, C., TILLET, P., SUCH, F. P., CUMMINGS, A., PLAPPERT, M., CHANT...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

CLARKE, E. M. and WING, J. M. (1996). Formal methods: State of the art and future directions.ACM Computing Surveys 28626–643. 12 BRIDGE: Domain Guided Program VerificationA PREPRINT

work page 1996
[8]

andVONRAUMER, J

DEMOURA, L., KONG, S., AVIGAD, J.,VANDOORN, F. andVONRAUMER, J. (2015). The lean theorem prover (system description). InAutomated Deduction - CADE-25: 25th International Conference on Automated Deduction. Springer

work page 2015
[9]

DIJKSTRA, E. W. (1976).A discipline of programming. Prentice Hall

work page 1976
[10]

and ZHOU, M

FENG, Z., GUO, D., TANG, D., DUAN, N., FENG, X., GONG, M., SHOU, L., QIN, B., LIU, T., JIANG, D. and ZHOU, M. (2020). Codebert: A pre-trained model for programming and natural languages. InFindings of the Association for Computational Linguistics: EMNLP 2020

work page 2020
[11]

FLOYD, R. W. (1967). Assigning meanings to programs. InProceedings of the Symposium on Applied Mathematics, vol. 19. American Mathematical Society, 19–32

work page 1967
[12]

PAL: Program-aided Language Models

GAO, L., MADAAN, A., ZHOU, S., ALON, U., LIU, P., YANG, Y., CALLAN, J. and NEUBIG, G. (2022). Pal: Program- aided language models.arXiv:2211.10435

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

K., CLEMENT, C., DRAIN, D., SUNDARESAN, N., YIN, J., JIANG, D

GUO, D., REN, S., LU, S., FENG, Z., TANG, D., DUAN, N., SVYATKOVSKIY, A., FU, S., TUFANO, M., DENG, S. K., CLEMENT, C., DRAIN, D., SUNDARESAN, N., YIN, J., JIANG, D. and ZHOU, M. (2021). Graphcodebert: Pre-training code representations with data flow. InInternational Conference on Learning Representations (ICLR)

work page 2021
[14]

HOARE, C. A. R. (1969). An axiomatic basis for computer programming.Communications of the ACM12576–580

work page 1969
[15]

HOARE, C. A. R. (1971). Procedures and parameters: An axiomatic approach. InSymposium on Semantics of Algorithmic Languages

work page 1971
[16]

HUDAK, P. (1989). Conception, evolution, and application of functional programming languages.ACM Computing Surveys 21359–411

work page 1989
[17]

KLEIN, G., ELPHINSTONE, K., HEISER, G., ANDRONICK, J., COCK, D., DERRIN, P., ELKADUWE, D., ENGELHARDT, K., KOLANSKI, R., NORRISH, M.ET AL. (2009). sel4: Formal verification of an operating-system kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM

work page 2009
[18]

and HAW- BLITZEL, C

LATTUADA, A., HANCE, T., CHO, C., BRUN, M., SUBASINGHE, I., ZHOU, Y., HOWELL, J., PARNO, B. and HAW- BLITZEL, C. (2023). Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang.7

work page 2023
[19]

LEINO, K. R. M. (2010). Dafny: An automatic program verifier for functional correctness. InInternational Conference on Logic for Programming Artificial Intelligence and Reasoning. Springer

work page 2010
[20]

LEROY, X. (2009). Formal verification of a realistic compiler.Communications of the ACM52107–115

work page 2009
[21]

and WELLECK, S

LOHN, E. and WELLECK, S. (2024). minicodeprops: a minimal benchmark for proving code properties. arXiv:2406.11915

work page arXiv 2024
[22]

LOUGHRIDGE, C., SUN, Q., AHRENBACH, S., CASSANO, F., SUN, C., SHENG, Y., MUDIDE, A., MISU, M. R. H., AMIN, N. and TEGMARK, M. (2024). Dafnybench: A benchmark for formal software verification. arXiv:2406.08467

work page arXiv 2024
[23]

Self-Refine: Iterative Refinement with Self-Feedback

MADAAN, A., TANDON, N., GUPTA, P., HALLINAN, S., GAO, L., WIEGREFFE, S., ALON, U., DZIRI, N., PRABHUMOYE, S., YANG, Y., GUPTA, S., MAJUMDER, B. P., HERMANN, K., WELLECK, S. and CLARK, P. (2023). Self-refine: Iterative refinement with self-feedback.arXiv:2303.17651

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

MATHLIBCOMMUNITY, T. (2020). The lean mathematical library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. ACM, New Orleans, LA, USA

work page 2020
[25]

and KOYEJO, S

MIRANDA, B., ZHOU, Z., NIE, A., OBBAD, E., ANIVA, L., FRONSDAL, K., KIRK, W., SOYLU, D., YU, A., LI, Y. and KOYEJO, S. (2025). Veribench: End-to-end formal verification benchmark for ai code generation in lean 4. In AI4Math@ICML25

work page 2025
[26]

and YAO, S

SHINN, N., CASSANO, F., BERMAN, E., GOPINATH, A., NARASIMHAN, K. and YAO, S. (2023). Reflexion: Language agents with verbal reinforcement learning. InNeurIPS

work page 2023
[27]

and BARRETT, C

SUN, C., SHENG, Y., PADON, O. and BARRETT, C. (2024). Clover: Closed-loop verifiable code generation. arXiv:2310.17807

work page arXiv 2024
[28]

and CHAUDHURI, S

THAKUR, A., LEE, J., TSOUKALAS, G., SISTLA, M., ZHAO, M., ZETZSCHE, S., DURRETT, G., YUE, Y. and CHAUDHURI, S. (2025). Clever: A curated benchmark for formally verified code generation.arXiv:2505.13938

work page arXiv 2025
[29]

VIDELA, A. (2018). Are functional programs easier to verify? URLhttps://semantic-domain.blogspot.com/2018/04/are-functional-programs-easier-to.html

work page 2018
[30]

and ZHANG, T

WANG, R., ZHANG, J., JIA, Y., PAN, R., DIAO, S., PI, R. and ZHANG, T. (2024). Theoremllama: Transforming general-purpose llms into lean4 experts. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA

work page 2024
[31]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

WANG, X., WEI, J., SCHUURMANS, D., LE, Q. V., CHI, E. H., NARANG, S., CHOWDHERY, A. and ZHOU, D. (2022). Self-consistency improves chain of thought reasoning in language models.arXiv:2203.11171. 13 BRIDGE: Domain Guided Program VerificationA PREPRINT

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

WEI, J., WANG, X., SCHUURMANS, D., BOSMA, M., ICHTER, B., XIA, F., CHI, E. H., LE, Q. V. and ZHOU, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.arXiv:2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

and TIAN, C

WEN, C., CAO, J., SU, J., XU, Z., QIN, S., HE, M., LI, H., CHEUNG, S.-C. and TIAN, C. (2024). Enchanting program specification synthesis by large language models using static analysis and program verification.arXiv:2404.00762

work page arXiv 2024
[34]

and YUAN, B

XU, X., LI, X., QU, X., FU, J. and YUAN, B. (2025). Local success does not compose: Benchmarking large language models for compositional formal verification.arXiv:2509.23061

work page arXiv 2025
[35]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

YAO, S., YU, D., ZHAO, J., SHAFRAN, I., GRIFFITHS, T. L., CAO, Y. and NARASIMHAN, K. (2023). Tree of thoughts: Deliberate problem solving with large language models.arXiv:2305.10601

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

and CAO, Y

YAO, S., ZHAO, J., YU, D., SHAFRAN, I., NARASIMHAN, K. and CAO, Y. (2023). React: Synergizing reasoning and acting in language models. InICLR

work page 2023
[37]

arXiv preprint arXiv:2505.23135 , year =

YE, Z., YAN, Z., HE, J., KASRIEL, T., YANG, K. and SONG, D. (2025). Verina: Benchmarking verifiable code generation. arXiv:2505.23135. A Extended Analysis A.1 Literate Programming for Formal Verification Literate programming presents programs as readable mathematical exposition with embedded, executable code. In this section we ask whether large language ...

work page arXiv 2025
[38]

Preprocessing: Input validation and data preparation

work page
[39]

Core Algorithm: Optimized computation with memoization

work page
[40]

Each pathway approaches mathematical property identification from different perspectives to ensure comprehensive coverage

Post-processing: Result validation and verification Complexity: O(n) time, O(1) space for this implementation """ # Step 1: Preprocessing if not data: return 0 # Step 2: Core algorithm with optimization result = sum(x for x in data if isinstance(x, int) and x > 0) # Step 3: Post-processing validation return max(0, result) 22 BRIDGE: Domain Guided Program ...

work page
[41]

**Lean Implementation**: Complete solution to the problem

work page
[42]

intersection_analysis

**Formal Theorems**: Mathematical properties discovered via [SWAP:PATHWAY] # Output Format import Std import Mathlib def solution (input : InputType) : OutputType := -- Implementation guided by [SWAP:PATHWAY] analysis sorry -- [SWAP:THEOREM_TYPE] theorem [SWAP:THEOREM_NAME] (input : InputType) : [SWAP:THEOREM_STATEMENT] := by sorry B.3.1 Concrete Pathway ...

work page

[1] [1]

AUSTIN, J., ODENA, A., NYE, M., BOSMA, M., MICHALEWSKI, H., DOHAN, D., JIANG, E., CAI, C., TERRY, M., LE, Q.ET AL. (2021). Program synthesis with large language models.arXiv:2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

E., DELINE, R., JACOBS, B

BARNETT, M., CHANG, B.-Y. E., DELINE, R., JACOBS, B. and LEINO, K. R. M. (2005). Boogie: A modular reusable verifier for object-oriented programs. InFormal Methods for Components and Objects. Springer, 364–387

work page 2005

[3] [3]

and CASTÉRAN, P

BERTOT, Y. and CASTÉRAN, P. (2013).Interactive Theorem Proving and Program Development: Coq’Art: The Calculus of Inductive Constructions. Springer Science & Business Media

work page 2013

[4] [4]

and PASKEVICH, A

BOBOT, F., FILLIÂTRE, J.-C., MARCHÉ, C. and PASKEVICH, A. (2011). Why3: Shepherd your herd of provers. In Boogie 2011: First International Workshop on Intermediate Verification Languages

work page 2011

[5] [5]

E., KUKOVEC, J., SINGH, A., BUTTERLEY, O., BIZID, A., DOUGHERTY, Q., ZHAO, M., TAN, M

BURSUC, S., EHRENBORG, T., LIN, S., ASTEFANOAEI, L., CHIOSA, I. E., KUKOVEC, J., SINGH, A., BUTTERLEY, O., BIZID, A., DOUGHERTY, Q., ZHAO, M., TAN, M. and TEGMARK, M. (2025). A benchmark for vericoding: Formally verified program synthesis.arXiv:2509.22908

work page arXiv 2025

[6] [6]

CHEN, M., TWOREK, J., JUN, H., YUAN, Q.,DEOLIVEIRAPINTO, H. P., KAPLAN, J., EDWARDS, H., BURDA, Y., JOSEPH, N., BROCKMAN, G., RAY, A., PURI, N., KRUEGER, G., PETROV, M., KHLAAF, H., SASTRY, G., MISHKIN, P., CHAN, B., GRAY, S., RYDER, N., PAVLOV, M., POWER, B., KAISER, L., BAVARIAN, M., WINTER, C., TILLET, P., SUCH, F. P., CUMMINGS, A., PLAPPERT, M., CHANT...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

CLARKE, E. M. and WING, J. M. (1996). Formal methods: State of the art and future directions.ACM Computing Surveys 28626–643. 12 BRIDGE: Domain Guided Program VerificationA PREPRINT

work page 1996

[8] [8]

andVONRAUMER, J

DEMOURA, L., KONG, S., AVIGAD, J.,VANDOORN, F. andVONRAUMER, J. (2015). The lean theorem prover (system description). InAutomated Deduction - CADE-25: 25th International Conference on Automated Deduction. Springer

work page 2015

[9] [9]

DIJKSTRA, E. W. (1976).A discipline of programming. Prentice Hall

work page 1976

[10] [10]

and ZHOU, M

FENG, Z., GUO, D., TANG, D., DUAN, N., FENG, X., GONG, M., SHOU, L., QIN, B., LIU, T., JIANG, D. and ZHOU, M. (2020). Codebert: A pre-trained model for programming and natural languages. InFindings of the Association for Computational Linguistics: EMNLP 2020

work page 2020

[11] [11]

FLOYD, R. W. (1967). Assigning meanings to programs. InProceedings of the Symposium on Applied Mathematics, vol. 19. American Mathematical Society, 19–32

work page 1967

[12] [12]

PAL: Program-aided Language Models

GAO, L., MADAAN, A., ZHOU, S., ALON, U., LIU, P., YANG, Y., CALLAN, J. and NEUBIG, G. (2022). Pal: Program- aided language models.arXiv:2211.10435

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

K., CLEMENT, C., DRAIN, D., SUNDARESAN, N., YIN, J., JIANG, D

GUO, D., REN, S., LU, S., FENG, Z., TANG, D., DUAN, N., SVYATKOVSKIY, A., FU, S., TUFANO, M., DENG, S. K., CLEMENT, C., DRAIN, D., SUNDARESAN, N., YIN, J., JIANG, D. and ZHOU, M. (2021). Graphcodebert: Pre-training code representations with data flow. InInternational Conference on Learning Representations (ICLR)

work page 2021

[14] [14]

HOARE, C. A. R. (1969). An axiomatic basis for computer programming.Communications of the ACM12576–580

work page 1969

[15] [15]

HOARE, C. A. R. (1971). Procedures and parameters: An axiomatic approach. InSymposium on Semantics of Algorithmic Languages

work page 1971

[16] [16]

HUDAK, P. (1989). Conception, evolution, and application of functional programming languages.ACM Computing Surveys 21359–411

work page 1989

[17] [17]

KLEIN, G., ELPHINSTONE, K., HEISER, G., ANDRONICK, J., COCK, D., DERRIN, P., ELKADUWE, D., ENGELHARDT, K., KOLANSKI, R., NORRISH, M.ET AL. (2009). sel4: Formal verification of an operating-system kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM

work page 2009

[18] [18]

and HAW- BLITZEL, C

LATTUADA, A., HANCE, T., CHO, C., BRUN, M., SUBASINGHE, I., ZHOU, Y., HOWELL, J., PARNO, B. and HAW- BLITZEL, C. (2023). Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang.7

work page 2023

[19] [19]

LEINO, K. R. M. (2010). Dafny: An automatic program verifier for functional correctness. InInternational Conference on Logic for Programming Artificial Intelligence and Reasoning. Springer

work page 2010

[20] [20]

LEROY, X. (2009). Formal verification of a realistic compiler.Communications of the ACM52107–115

work page 2009

[21] [21]

and WELLECK, S

LOHN, E. and WELLECK, S. (2024). minicodeprops: a minimal benchmark for proving code properties. arXiv:2406.11915

work page arXiv 2024

[22] [22]

LOUGHRIDGE, C., SUN, Q., AHRENBACH, S., CASSANO, F., SUN, C., SHENG, Y., MUDIDE, A., MISU, M. R. H., AMIN, N. and TEGMARK, M. (2024). Dafnybench: A benchmark for formal software verification. arXiv:2406.08467

work page arXiv 2024

[23] [23]

Self-Refine: Iterative Refinement with Self-Feedback

MADAAN, A., TANDON, N., GUPTA, P., HALLINAN, S., GAO, L., WIEGREFFE, S., ALON, U., DZIRI, N., PRABHUMOYE, S., YANG, Y., GUPTA, S., MAJUMDER, B. P., HERMANN, K., WELLECK, S. and CLARK, P. (2023). Self-refine: Iterative refinement with self-feedback.arXiv:2303.17651

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

MATHLIBCOMMUNITY, T. (2020). The lean mathematical library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. ACM, New Orleans, LA, USA

work page 2020

[25] [25]

and KOYEJO, S

MIRANDA, B., ZHOU, Z., NIE, A., OBBAD, E., ANIVA, L., FRONSDAL, K., KIRK, W., SOYLU, D., YU, A., LI, Y. and KOYEJO, S. (2025). Veribench: End-to-end formal verification benchmark for ai code generation in lean 4. In AI4Math@ICML25

work page 2025

[26] [26]

and YAO, S

SHINN, N., CASSANO, F., BERMAN, E., GOPINATH, A., NARASIMHAN, K. and YAO, S. (2023). Reflexion: Language agents with verbal reinforcement learning. InNeurIPS

work page 2023

[27] [27]

and BARRETT, C

SUN, C., SHENG, Y., PADON, O. and BARRETT, C. (2024). Clover: Closed-loop verifiable code generation. arXiv:2310.17807

work page arXiv 2024

[28] [28]

and CHAUDHURI, S

THAKUR, A., LEE, J., TSOUKALAS, G., SISTLA, M., ZHAO, M., ZETZSCHE, S., DURRETT, G., YUE, Y. and CHAUDHURI, S. (2025). Clever: A curated benchmark for formally verified code generation.arXiv:2505.13938

work page arXiv 2025

[29] [29]

VIDELA, A. (2018). Are functional programs easier to verify? URLhttps://semantic-domain.blogspot.com/2018/04/are-functional-programs-easier-to.html

work page 2018

[30] [30]

and ZHANG, T

WANG, R., ZHANG, J., JIA, Y., PAN, R., DIAO, S., PI, R. and ZHANG, T. (2024). Theoremllama: Transforming general-purpose llms into lean4 experts. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA

work page 2024

[31] [31]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

WANG, X., WEI, J., SCHUURMANS, D., LE, Q. V., CHI, E. H., NARANG, S., CHOWDHERY, A. and ZHOU, D. (2022). Self-consistency improves chain of thought reasoning in language models.arXiv:2203.11171. 13 BRIDGE: Domain Guided Program VerificationA PREPRINT

work page internal anchor Pith review Pith/arXiv arXiv 2022

[32] [32]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

WEI, J., WANG, X., SCHUURMANS, D., BOSMA, M., ICHTER, B., XIA, F., CHI, E. H., LE, Q. V. and ZHOU, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.arXiv:2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33] [33]

and TIAN, C

WEN, C., CAO, J., SU, J., XU, Z., QIN, S., HE, M., LI, H., CHEUNG, S.-C. and TIAN, C. (2024). Enchanting program specification synthesis by large language models using static analysis and program verification.arXiv:2404.00762

work page arXiv 2024

[34] [34]

and YUAN, B

XU, X., LI, X., QU, X., FU, J. and YUAN, B. (2025). Local success does not compose: Benchmarking large language models for compositional formal verification.arXiv:2509.23061

work page arXiv 2025

[35] [35]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

YAO, S., YU, D., ZHAO, J., SHAFRAN, I., GRIFFITHS, T. L., CAO, Y. and NARASIMHAN, K. (2023). Tree of thoughts: Deliberate problem solving with large language models.arXiv:2305.10601

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

and CAO, Y

YAO, S., ZHAO, J., YU, D., SHAFRAN, I., NARASIMHAN, K. and CAO, Y. (2023). React: Synergizing reasoning and acting in language models. InICLR

work page 2023

[37] [37]

arXiv preprint arXiv:2505.23135 , year =

YE, Z., YAN, Z., HE, J., KASRIEL, T., YANG, K. and SONG, D. (2025). Verina: Benchmarking verifiable code generation. arXiv:2505.23135. A Extended Analysis A.1 Literate Programming for Formal Verification Literate programming presents programs as readable mathematical exposition with embedded, executable code. In this section we ask whether large language ...

work page arXiv 2025

[38] [38]

Preprocessing: Input validation and data preparation

work page

[39] [39]

Core Algorithm: Optimized computation with memoization

work page

[40] [40]

Each pathway approaches mathematical property identification from different perspectives to ensure comprehensive coverage

Post-processing: Result validation and verification Complexity: O(n) time, O(1) space for this implementation """ # Step 1: Preprocessing if not data: return 0 # Step 2: Core algorithm with optimization result = sum(x for x in data if isinstance(x, int) and x > 0) # Step 3: Post-processing validation return max(0, result) 22 BRIDGE: Domain Guided Program ...

work page

[41] [41]

**Lean Implementation**: Complete solution to the problem

work page

[42] [42]

intersection_analysis

**Formal Theorems**: Mathematical properties discovered via [SWAP:PATHWAY] # Output Format import Std import Mathlib def solution (input : InputType) : OutputType := -- Implementation guided by [SWAP:PATHWAY] analysis sorry -- [SWAP:THEOREM_TYPE] theorem [SWAP:THEOREM_NAME] (input : InputType) : [SWAP:THEOREM_STATEMENT] := by sorry B.3.1 Concrete Pathway ...

work page