pith. sign in

arxiv: 2511.21104 · v4 · submitted 2025-11-26 · 💻 cs.LG · cs.PL

BRIDGE: Building Representations In Domain Guided Program Synthesis

Pith reviewed 2026-05-17 05:19 UTC · model grok-4.3

classification 💻 cs.LG cs.PL
keywords program synthesislarge language modelsformal verificationLeanpromptingmulti-domain reasoningcode generationtheorem proving
0
0 comments X

The pith

BRIDGE prompting connects generated code to formal specifications and proofs via domain-specific reasoning steps in Lean.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often produce code that cannot be reliably verified in proof assistants like Lean because the outputs for code, specifications, theorems, and proofs drift from each other. BRIDGE addresses this by breaking synthesis into three linked domains—Code, Specification, and Theorem/Proof—and inserting targeted intermediate reasoning at each stage to keep the artifacts consistent. It typically anchors the process in the generated code and uses that as a reference for the other domains. On 178 algorithmic problems across five LLMs, this raises Lean executable correctness by up to 1.5 times compared with direct prompting while using roughly half as many samples for similar results. Fine-tuning models on the intermediate traces further improves Lean success rates by about 1.5 times over training on code alone, and the same style of prompting also raises Python pass rates when focused on specifications.

Core claim

BRIDGE is a structured prompting framework that decomposes program synthesis into three interconnected domains of Code, Specification, and Theorem/Proof, inserting domain-specific intermediate reasoning steps to maintain consistency across the artifacts; in Lean it follows a code-first workflow that treats the generated implementation as a semantic anchor for producing the specification, theorem statement, and proof attempt.

What carries the argument

Domain-specific intermediate reasoning steps that link the Code, Specification, and Theorem/Proof domains while using the generated code as the central semantic reference.

If this is right

  • Lean executable correctness rises by up to nearly 1.5 times relative to direct prompting on the same problems.
  • Sample efficiency roughly doubles at comparable output lengths.
  • Specification-focused prompting raises Python pass rates by as much as 17.5 percentage points.
  • Supervised fine-tuning on BRIDGE reasoning traces produces nearly 1.5 times higher Lean pass rates than fine-tuning on code alone.
  • A measurable gap remains between executable correctness and complete formal proof generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intermediate-reasoning pattern could be applied to other proof assistants such as Coq or Isabelle without changing the core structure.
  • The generated reasoning traces offer a reusable source of training data that might improve models on related formal-synthesis benchmarks.
  • Combining BRIDGE-style prompting with search or reinforcement learning over the intermediate steps could further close the gap to full proofs.
  • The approach suggests that explicit cross-domain consistency checks are more important than raw model scale for verified code tasks.

Load-bearing premise

The intermediate reasoning steps produce specifications and proofs that remain semantically consistent with the generated code and do not introduce errors that break downstream Lean verification.

What would settle it

Run BRIDGE and direct prompting on the same 178 problems with a new LLM; if the Lean pass rate shows no consistent gain or if many generated proofs fail verification despite correct code execution, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2511.21104 by Anima Anandkumar, Carson Eisenach, Dean Foster, Dominique Perrault-Joncas, Robert Joseph George, Udaya Ghai.

Figure 1
Figure 1. Figure 1: Functional reasoning improves verification accuracy while reducing inference-time compute. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Minimum key pushes LeetCode problem. two. This decomposition enables scalable verification by evaluating cross-domain consistency rather than isolated accuracy, allowing models to reason locally within each domain while maintaining global coherence across the pipeline. A summary of these domains and their challenges is shown in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Functional reasoning content from Claude 4. The BRIDGE framework allows for different paradigms – including DbC, Property-Based Testing, Test-Driven Development, and Defensive Programming – to leverage additional views on the same task [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Specification reasoning content from Claude 4. Functional representations, for instance, naturally capture recursion and compositionality; specification-oriented reasoning captures correctness constraints; and theorem-driven reasoning expresses abstract invariants. These are not different languages, but different modes of representation that help the model preserve semantic structure as it transitions from… view at source ↗
Figure 6
Figure 6. Figure 6: BRIDGE Experimental Architecture We evaluate BRIDGE on 178 algorithmic problems curated from LEET￾CODE, spanning arrays, strings, graphs, dynamic programming, nu￾merical algorithms, and trees. Each problem includes natural language documentation and test suites, enabling meaningful evaluation across all three domains: Code (correct Lean4 implementations), Specifications (non-vacuous formal contracts), and … view at source ↗
Figure 5
Figure 5. Figure 5: Theorem reasoning content from Claude 4. DEEPSEEK R1, LLAMA-3.1-70B, and QWEN2.5 CODER. All are run with consistent decoding parameters (temperature 0.7, max output tokens 4096). 4.2 Reasoning Strategies and Evaluation Design We systematically vary reasoning strategies to test the semantic bridge hypothesis. For the Code domain, we compare • direct natural language→Lean4 generation • reasoning functionally… view at source ↗
Figure 7
Figure 7. Figure 7: Imperative vs Functional Reasoning. Line graphs show each model’s pass@5 performance trajectory across enhancement strategies, with functional paradigms consistently outperforming imperative approaches. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Dual improvement analysis showing that specification reasoning enhances both Python code generation (left) and Lean4 formal verification (right). All models benefit from specification-guided reasoning over direct baseline approaches. In Python code generation, specification reasoning yields clear gains. Claude 4.0 Sonnet improves from 96.1% to 97.9%, while weaker models see larger boosts: DeepSeek-R1 (57.9… view at source ↗
Figure 9
Figure 9. Figure 9: Code Embedding Analysis: Generated Python implementations cluster by strategy, with baseline (orange) showing distinct separation from other approaches [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Proof Generation Analysis: The left panel shows direct proof reasoning performance and reveals a significant bottleneck between Python generation (97-98% success) and Lean translation (9–20% success). 6 Conclusion and Future Work We introduced BRIDGE, a framework that reframes automated formal verification as achieving semantic consistency across three interdependent domains; code, specifications, and pro… view at source ↗
Figure 11
Figure 11. Figure 11: Temperature robustness of reasoning paradigms. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Python code success rates across improvement stages. Functional paradigms consistently outperform imperative ones, and incorporating Lean error feedback boosts Python success, though the gap to final verification remains large. Functional paradigms consistently outperform imperative ones across all pipeline stages, maintaining this advantage even with error feedback. Combining Python and Lean error signal… view at source ↗
read the original abstract

Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BRIDGE, a structured prompting framework for multi-artifact program synthesis that decomposes generation into three interconnected domains (Code, Specification, and Theorem/Proof) using domain-specific intermediate reasoning steps. In a code-first workflow for Lean, the generated implementation serves as a semantic anchor for downstream artifacts. Across 178 algorithmic problems and five LLMs, it claims up to nearly 1.5x improvement in Lean executable correctness over direct prompting, roughly 2x better sample efficiency at comparable lengths, specification-oriented prompting gains of up to 17.5 percentage points in Python, and nearly 1.5x higher Lean pass rates from supervised fine-tuning on BRIDGE-style traces versus code-only fine-tuning.

Significance. If the results hold under scrutiny, the work is significant for scaling LLM-assisted verified synthesis. It provides a practical inference-time framework for handling coupled artifacts and demonstrates that the intermediate representations supply a learnable inductive bias, as evidenced by the SFT experiments. The evaluation scale (178 problems, multiple LLMs) and the distinction between executable correctness and full proof generation are useful contributions.

major comments (2)
  1. [§5] §5 (Experiments) and associated tables: The central claim that BRIDGE produces consistent multi-artifact outputs relies on the domain-specific intermediates reliably linking code semantics to specifications and theorems. However, the reported metrics are limited to aggregate Lean executable correctness and pass rates with no explicit measure, ablation, or consistency check quantifying semantic alignment (e.g., whether generated pre/post-conditions or theorem statements match the actual behavior of the code). This leaves open the possibility that gains are driven by improved code generation alone rather than verified multi-domain consistency.
  2. [§5.3] §5.3 (SFT experiments): The claim that BRIDGE-style reasoning traces provide a superior inductive bias for fine-tuning is load-bearing for the broader contribution, yet the comparison to code-only fine-tuning does not report controls for trace length, data quality filtering, or whether the traces were verified for internal consistency before use in SFT.
minor comments (2)
  1. [§4] The abstract and §4 could more explicitly define what constitutes a 'domain-specific intermediate reasoning step' with a short illustrative example from one of the 178 problems.
  2. Figure captions and axis labels in the efficiency plots would benefit from stating the exact generation length ranges used for the 'comparable generation lengths' comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments identify valuable opportunities to strengthen the presentation of our experimental results and the SFT analysis. We address each major comment below and indicate the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [§5] §5 (Experiments) and associated tables: The central claim that BRIDGE produces consistent multi-artifact outputs relies on the domain-specific intermediates reliably linking code semantics to specifications and theorems. However, the reported metrics are limited to aggregate Lean executable correctness and pass rates with no explicit measure, ablation, or consistency check quantifying semantic alignment (e.g., whether generated pre/post-conditions or theorem statements match the actual behavior of the code). This leaves open the possibility that gains are driven by improved code generation alone rather than verified multi-domain consistency.

    Authors: We agree that an explicit quantification of semantic alignment would provide stronger support for the multi-artifact consistency claim. Our evaluation centers on Lean executable correctness because it offers an objective, automated signal that the generated code satisfies test cases, and the code-first workflow is designed to use this implementation as a semantic anchor for subsequent specification and theorem generation. Nevertheless, we acknowledge that aggregate pass rates alone do not directly measure whether the generated pre/post-conditions or theorem statements are semantically faithful to the code. In the revised manuscript we will add a new subsection with an ablation that generates specifications and theorems both with and without the BRIDGE intermediate steps, then reports the fraction of generated specifications that are satisfied by the code on held-out test cases as well as a manual audit of theorem-statement alignment on a random sample of 30 problems. revision: yes

  2. Referee: [§5.3] §5.3 (SFT experiments): The claim that BRIDGE-style reasoning traces provide a superior inductive bias for fine-tuning is load-bearing for the broader contribution, yet the comparison to code-only fine-tuning does not report controls for trace length, data quality filtering, or whether the traces were verified for internal consistency before use in SFT.

    Authors: We appreciate the referee’s emphasis on rigorous controls for the SFT comparison. The BRIDGE traces used for fine-tuning were produced by the same prompting procedure and retained only those that yielded executable code; the code-only baseline was constructed from the same problems and filtered by the same execution criterion. We did not, however, explicitly match or report average trace lengths or perform additional automated consistency checks beyond execution success. In the revision we will add a table reporting average token lengths for the BRIDGE-style and code-only training sets, describe the exact filtering pipeline, and note that internal consistency was enforced via execution verification where feasible. Full manual verification of every trace was not performed due to dataset scale, but the observed performance gap remains informative given the matched problem set and execution filter. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or self-referential reductions

full rationale

The paper describes an empirical prompting framework (BRIDGE) for generating code, specs, and proofs in Lean, evaluated on 178 problems across five LLMs with reported gains in executable correctness and sample efficiency versus direct prompting and code-only fine-tuning. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All central claims rest on experimental comparisons rather than any chain that reduces to its own inputs by construction, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the assumption that LLMs can produce coherent multi-artifact outputs when guided by domain-specific prompts; no free parameters are fitted in the reported results, and the main addition is the prompting structure itself rather than new mathematical entities.

axioms (1)
  • domain assumption Large language models can generate plausible code and intermediate reasoning traces when given structured prompts.
    Invoked throughout the description of the BRIDGE workflow and its application to Lean synthesis.
invented entities (1)
  • Domain-specific intermediate reasoning no independent evidence
    purpose: To connect generated code to specifications and theorem/proof artifacts.
    Introduced as the core mechanism of the BRIDGE framework.

pith-pipeline@v0.9.0 · 5555 in / 1393 out tokens · 53877 ms · 2026-05-17T05:19:23.661442+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    AUSTIN, J., ODENA, A., NYE, M., BOSMA, M., MICHALEWSKI, H., DOHAN, D., JIANG, E., CAI, C., TERRY, M., LE, Q.ET AL. (2021). Program synthesis with large language models.arXiv:2108.07732

  2. [2]

    E., DELINE, R., JACOBS, B

    BARNETT, M., CHANG, B.-Y. E., DELINE, R., JACOBS, B. and LEINO, K. R. M. (2005). Boogie: A modular reusable verifier for object-oriented programs. InFormal Methods for Components and Objects. Springer, 364–387

  3. [3]

    and CASTÉRAN, P

    BERTOT, Y. and CASTÉRAN, P. (2013).Interactive Theorem Proving and Program Development: Coq’Art: The Calculus of Inductive Constructions. Springer Science & Business Media

  4. [4]

    and PASKEVICH, A

    BOBOT, F., FILLIÂTRE, J.-C., MARCHÉ, C. and PASKEVICH, A. (2011). Why3: Shepherd your herd of provers. In Boogie 2011: First International Workshop on Intermediate Verification Languages

  5. [5]

    E., KUKOVEC, J., SINGH, A., BUTTERLEY, O., BIZID, A., DOUGHERTY, Q., ZHAO, M., TAN, M

    BURSUC, S., EHRENBORG, T., LIN, S., ASTEFANOAEI, L., CHIOSA, I. E., KUKOVEC, J., SINGH, A., BUTTERLEY, O., BIZID, A., DOUGHERTY, Q., ZHAO, M., TAN, M. and TEGMARK, M. (2025). A benchmark for vericoding: Formally verified program synthesis.arXiv:2509.22908

  6. [6]

    CHEN, M., TWOREK, J., JUN, H., YUAN, Q.,DEOLIVEIRAPINTO, H. P., KAPLAN, J., EDWARDS, H., BURDA, Y., JOSEPH, N., BROCKMAN, G., RAY, A., PURI, N., KRUEGER, G., PETROV, M., KHLAAF, H., SASTRY, G., MISHKIN, P., CHAN, B., GRAY, S., RYDER, N., PAVLOV, M., POWER, B., KAISER, L., BAVARIAN, M., WINTER, C., TILLET, P., SUCH, F. P., CUMMINGS, A., PLAPPERT, M., CHANT...

  7. [7]

    CLARKE, E. M. and WING, J. M. (1996). Formal methods: State of the art and future directions.ACM Computing Surveys 28626–643. 12 BRIDGE: Domain Guided Program VerificationA PREPRINT

  8. [8]

    andVONRAUMER, J

    DEMOURA, L., KONG, S., AVIGAD, J.,VANDOORN, F. andVONRAUMER, J. (2015). The lean theorem prover (system description). InAutomated Deduction - CADE-25: 25th International Conference on Automated Deduction. Springer

  9. [9]

    DIJKSTRA, E. W. (1976).A discipline of programming. Prentice Hall

  10. [10]

    and ZHOU, M

    FENG, Z., GUO, D., TANG, D., DUAN, N., FENG, X., GONG, M., SHOU, L., QIN, B., LIU, T., JIANG, D. and ZHOU, M. (2020). Codebert: A pre-trained model for programming and natural languages. InFindings of the Association for Computational Linguistics: EMNLP 2020

  11. [11]

    FLOYD, R. W. (1967). Assigning meanings to programs. InProceedings of the Symposium on Applied Mathematics, vol. 19. American Mathematical Society, 19–32

  12. [12]

    PAL: Program-aided Language Models

    GAO, L., MADAAN, A., ZHOU, S., ALON, U., LIU, P., YANG, Y., CALLAN, J. and NEUBIG, G. (2022). Pal: Program- aided language models.arXiv:2211.10435

  13. [13]

    K., CLEMENT, C., DRAIN, D., SUNDARESAN, N., YIN, J., JIANG, D

    GUO, D., REN, S., LU, S., FENG, Z., TANG, D., DUAN, N., SVYATKOVSKIY, A., FU, S., TUFANO, M., DENG, S. K., CLEMENT, C., DRAIN, D., SUNDARESAN, N., YIN, J., JIANG, D. and ZHOU, M. (2021). Graphcodebert: Pre-training code representations with data flow. InInternational Conference on Learning Representations (ICLR)

  14. [14]

    HOARE, C. A. R. (1969). An axiomatic basis for computer programming.Communications of the ACM12576–580

  15. [15]

    HOARE, C. A. R. (1971). Procedures and parameters: An axiomatic approach. InSymposium on Semantics of Algorithmic Languages

  16. [16]

    HUDAK, P. (1989). Conception, evolution, and application of functional programming languages.ACM Computing Surveys 21359–411

  17. [17]

    KLEIN, G., ELPHINSTONE, K., HEISER, G., ANDRONICK, J., COCK, D., DERRIN, P., ELKADUWE, D., ENGELHARDT, K., KOLANSKI, R., NORRISH, M.ET AL. (2009). sel4: Formal verification of an operating-system kernel. InProceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM

  18. [18]

    and HAW- BLITZEL, C

    LATTUADA, A., HANCE, T., CHO, C., BRUN, M., SUBASINGHE, I., ZHOU, Y., HOWELL, J., PARNO, B. and HAW- BLITZEL, C. (2023). Verus: Verifying rust programs using linear ghost types.Proc. ACM Program. Lang.7

  19. [19]

    LEINO, K. R. M. (2010). Dafny: An automatic program verifier for functional correctness. InInternational Conference on Logic for Programming Artificial Intelligence and Reasoning. Springer

  20. [20]

    LEROY, X. (2009). Formal verification of a realistic compiler.Communications of the ACM52107–115

  21. [21]

    and WELLECK, S

    LOHN, E. and WELLECK, S. (2024). minicodeprops: a minimal benchmark for proving code properties. arXiv:2406.11915

  22. [22]

    LOUGHRIDGE, C., SUN, Q., AHRENBACH, S., CASSANO, F., SUN, C., SHENG, Y., MUDIDE, A., MISU, M. R. H., AMIN, N. and TEGMARK, M. (2024). Dafnybench: A benchmark for formal software verification. arXiv:2406.08467

  23. [23]

    Self-Refine: Iterative Refinement with Self-Feedback

    MADAAN, A., TANDON, N., GUPTA, P., HALLINAN, S., GAO, L., WIEGREFFE, S., ALON, U., DZIRI, N., PRABHUMOYE, S., YANG, Y., GUPTA, S., MAJUMDER, B. P., HERMANN, K., WELLECK, S. and CLARK, P. (2023). Self-refine: Iterative refinement with self-feedback.arXiv:2303.17651

  24. [24]

    MATHLIBCOMMUNITY, T. (2020). The lean mathematical library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. ACM, New Orleans, LA, USA

  25. [25]

    and KOYEJO, S

    MIRANDA, B., ZHOU, Z., NIE, A., OBBAD, E., ANIVA, L., FRONSDAL, K., KIRK, W., SOYLU, D., YU, A., LI, Y. and KOYEJO, S. (2025). Veribench: End-to-end formal verification benchmark for ai code generation in lean 4. In AI4Math@ICML25

  26. [26]

    and YAO, S

    SHINN, N., CASSANO, F., BERMAN, E., GOPINATH, A., NARASIMHAN, K. and YAO, S. (2023). Reflexion: Language agents with verbal reinforcement learning. InNeurIPS

  27. [27]

    and BARRETT, C

    SUN, C., SHENG, Y., PADON, O. and BARRETT, C. (2024). Clover: Closed-loop verifiable code generation. arXiv:2310.17807

  28. [28]

    and CHAUDHURI, S

    THAKUR, A., LEE, J., TSOUKALAS, G., SISTLA, M., ZHAO, M., ZETZSCHE, S., DURRETT, G., YUE, Y. and CHAUDHURI, S. (2025). Clever: A curated benchmark for formally verified code generation.arXiv:2505.13938

  29. [29]

    VIDELA, A. (2018). Are functional programs easier to verify? URLhttps://semantic-domain.blogspot.com/2018/04/are-functional-programs-easier-to.html

  30. [30]

    and ZHANG, T

    WANG, R., ZHANG, J., JIA, Y., PAN, R., DIAO, S., PI, R. and ZHANG, T. (2024). Theoremllama: Transforming general-purpose llms into lean4 experts. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA

  31. [31]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    WANG, X., WEI, J., SCHUURMANS, D., LE, Q. V., CHI, E. H., NARANG, S., CHOWDHERY, A. and ZHOU, D. (2022). Self-consistency improves chain of thought reasoning in language models.arXiv:2203.11171. 13 BRIDGE: Domain Guided Program VerificationA PREPRINT

  32. [32]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    WEI, J., WANG, X., SCHUURMANS, D., BOSMA, M., ICHTER, B., XIA, F., CHI, E. H., LE, Q. V. and ZHOU, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.arXiv:2201.11903

  33. [33]

    and TIAN, C

    WEN, C., CAO, J., SU, J., XU, Z., QIN, S., HE, M., LI, H., CHEUNG, S.-C. and TIAN, C. (2024). Enchanting program specification synthesis by large language models using static analysis and program verification.arXiv:2404.00762

  34. [34]

    and YUAN, B

    XU, X., LI, X., QU, X., FU, J. and YUAN, B. (2025). Local success does not compose: Benchmarking large language models for compositional formal verification.arXiv:2509.23061

  35. [35]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    YAO, S., YU, D., ZHAO, J., SHAFRAN, I., GRIFFITHS, T. L., CAO, Y. and NARASIMHAN, K. (2023). Tree of thoughts: Deliberate problem solving with large language models.arXiv:2305.10601

  36. [36]

    and CAO, Y

    YAO, S., ZHAO, J., YU, D., SHAFRAN, I., NARASIMHAN, K. and CAO, Y. (2023). React: Synergizing reasoning and acting in language models. InICLR

  37. [37]

    arXiv preprint arXiv:2505.23135 , year =

    YE, Z., YAN, Z., HE, J., KASRIEL, T., YANG, K. and SONG, D. (2025). Verina: Benchmarking verifiable code generation. arXiv:2505.23135. A Extended Analysis A.1 Literate Programming for Formal Verification Literate programming presents programs as readable mathematical exposition with embedded, executable code. In this section we ask whether large language ...

  38. [38]

    Preprocessing: Input validation and data preparation

  39. [39]

    Core Algorithm: Optimized computation with memoization

  40. [40]

    Each pathway approaches mathematical property identification from different perspectives to ensure comprehensive coverage

    Post-processing: Result validation and verification Complexity: O(n) time, O(1) space for this implementation """ # Step 1: Preprocessing if not data: return 0 # Step 2: Core algorithm with optimization result = sum(x for x in data if isinstance(x, int) and x > 0) # Step 3: Post-processing validation return max(0, result) 22 BRIDGE: Domain Guided Program ...

  41. [41]

    **Lean Implementation**: Complete solution to the problem

  42. [42]

    intersection_analysis

    **Formal Theorems**: Mathematical properties discovered via [SWAP:PATHWAY] # Output Format import Std import Mathlib def solution (input : InputType) : OutputType := -- Implementation guided by [SWAP:PATHWAY] analysis sorry -- [SWAP:THEOREM_TYPE] theorem [SWAP:THEOREM_NAME] (input : InputType) : [SWAP:THEOREM_STATEMENT] := by sorry B.3.1 Concrete Pathway ...