pith. sign in

arxiv: 2606.26525 · v1 · pith:RXFAMV2Unew · submitted 2026-06-25 · 💻 cs.LG · cs.LO· cs.PL

Theory-Scale Auto-Formalization of Logics for Computer Science

Pith reviewed 2026-06-26 05:25 UTC · model grok-4.3

classification 💻 cs.LG cs.LOcs.PL
keywords auto-formalizationLeanbenchmarktheory-scalelogicsagentic pipelineformal verificationdefinitional equivalence
0
0 comments X

The pith

A theory-scale benchmark reveals that auto-formalization of interdependent logic statements succeeds at only 20.1 percent for current models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper constructs a large benchmark consisting of hundreds of interdependent definitions, lemmas, and theorems from a computer science logic textbook translated into the Lean proof assistant. The construction uses an agentic pipeline to ensure the formalizations remain consistent with each other and faithful to the source material. Evaluating multiple models on this benchmark shows low success rates, indicating that scaling auto-formalization beyond isolated statements is still difficult. The work also provides new evaluation methods using definitional equivalence to assess the translations more precisely.

Core claim

The paper establishes that a semi-automated pipeline can generate a consistent formalization of 327 textbook items comprising over 4,000 Lean declarations, and that this reveals current models achieve a maximum of 20.1% success on the auto-formalization tasks.

What carries the argument

semi-automated agentic pipeline leveraging concept graphs, formal signature planning, issue tracking, sorry-filling with counter-example search, and human faithfulness review

If this is right

  • The dataset automatically derives five tracks of evaluation benchmarks for different aspects of auto-formalization and theorem proving.
  • Current models struggle specifically with maintaining coherence across interdependent statements rather than isolated ones.
  • The definitional equivalence checkers enable finer-grained assessment than previous protocols.
  • Analysis of the results points to specific bottlenecks in theory-scale formalization.
  • The benchmark supports ongoing testing of new methods for auto-formalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pipeline generalizes, it could support formalization of other large mathematical texts without manual intervention for each item.
  • Low performance on interdependent items suggests future models may need explicit mechanisms for tracking dependencies between statements.
  • Success at this scale could reduce the manual effort required to build verified libraries for computer science applications.
  • The approach might extend to domains beyond logics, such as formalizing parts of software specifications.

Load-bearing premise

The pipeline's combination of automated steps and human review produces translations that are consistent across all items and faithful to the original text without introducing errors.

What would settle it

Discovery of a systematic inconsistency across the generated Lean declarations or a model achieving over 50 percent success on the full set of interdependent items would challenge the quality and difficulty claims.

Figures

Figures reproduced from arXiv: 2606.26525 by Frederick Pu, Jiani Huang, Li Zhang, One An, Osbert Bastani, Xujie Si, Yuming Feng, Ziyang Li.

Figure 1
Figure 1. Figure 1: An overview of LCS-Bench. Tracks of datasets are generated from the Lean repository [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustrative examples in the textbook about Gentzen system [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the LCS-Bench formalization pipeline, which consists of four stages from [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A concrete example of human expert intervention. The [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: IAF track statistics. We show the number of textbook items by chapter, item type, and by [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: #ITP items by GT output LoC. CO46 GPT54 CS46 GLM5 CH45 DS32 K25 D2 L4 Q3C CS46 CO46 GPT54 GPTOSS K25 MM25 GLM5 DSR1 Q32 CO46 CS46 GPT54 K25 GLM5 MM25 Q32 GPTOSS 0 5 10 15 20 25 30 16.8 15.4 15 13.9 11.1 9.5 9 8.4 7.6 3.1 17.917.6 16.315.5 14.4 9.1 5.4 4.2 2.5 20.1 18 16.115.314.7 12.1 5.5 2.1 Pass rate (%) Zero-shot Thinking Agentic [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: IAF pass rate (%) per model, averaged over three runs, grouped by evaluation mode (zero [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Cumulative proof rate vs. # turns for DTP and ITP, across three models: CO46, CS46, and K25. CS46 CO46 CH45 K25 DS32 D2 GLM5 L4 Q3C K25 CO46 CS46 MM25 Q32 DSR1 GLM5 CO46 GLM5 K25 CS46 MM25 Q32 0 20 40 60 80 100 Pass / Compile rate (%) Zero-shot Thinking Agentic [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: SSAF results per model-mode. Each stacked bar shows the pass rate (bottom, solid) and the additional compile-only rate (top, faded). CO46 CS46 GLM5 CH45 DS32 K25 D2 L4 Q3C 0 5 10 15 20 −2.7 −2.2 −3.2 −1.3 −2.2 −2.0 −0.8 −1.2 Pass rate (%) +0.3 IAF IAF-D [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Violin plots showing the relationship between test-time compute vs. correctness on IAF. [PITH_FULL_IMAGE:figures/full_fig_p009_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Pass-rate difference (items with tag minus without) for selected feature tags on IAF evaluation results with CO46. matching expression): the generated term compiles and has the right syntactic shape, but is not semantically equivalent. Missing constants (CMiss) and kind mismatches (NMKind) occur very rarely, indicating that models generally follow the structural template. To understand what makes an item … view at source ↗
Figure 15
Figure 15. Figure 15: Evaluation protocol diagrams. results in a REJECT verdict; if a downstream review later discovers that an upstream FAITHFUL verdict was erroneous, the entire dependent chain is reverted. Loophole catalog. Over the course of the project, 13 classes of faithfulness failure (L1–L13) were identified from real committed-then-reverted work, each with a detection signature and a prescribed fix. Representative ex… view at source ↗
Figure 16
Figure 16. Figure 16: IAF error-type breakdown per (model, mode) cell. Each bar is a 100% stacked compo [PITH_FULL_IMAGE:figures/full_fig_p017_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Three-way Venn diagram of passing IAF items for Claude Opus 4.6 (57 pass), GLM-5 [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Per-item tag cardinality across the five tagging dimensions ( [PITH_FULL_IMAGE:figures/full_fig_p024_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Per-tag pass-rate difference for surface_features. Length buckets (len-short, len-medium, len-long) are included and show a monotone effect: shorter items pass at +12.5 pp while longer items lose −10.9 pp. Same conventions as [PITH_FULL_IMAGE:figures/full_fig_p025_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Per-tag pass-rate difference for subject_matter, aggregated over six runs of Opus 4.6 on track1_v3 (N=1950 item–run trials). Bar = rate(items with tag) − rate(items without tag). Stars: * p<.05, ** p<.01, *** p<.001 (two-proportion z-test). Tags with nwith<20 or nwithout<20 omitted. by abstract, mathlib-shaped vocabulary: higher-order (+17.9 pp) and set-theoretic (+7.5 pp) items pass much more often, pres… view at source ↗
Figure 21
Figure 21. Figure 21: Per-tag pass-rate difference for logical_shape. Same conventions as [PITH_FULL_IMAGE:figures/full_fig_p026_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Per-tag pass-rate difference for vocabulary. Same conventions as [PITH_FULL_IMAGE:figures/full_fig_p026_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Per-tag pass-rate difference for definition_shape. Same conventions as [PITH_FULL_IMAGE:figures/full_fig_p027_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: IAF reasoning-token distribution per attempt, split by correctness (left half / green: items [PITH_FULL_IMAGE:figures/full_fig_p027_24.png] view at source ↗
read the original abstract

Auto-formalization is critical for scalable formal verification, but existing progress largely focuses on isolated statements, while theory-scale auto-formalization, which coherently translates hundreds of interdependent definitions, lemmas, and theorems, remains open due to challenges in consistency, faithfulness, scalability, and correctness. In this paper, we introduce LCS-Bench, a stand-alone, theory-scale benchmark based on Logics for Computer Science. LCS-Bench is built through a novel semi-automated agentic pipeline that leverages concept graphs, formal signature planning, issue tracking, sorry-filling with counter-example search, complemented by faithfulness review from human experts. The resulting artifact covers 327 textbook items, over 4,076 Lean declarations, and more than 85K lines of Lean code. The dataset supports broad evaluation through a data engine that automatically derives five tracks of evaluation benchmarks, measuring different aspects of auto-formalization and theorem-proving capabilities. We also introduce a novel evaluation protocol featuring definitional equivalence checkers, enabling more fine-grained and faithful assessment. Through extensive evaluation on 14 models, we demonstrate that (1) LCS-Bench is of high quality, consistent, and faithful; (2) the benchmark is challenging, with state-of-the-art models achieving only 20.1% on auto-formalization tasks; and (3) our analysis reveals key findings regarding theory-scale auto-formalization and suggests promising directions for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LCS-Bench, a theory-scale benchmark for auto-formalization derived from the 'Logics for Computer Science' textbook. It is constructed via a semi-automated agentic pipeline using concept graphs, formal signature planning, issue tracking, sorry-filling with counter-example search, and human faithfulness review. The artifact comprises 327 textbook items, over 4,076 Lean declarations, and more than 85K lines of code. It supports five evaluation tracks via a data engine and introduces definitional equivalence checkers for assessment. Evaluations across 14 models show state-of-the-art performance at only 20.1% on auto-formalization tasks.

Significance. If the pipeline indeed produces a consistent and faithful artifact at this scale, LCS-Bench would be a notable contribution to theory-scale auto-formalization by addressing interdependencies among definitions, lemmas, and theorems that isolated-statement benchmarks overlook. The multi-track evaluation design and definitional equivalence protocol offer a more granular assessment framework than prior work. The reported 20.1% SOTA performance underscores the difficulty of the task and could guide future model development.

major comments (2)
  1. [Abstract] Abstract and the description of the construction pipeline: the central claim that 'LCS-Bench is of high quality, consistent, and faithful' rests on the semi-automated pipeline plus human review, yet no quantitative metrics are supplied for faithfulness (e.g., fraction of items corrected, inter-annotator agreement) or consistency (e.g., results of definitional equivalence checks across the 327 items). This evidence gap is load-bearing for the primary contribution.
  2. [Pipeline description (Section 3)] The human faithfulness review step is described at a high level but lacks specifics on reviewer count, expertise, review protocol, or any error analysis; without these, it is impossible to evaluate whether systematic biases or inconsistencies were introduced in the 85K-line artifact.
minor comments (2)
  1. [Abstract] The 20.1% figure should be explicitly tied to a particular model, track, and evaluation protocol in the abstract for immediate clarity.
  2. [Evaluation section] Notation for the five evaluation tracks and the definitional equivalence checkers would benefit from a small illustrative example early in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and for identifying key areas where additional evidence would strengthen the presentation of LCS-Bench. We address each major comment below and will revise the manuscript to supply the requested quantitative details and process specifications.

read point-by-point responses
  1. Referee: [Abstract] Abstract and the description of the construction pipeline: the central claim that 'LCS-Bench is of high quality, consistent, and faithful' rests on the semi-automated pipeline plus human review, yet no quantitative metrics are supplied for faithfulness (e.g., fraction of items corrected, inter-annotator agreement) or consistency (e.g., results of definitional equivalence checks across the 327 items). This evidence gap is load-bearing for the primary contribution.

    Authors: We agree that the current manuscript lacks explicit quantitative metrics supporting the quality claim. In the revision we will add a dedicated subsection (likely in Section 3 or a new Section 4) that reports: (i) the fraction of the 327 items that required corrections during the human review stage, (ii) inter-annotator agreement statistics from the faithfulness review, and (iii) aggregate results of the definitional equivalence checks performed across the full artifact. These additions will directly substantiate the claim of high quality, consistency, and faithfulness. revision: yes

  2. Referee: [Pipeline description (Section 3)] The human faithfulness review step is described at a high level but lacks specifics on reviewer count, expertise, review protocol, or any error analysis; without these, it is impossible to evaluate whether systematic biases or inconsistencies were introduced in the 85K-line artifact.

    Authors: We accept that the description of the human review is insufficiently detailed. The revised Section 3 will specify: the number of reviewers, their relevant expertise (Lean, formal methods, and the textbook domain), the exact review protocol (independent review followed by consensus discussion), and a concise error analysis summarizing the categories and frequencies of issues identified and resolved. We will also note how the definitional equivalence checkers were used to detect and correct inconsistencies. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper constructs LCS-Bench as a new artifact via an explicitly described semi-automated pipeline (concept graphs, signature planning, issue tracking, sorry-filling with counter-example search, and human faithfulness review) and then evaluates external models on it. No equations, fitted parameters, or self-citations are invoked to derive the benchmark's properties or the reported performance numbers; the scale (327 items, 4,076 declarations, 85K lines) and quality claims rest on the workflow description and external human review rather than reducing to the inputs by construction. The evaluation protocol (definitional equivalence checkers) is presented as an independent verification layer. This matches the default expectation of a self-contained artifact paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper depends on the assumption that the described agentic pipeline plus human review yields faithful and consistent formalizations; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption The semi-automated pipeline using concept graphs and human faithfulness review produces consistent and faithful Lean translations of interdependent textbook items.
    Invoked to support the claim that LCS-Bench is high quality and consistent.

pith-pipeline@v0.9.1-grok · 5811 in / 1324 out tokens · 26518 ms · 2026-06-26T05:25:34.954483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 4 canonical work pages

  1. [1]

    Jiang, Wenda Li, Markus N

    Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  2. [2]

    Formal mathematical reasoning: A new frontier in AI.arXiv preprint, 2024

    Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, and Dawn Song. Formal mathematical reasoning: A new frontier in AI.arXiv preprint, 2024

  3. [3]

    A survey on deep learning for theorem proving.arXiv preprint, 2024

    Zhaoyu Li, Jialiang Sun, Logan Murphy, Qidong Su, Zenan Li, Xian Zhang, Kaiyu Yang, and Xujie Si. A survey on deep learning for theorem proving.arXiv preprint, 2024

  4. [4]

    Autoformalization in the era of large language models: A survey.arXiv preprint, 2025

    Ke Weng, Lun Du, Sirui Li, Wangyue Lu, Haozhe Sun, Hengyu Liu, and Tiancheng Zhang. Autoformalization in the era of large language models: A survey.arXiv preprint, 2025

  5. [5]

    The lean mathematical library

    The mathlib Community. The lean mathematical library. InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, CPP 2020, pages 367– 381, New York, NY , USA, 2020. Association for Computing Machinery. ISBN 9781450370974. doi: 10.1145/3372885.3373824. URLhttps://doi.org/10.1145/3372885.3373824

  6. [6]

    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

    Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics. InInternational Conference on Learning Representations (ICLR), 2022

  7. [7]

    Ayers, Dragomir Radev, and Jeremy Avigad

    Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics.arXiv preprint, 2023

  8. [8]

    FIMO: A challenge formal dataset for automated theorem proving.arXiv preprint, 2023

    Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. FIMO: A challenge formal dataset for automated theorem proving.arXiv preprint, 2023

  9. [9]

    FATE: A formal benchmark series for frontier algebra of multiple difficulty levels.arXiv preprint, 2025

    Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, and Bin Dong. FATE: A formal benchmark series for frontier algebra of multiple difficulty levels.arXiv preprint, 2025

  10. [10]

    CombiBench: Benchmarking LLM capability for combinatorial mathematics.arXiv preprint, 2025

    Junqi Liu, Xiaohan Lin, Jonas Bayer, Yaël Dillies, Weijie Jiang, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong, Zhengfeng Yang, Jujian Zhang, Lihong Zhi, Jia Li, and Zhengying Liu. CombiBench: Benchmarking LLM capability for combinatorial mathematics.arXiv preprint, 2025

  11. [11]

    LeanCat: A benchmark suite for formal category theory in Lean (part I: 1-categories).arXiv preprint, 2025

    Rongge Xu, Hui Dai, Yiming Fu, Jiedong Jiang, Tianjiao Nie, Junkai Wang, Holiverse Yang, and Zhi-Hao Zhang. LeanCat: A benchmark suite for formal category theory in Lean (part I: 1-categories).arXiv preprint, 2025

  12. [12]

    PUTNAMBENCH: evaluating neural theorem- provers on the putnam mathematical competition

    George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PUTNAMBENCH: evaluating neural theorem- provers on the putnam mathematical competition. InProceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY , USA, 2024. Curran Asso...

  13. [13]

    Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar

    Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval- augmented language models. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. 10

  14. [14]

    miniCTX: Neural theorem proving with (long- )contexts.arXiv preprint, 2024

    Jiewen Hu, Thomas Zhu, and Sean Welleck. miniCTX: Neural theorem proving with (long- )contexts.arXiv preprint, 2024

  15. [15]

    Dover Publications, second edition, 2015

    Jean Gallier.Logic for Computer Science: Foundations of Automatic Theorem Proving. Dover Publications, second edition, 2015. A corrected version of the original Wiley edition (pp. 511, 1986)

  16. [16]

    Claude opus 4 system card

    Anthropic. Claude opus 4 system card. Technical report, Anthropic, 2025. URL https: //www.anthropic.com/claude-opus-4-system-card

  17. [17]

    GPT-5 system card

    OpenAI. GPT-5 system card. Technical report, OpenAI, 2025. URL https://openai.com/ research/gpt-5-system-card

  18. [18]

    Marker: Convert pdf to markdown, 2024

    Vikas Subramanian and Datalab. Marker: Convert pdf to markdown, 2024. URL https: //github.com/datalab-to/marker

  19. [19]

    DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

  20. [20]

    DeepSeek-V3 technical report, 2024

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, et al. DeepSeek-V3 technical report, 2024

  21. [21]

    Kimi K2: Open agentic intelligence, 2025

    Kimi Team, Yifan Bai, Yiping Bao, Cheng Chen, Guanduo Chen, et al. Kimi K2: Open agentic intelligence, 2025

  22. [22]

    Jiang, Alexander H

    Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexander H. Liu, Alexandre Sablayrolles, et al. Devstral: Fine-tuning language models for coding agent applications, 2025

  23. [23]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ , 2025. Blog post, April 2025

  24. [24]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, et al. Qwen3 technical report, 2025

  25. [25]

    Qwen2.5-Coder technical report, 2024

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-Coder technical report, 2024

  26. [26]

    MiniMax-M1: Scaling test-time compute efficiently with lightning attention, 2025

    MiniMax, Aili Chen, Aonian Li, Bangwei Gong, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention, 2025

  27. [27]

    ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools, 2024

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, et al. ChatGLM: A family of large language models from GLM-130B t...

  28. [28]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025

  29. [29]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  30. [30]

    Formalml: A benchmark for evaluating formal subgoal completion in machine learning theory, 2025

    Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, and Xiaoxing Ma. Formalml: A benchmark for evaluating formal subgoal completion in machine learning theory, 2025. URL https://arxiv.org/abs/2510.02335

  31. [31]

    ATLAS: Autoformalizing theorems through lifting, augmentation, and synthesis of data.arXiv preprint, 2025

    Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, and Tao Luo. ATLAS: Autoformalizing theorems through lifting, augmentation, and synthesis of data.arXiv preprint, 2025

  32. [32]

    Aria: An agent for retrieval and iterative auto-formalization via dependency graph.arXiv preprint, 2025

    Hanyu Wang, Ruohan Xie, Yutong Wang, Guoxiong Gao, Xintao Yu, and Bin Dong. Aria: An agent for retrieval and iterative auto-formalization via dependency graph.arXiv preprint, 2025

  33. [33]

    Process-driven autoformalization in Lean 4.arXiv preprint, 2024

    Jianqiao Lu, Yingjia Wan, Zhengying Liu, Yinya Huang, Jing Xiong, Chengwu Liu, Jianhao Shen, Hui Jin, Jipeng Zhang, Haiming Wang, Zhicheng Yang, Jing Tang, and Zhijiang Guo. Process-driven autoformalization in Lean 4.arXiv preprint, 2024. 11

  34. [34]

    DRIFT: Decompose, retrieve, illustrate, then formalize theorems.arXiv preprint, 2025

    Meiru Zhang, Philipp Borchert, Milan Gritta, and Gerasimos Lampouras. DRIFT: Decompose, retrieve, illustrate, then formalize theorems.arXiv preprint, 2025

  35. [35]

    Autoformalizer with tool feedback.arXiv preprint, 2025

    Qi Guo, Jianing Wang, Jianfei Zhang, Deyang Kong, Xiangzhou Huang, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, and Wei Ye. Autoformalizer with tool feedback.arXiv preprint, 2025

  36. [36]

    Consistent autoformalization for constructing mathematical libraries

    Lan Zhang, Xin Quan, and Andre Freitas. Consistent autoformalization for constructing mathematical libraries. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, page 4020–4033. Association for Computational Linguistics,

  37. [37]

    URL http://dx.doi.org/10.18653/v1/ 2024.emnlp-main.233

    doi: 10.18653/v1/2024.emnlp-main.233. URL http://dx.doi.org/10.18653/v1/ 2024.emnlp-main.233

  38. [38]

    Min, Yeqi Gao, Wilson Sy, Zhaoyu Li, Xujie Si, and Osbert Bastani

    Marcus J. Min, Yeqi Gao, Wilson Sy, Zhaoyu Li, Xujie Si, and Osbert Bastani. Divide and abstract: Autoformalization via decomposition and abstraction learning. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=NjgaeXNit3

  39. [39]

    Formal verification of a realistic compiler.Communications of the ACM, 52(7): 107–115, 2009

    Xavier Leroy. Formal verification of a realistic compiler.Communications of the ACM, 52(7): 107–115, 2009

  40. [40]

    Jianzhou Zhao, Santosh Nagarakatte, Milo M. K. Martin, and Steve Zdancewic. Formal verification of SSA-based optimizations for LLVM. InProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013

  41. [41]

    seL4: formal verification of an OS kernel

    Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: formal verification of an OS kernel. InProceedings of the ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pages 207–220, 2009

  42. [42]

    CertiKOS: An extensible architecture for building certified concurrent OS kernels

    Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan (Newman) Wu, Jieung Kim, Vilhelm Sjöberg, and David Costanzo. CertiKOS: An extensible architecture for building certified concurrent OS kernels. InUSENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016

  43. [43]

    Lorch, Bryan Parno, Michael L

    Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R. Lorch, Bryan Parno, Michael L. Roberts, Srinath Setty, and Brian Zill. IronFleet: Proving practical distributed systems correct. InProceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2015

  44. [44]

    Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D

    James R. Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D. Ernst, and Thomas Anderson. Verdi: A framework for implementing and formally verifying distributed systems. InProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2015

  45. [45]

    Appel, Lennart Beringer, Adam Chlipala, Greg Morrisett, Benjamin C

    Andrew W. Appel, Lennart Beringer, Adam Chlipala, Greg Morrisett, Benjamin C. Pierce, et al. The DeepSpec project: The science of deep specification. NSF Expedition in Computing, 2017. https://deepspec.org/

  46. [46]

    Appel.Program Logics for Certified Compilers

    Andrew W. Appel.Program Logics for Certified Compilers. Cambridge University Press, 2014

  47. [47]

    Pierce, Andrew W

    Benjamin C. Pierce, Andrew W. Appel, Adam Chlipala, Stephanie Weirich, Chris Casinghino, Arthur Azevedo de Amorim, Lennart Beringer, Brent Yorgey, et al. Software foundations. Electronic textbook series, 2010.https://softwarefoundations.cis.upenn.edu/

  48. [48]

    Springer, 2014

    Tobias Nipkow and Gerwin Klein.Concrete Semantics with Isabelle/HOL. Springer, 2014

  49. [49]

    MIT Press, 2013

    Adam Chlipala.Certified Programming with Dependent Types. MIT Press, 2013

  50. [50]

    SeCaV: A sequent calculus verifier in Isabelle/HOL.arXiv preprint, 2022

    Asta Halkjær From, Frederik Krogsdal Jacobsen, and Jørgen Villadsen. SeCaV: A sequent calculus verifier in Isabelle/HOL.arXiv preprint, 2022

  51. [51]

    Teaching a formalized logical calculus.arXiv preprint, 2020

    Asta Halkjær From, Alexander Birch Jensen, Anders Schlichtkrull, and Jørgen Villadsen. Teaching a formalized logical calculus.arXiv preprint, 2020

  52. [52]

    An analysis of the constructive content of Henkin’s proof of Gödel’s completeness theorem.arXiv preprint, 2024

    Hugo Herbelin and Danko Ilik. An analysis of the constructive content of Henkin’s proof of Gödel’s completeness theorem.arXiv preprint, 2024

  53. [53]

    Aydemir, Aaron Bohannon, Matthew Fairbairn, J

    Brian E. Aydemir, Aaron Bohannon, Matthew Fairbairn, J. Nathan Foster, Benjamin C. Pierce, Peter Sewell, Dimitrios Vytiniotis, Geoffrey Washburn, Stephanie Weirich, and Steve Zdancewic. Mechanized metatheory for the masses: The POPLmark challenge. InInternational Conference on Theorem Proving in Higher Order Logics (TPHOLs), 2005. 12

  54. [54]

    2023 , issue_date =

    Ziyang Li, Jiani Huang, and Mayur Naik. Scallop: A language for neurosymbolic programming. Proceedings of the ACM on Programming Languages (PLDI), 7, 2023. doi: 10.1145/3591280

  55. [55]

    Scallop: From probabilistic deductive databases to scalable differentiable reasoning

    Jiani Huang, Ziyang Li, Binghong Chen, Karan Samel, Mayur Naik, Le Song, and Xujie Si. Scallop: From probabilistic deductive databases to scalable differentiable reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  56. [56]

    Neurosymbolic programming in Scallop: Principles and practice.Foundations and Trends in Programming Languages, 2024

    Ziyang Li, Jiani Huang, Jason Liu, and Mayur Naik. Neurosymbolic programming in Scallop: Principles and practice.Foundations and Trends in Programming Languages, 2024. doi: 10.1561/2500000059

  57. [57]

    DeepProbLog: Neural probabilistic logic programming

    Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. InAdvances in Neural Information Processing Systems (NeurIPS), 2018

  58. [58]

    NeurASP: Embracing neural networks into answer set programming

    Zhun Yang, Adam Ishay, and Joohyung Lee. NeurASP: Embracing neural networks into answer set programming. InInternational Joint Conference on Artificial Intelligence (IJCAI), 2020

  59. [59]

    PROC2PDDL: Open-domain planning representations from texts

    Tianyi Zhang, Li Zhang, Zhaoyi Hou, Ziyu Wang, Yuling Gu, Peter Clark, Chris Callison- Burch, and Niket Tandon. PROC2PDDL: Open-domain planning representations from texts. In Workshop on Natural Language Reasoning and Structured Explanations (NLRSE) at NAACL, 2024

  60. [60]

    On the limit of language models as planning formalizers

    Cassie Huang and Li Zhang. On the limit of language models as planning formalizers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  61. [61]

    Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D

    Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. Pyro: Deep universal probabilistic programming.Journal of Machine Learning Research, 2019

  62. [62]

    Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A

    Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language.Journal of Statistical Software, 76(1), 2017

  63. [63]

    ProbLog: A probabilistic Prolog and its application in link discovery

    Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. InInternational Joint Conference on Artificial Intelligence (IJCAI), 2007

  64. [64]

    SWE-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, 2024

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=VTF8yNQM66

  65. [65]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

  66. [66]

    DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data.arXiv preprint, 2024

    Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data.arXiv preprint, 2024

  67. [67]

    Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F

    Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search.arXiv preprint, 2024

  68. [68]

    Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z.F

    Z.Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z.F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition.arXiv preprint, 2025

  69. [69]

    Goedel-Prover: A frontier model for open-source automated theorem proving.arXiv preprint, 2025

    Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover: A frontier model for open-source automated theorem proving.arXiv preprint, 2025. 13

  70. [70]

    Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning

    Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, et al. Kimina-Prover preview: Towards large formal reasoning models with reinforcement learning. ar...

  71. [71]

    Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample

    Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. InInternational Conference on Learning Representations (ICLR), 2023

  72. [72]

    LEGO-Prover: Neural theorem proving with growing libraries.arXiv preprint, 2023

    Haiming Wang, Huajian Xin, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, Heng Liao, and Xiaodan Liang. LEGO-Prover: Neural theorem proving with growing libraries.arXiv preprint, 2023

  73. [73]

    An in-context learning agent for formal theorem-proving.arXiv preprint, 2023

    Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An in-context learning agent for formal theorem-proving.arXiv preprint, 2023

  74. [74]

    HyperTree proof search for neural theorem proving.arXiv preprint, 2022

    Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. HyperTree proof search for neural theorem proving.arXiv preprint, 2022

  75. [75]

    Reliable evaluation and benchmarks for statement autoformalization

    Auguste Poiroux, Gail Weiss, Viktor Kunˇcak, and Antoine Bosselut. Reliable evaluation and benchmarks for statement autoformalization. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2025

  76. [76]

    Autoformalize mathematical statements by symbolic equivalence and semantic consistency

    Zenan Li, Yifan Wu, Zhaoyu Li, Xinming Wei, Xian Zhang, Fan Yang, and Xiaoxing Ma. Autoformalize mathematical statements by symbolic equivalence and semantic consistency. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  77. [77]

    Autoformalization in the wild: Assessing LLMs on real-world mathematical definitions.arXiv preprint, 2025

    Lan Zhang, Marco Valentino, and Andre Freitas. Autoformalization in the wild: Assessing LLMs on real-world mathematical definitions.arXiv preprint, 2025

  78. [78]

    Evaluating autoformalization robustness via semantically similar paraphrasing.arXiv preprint, 2025

    Hayden Moore and Asfahan Shah. Evaluating autoformalization robustness via semantically similar paraphrasing.arXiv preprint, 2025

  79. [79]

    propositional formula

    Jianqiao Lu, Yingjia Wan, Yinya Huang, Jing Xiong, Zhengying Liu, and Zhijiang Guo. FormalAlign: Automated alignment evaluation for autoformalization.arXiv preprint, 2024. A Details of the Automated Formalization Pipeline This appendix gives a detailed account of the four-stage automated pipeline summarized in Section 3. A.1 Pre-Processing The textbook PD...

  80. [80]

    2.Build succeeds.lake env leanonˆyreturns exit code zero; warnings are permitted

    Declarations present.Every target in Ty appears in ˆywith a type signature whitespace-equivalent to its ground-truth signature. 2.Build succeeds.lake env leanonˆyreturns exit code zero; warnings are permitted. 3.Nosorry.After stripping comments and string literals, nosorrykeyword remains

Showing first 80 references.