pith. sign in

arxiv: 2606.25561 · v1 · pith:M747V2TGnew · submitted 2026-06-24 · 💻 cs.CR

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

Pith reviewed 2026-06-25 20:49 UTC · model grok-4.3

classification 💻 cs.CR
keywords CrypFormBenchformal analysislarge language modelscryptographic schemessymbolic securitycomputational securitybenchmark evaluation
0
0 comments X

The pith

Large language models have limited capability for formal analysis of cryptographic schemes, performing well on interpretation but struggling with generation, transformation, and correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrypFormBench, a new benchmark designed to evaluate the ability of large language models to assist in the formal analysis of cryptographic schemes. This benchmark covers five core capabilities: interpretation, generation, completion, transformation, and correction, using instances from both symbolic and computational security approaches. Evaluation of nine leading LLMs shows they leverage code knowledge for better results on interpretation and completion but have overall limited performance, especially in tasks requiring generation and correction of formal descriptions. The work also offers suggestions for improving outputs through prompting and fine-tuning techniques.

Core claim

CrypFormBench is introduced as a comprehensive benchmark with 700 instances from 677 schemes in 7 formal languages and 160 security properties to measure LLM performance on formal cryptographic analysis tasks. The evaluation demonstrates that LLMs' effectiveness remains limited, with particular difficulties in generation, transformation, and correction despite strengths in interpretation and completion.

What carries the argument

CrypFormBench, a benchmark jointly covering symbolic and computational security to evaluate five core LLM capabilities in formal analysis of cryptographic schemes.

If this is right

  • LLMs can be used to assist experts in interpreting and completing formal specifications of cryptographic schemes.
  • Few-shot prompting, Pass@K sampling, and lightweight fine-tuning can help mitigate issues with producing executable formal outputs.
  • The benchmark enables standardized tracking of improvements in LLM capabilities for this domain.
  • Overall performance limitations indicate that full automation of formal analysis is not yet achievable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved LLM performance on these tasks could reduce the expertise required for formal verification of crypto protocols.
  • The benchmark could serve as a foundation for developing specialized models or tools for security analysis.
  • Connections to other formal methods domains might reveal similar capability patterns in LLMs.

Load-bearing premise

The 700 instances in the benchmark form a representative sample of the labor-intensive manual formal analysis tasks performed by experts.

What would settle it

A follow-up evaluation using a new set of formal analysis instances outside the benchmark that yields markedly different performance results for the tested LLMs.

Figures

Figures reproduced from arXiv: 2606.25561 by Bingzheng Wang, Haihui Fan, Hengyuan Liu, Hongbo Liu, Li Zhou, Qionglu Zhang, Rui Zhang, Xianhui Lu, Xiaoyan Gu, Zhaoxuan Li, Ziming Zhao.

Figure 1
Figure 1. Figure 1: Motivation for benchmarking LLM-assisted Cryptographic Scheme analysis. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The workflow of C.F.B (Crypto Formal Benchmark). Once generated, these specifications can be analyzed by their tools under the adversaries such as Dolev-Yao [19]. This enables a practical pipeline: LLMs translate natural-language descriptions into tool-compatible formal code, which is checked for properties such as secrecy/authentication [51]. Q1. How does formal code generation differ from traditional cod… view at source ↗
Figure 3
Figure 3. Figure 3: The property and length distribution for various languages. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Concrete examples for five benchmark tasks in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Results of the generation experiment. Different colors represent different formal languages. Among [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Results of completion experiment. The last sub-figure shows ACC improvement of completion over [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Results of the correction experiment. The left two show the [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Results of transformation experiment from model and language perspectives. The source [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interpretation results across models and languages. The left heatmap shows the similarities between [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Representative cases of LLM outputs across evaluation tasks, highlighting typical error patterns. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Execution efficiency (analysis time and file size) of successfully analyzed files. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Overall ranks of LLMs under C1-C4. are usually comparable to human-curated scripts. However, redundant declarations, over-nested terms, or ill-structured rules can inflate size and cause timeouts even after compilation. Thus, efficiency is mainly governed by intrinsic tool/language difficulty, while analyzability remains the key bottleneck. It also reveals whether models learn verifier-friendly structure … view at source ↗
Figure 13
Figure 13. Figure 13: Effects of 𝐾-shot prompting strategy (𝐾 ∈ {0, 1, 2, 3}) in task and language perspectives. Transforma￾tion is omitted since it does not yield analyzable outputs for all 𝐾 in this experiment. our multi-signal interpretation score that combines logic similarity, annotation similarity, and tool-executability/verification outcomes (c.f., Eq. (3)). The natural-language references derived from formal specificat… view at source ↗
Figure 14
Figure 14. Figure 14: Effects of LoRA fine-tuning on Qwen2.5-Coder-3B from task- and language-level views. The timeout ratio is computed as #timeout/#analyzed, and the transformation is omitted as it has no analyzable outputs. Timeout/Analyzed ratio ACC A F1 A [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Effects of PASS@K (𝐾 =1, 2, 3) mechanism on task- and language-levels. is computed on this analyzable set [15] [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Performance of LLMs on fine-grained (property-level) evaluation, where [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Performance of GPT-5.1 on language and task levels. [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗
read the original abstract

Manual formal analysis of cryptographic schemes is labor-intensive and requires substantial expertise. While model-checking tools (e.g., Scyther and Tamarin) and computational-security tools (e.g., CryptoVerif and EasyCrypt) improve the automation of security proofs, they still rely on experts to abstract schemes and write tool-specific formal descriptions. Large language models (LLMs) are a promising alternative, but their effectiveness in this domain remains unexplored due to the absence of standardized evaluation methodologies. To fill this gap, we introduce CrypFormBench (C.F.B for short), a comprehensive benchmark jointly covering symbolic and computational security to evaluate five core LLM capabilities: interpretation, generation, completion, transformation, and correction. It comprises 700 instances spanning 677 schemes, 7 mainstream formal verifier languages, and 160 security properties. The evaluation of 9 state-of-the-art LLMs reveals that most of them perform well on interpretation and completion, given their code-awareness advantages, but struggle with generation, transformation, and correction. Overall, their performance remains limited, with Claude-3.5 achieving the highest score at 48.7 out of 100. We further provide practical guidance, e.g., few-shot prompting, Pass@K sampling, and lightweight fine-tuning, to mitigate the executability bottleneck and improve tool-usable outputs. Taken together, our benchmark and analyses offer a grounded view of current progress and concrete directions toward reliable LLM-assisted formal cryptographic analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CrypFormBench, a benchmark with 700 instances spanning 677 cryptographic schemes, 7 formal verifier languages, and 160 security properties, to evaluate LLMs on five core capabilities (interpretation, generation, completion, transformation, correction) for formal analysis of cryptographic schemes. It reports an evaluation of 9 state-of-the-art LLMs, finding stronger performance on interpretation and completion but struggles on generation, transformation, and correction, with Claude-3.5 achieving the highest overall score of 48.7/100, and provides guidance on prompting and fine-tuning to improve outputs.

Significance. If the benchmark construction and evaluation are shown to be rigorous and representative, the work would offer the first standardized methodology for assessing LLM assistance in labor-intensive cryptographic formal analysis tasks, bridging symbolic and computational security approaches and identifying concrete limitations and mitigation strategies.

major comments (2)
  1. [Benchmark construction] Benchmark construction (likely §3): The manuscript provides no details on instance curation process, expert validation of ground truth, sampling strategy across scheme classes or security properties, or inter-rater agreement on task difficulty or correctness. This directly undermines the central claim that the observed performance gaps (e.g., struggles with generation/transformation/correction) reflect intrinsic model limitations rather than benchmark artifacts, as the representativeness of the 700 instances as a sample of expert tasks remains unverified.
  2. [Evaluation methodology] Evaluation methodology (likely §4): No information is supplied on scoring rubrics for the five capabilities, controls for prompt sensitivity, inter-annotator agreement for automated or manual scoring, or how executability of outputs was assessed against the 7 verifier languages. Without these, the headline scores (including 48.7/100) cannot be independently verified or reproduced, making the performance comparison across models load-bearing but unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater transparency in benchmark construction and evaluation methodology. We agree these details are essential for supporting our claims and will revise the manuscript to address both points.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (likely §3): The manuscript provides no details on instance curation process, expert validation of ground truth, sampling strategy across scheme classes or security properties, or inter-rater agreement on task difficulty or correctness. This directly undermines the central claim that the observed performance gaps (e.g., struggles with generation/transformation/correction) reflect intrinsic model limitations rather than benchmark artifacts, as the representativeness of the 700 instances as a sample of expert tasks remains unverified.

    Authors: We acknowledge that the current manuscript omits explicit details on these aspects of benchmark construction. In the revised version, we will add a new subsection to §3 describing the curation process: instances were drawn from published cryptographic literature and public formal verification repositories; ground truth was validated by domain experts in cryptography and formal methods; sampling was stratified to ensure coverage across scheme classes (encryption, signatures, key exchange, protocols) and the 160 security properties; and inter-rater agreement was measured on a 10% subset using Cohen's kappa for both correctness and difficulty ratings. These additions will directly support the representativeness of the benchmark. revision: yes

  2. Referee: [Evaluation methodology] Evaluation methodology (likely §4): No information is supplied on scoring rubrics for the five capabilities, controls for prompt sensitivity, inter-annotator agreement for automated or manual scoring, or how executability of outputs was assessed against the 7 verifier languages. Without these, the headline scores (including 48.7/100) cannot be independently verified or reproduced, making the performance comparison across models load-bearing but unsupported.

    Authors: We agree the evaluation methodology section lacks necessary detail. The revised §4 will include: explicit scoring rubrics for each capability with examples of full, partial, and zero credit; controls for prompt sensitivity via fixed templates and temperature settings with results from multiple prompt variants; inter-annotator agreement statistics (Cohen's kappa) for the manual scoring component performed by two independent annotators; and the precise executability assessment procedure, including automated compilation and verification commands used for each of the 7 languages together with success/failure criteria. These changes will make the reported scores reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

full rationale

The paper constructs CrypFormBench and reports measured LLM performance scores on its 700 instances. No equations, fitted parameters, predictions, or derivations appear in the abstract or described content. The central results (e.g., Claude-3.5 at 48.7/100, relative strengths on interpretation vs. generation) are direct empirical measurements on externally defined tasks, not reductions to the authors' inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the benchmark or scores. Representativeness concerns affect validity but do not create circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The headline performance numbers rest on the assumption that the chosen instances and task definitions capture the essential difficulties of expert formal analysis; the benchmark itself is the primary new entity introduced.

axioms (1)
  • domain assumption The 700 instances across 677 schemes and 160 properties are representative of real manual formal analysis work.
    This premise is required for the benchmark scores to be taken as meaningful measures of LLM capability in the domain.
invented entities (1)
  • CrypFormBench no independent evidence
    purpose: Standardized evaluation of LLMs on symbolic and computational cryptographic formal analysis tasks
    New benchmark introduced by the paper to address the absence of evaluation methodologies.

pith-pipeline@v0.9.1-grok · 5831 in / 1482 out tokens · 36763 ms · 2026-06-25T20:49:36.433581+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    Nada AlMarwani and Mona T. Diab. 2021. Discrete Cosine Transform as Universal Sentence Encoder. In ACL/IJCNLP (2). Association for Computational Linguistics, 419–426. doi:10.18653/V1/2021.ACL-SHORT.53

  2. [2]

    Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Repre- sentations of Code. In ICLR (Poster). OpenReview.net

  3. [3]

    Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

  4. [4]

    Alessandro Armando, David A. Basin, Yohan Boichut, Yannick Chevalier, Luca Compagna, Jorge Cuéllar, Paul Hankes Drielsma, Pierre-Cyrille Héam, Olga Kouchnarenko, Jacopo Mantovani, Sebastian Mödersheim, David von Oheimb, Michaël Rusinowitch, Judson Santiago, Mathieu Turuani, Luca Viganò, and Laurent Vigneron. 2005. The AVISPA Tool for the Automated Validat...

  5. [5]

    Matteo Avalle, Alfredo Pironti, and Riccardo Sisto. 2014. Formal verification of security protocol implementations: a survey. Formal Aspects Comput. 26, 1 (2014), 99–123. doi:10.1007/S00165-012-0269-9

  6. [6]

    Ayers, Dragomir Radev, and Jeremy Avigad

    Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad

  7. [7]

    CoRR abs/2302.12433 (2023)

    ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. CoRR abs/2302.12433 (2023). doi:10.48550/ARXIV.2302.12433

  8. [8]

    Manuel Barbosa, Gilles Barthe, Karthik Bhargavan, Bruno Blanchet, Cas Cremers, Kevin Liao, and Bryan Parno. 2021. SoK: Computer-Aided Cryptography. In SP. IEEE, 777–795. doi:10.1109/SP40001.2021.00008

  9. [9]

    Gilles Barthe, Benjamin Grégoire, Sylvain Heraud, and Santiago Zanella-Béguelin. 2011. Computer-Aided Security Proofs for the Working Cryptographer. In CRYPTO (Lecture Notes in Computer Science, Vol. 6841) . Springer, 71–90. doi:10.1007/978-3-642-22792-9_5

  10. [10]

    Basin, Jannik Dreier, Lucca Hirschi, Sasa Radomirovic, Ralf Sasse, and Vincent Stettler

    David A. Basin, Jannik Dreier, Lucca Hirschi, Sasa Radomirovic, Ralf Sasse, and Vincent Stettler. 2018. A Formal Analysis of 5G Authentication. In CCS. ACM, 1383–1396. doi:10.1145/3243734.3243846

  11. [11]

    Yves Bertot and Pierre Castéran. 2004. Interactive Theorem Proving and Program Development - Coq’Art: The Calculus of Inductive Constructions. Springer

  12. [12]

    Bruno Blanchet. 2001. An Efficient Cryptographic Protocol Verifier Based on Prolog Rules. In CSFW. IEEE Computer Society, 82–96. doi:10.1109/CSFW.2001.930138

  13. [13]

    Bruno Blanchet. 2006. A Computationally Sound Mechanized Prover for Security Protocols. In S&P. IEEE Computer Society, 140–154. doi:10.1109/SP.2006.1

  14. [14]

    Tom Brown, Benjamin Mann, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. 1877–1901

  15. [15]

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. CoRR abs/2303.12712 (2023). doi:10.48550/ARXIV.2303.12712

  16. [16]

    Mark Chen, Jerry Tworek, Heewoo Jun, et al . 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021)

  17. [17]

    Yongchao Chen, Rujul Gandhi, Yang Zhang, and Chuchu Fan. 2023. NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models. In EMNLP. Association for Computational Linguistics, 15880–15903. doi:10.18653/V1/2023.EMNLP-MAIN.985

  18. [18]

    Vincent Cheval, Charlie Jacomme, Steve Kremer, and Robert Künnemann. 2022. SAPIC+: protocol verifiers of the world, unite!. In USENIX Security Symposium. USENIX Association, 3935–3952

  19. [19]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2023. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 24 (2023), 240:1–240:113

  20. [20]

    Véronique Cortier, Steve Kremer, and Bogdan Warinschi. 2011. A Survey of Symbolic Methods in Computational Analysis of Cryptographic Systems. J. Autom. Reason. 46, 3-4 (2011), 225–259. doi:10.1007/S10817-010-9187-9

  21. [21]

    Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models. In CA V (2) (Lecture Notes in Computer Science, Vol. 13965) . Springer, 383–396. doi:10.1007/978-3-031-37703-7_18

  22. [22]

    Cas J. F. Cremers. 2008. The Scyther Tool: Verification, Falsification, and Analysis of Security Protocols. InCA V (Lecture Notes in Computer Science, Vol. 5123) . Springer, 414–418. doi:10.1007/978-3-540-70545-1_38

  23. [23]

    Tim Dierks and Eric Rescorla. 2008. The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (2008), 1–104. doi:10.17487/RFC5246

  24. [24]

    Santiago Escobar, Catherine Meadows, and José Meseguer. 2007. Maude-NPA: Cryptographic Protocol Analysis Modulo Equational Properties. In FOSAD (Lecture Notes in Computer Science, Vol. 5705) . Springer, 1–50. doi:10.1007/978-3-642- 03829-7_1

  25. [25]

    Eval4LLMs. 2025. The code and experiment results of CrypFormBench. [EB/OL]. https://github.com/Eval4LLMs/ CrypFormBench/tree/main

  26. [26]

    Hugging Face. 2025. BAAI/bge-large-en-v1.5. https://huggingface.co/BAAI/bge-large-en-v1.5 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE177. Publication date: July 2026. FSE177:22 Z. Li, Q. Zhang, H. Liu, X. Gu, X. Lu, H. Liu, B. Wang, H. Fan, Z. Zhao, R. Zhang, and L. Zhou

  27. [27]

    Hugging Face. 2025. intfloat/e5-large-v2. https://huggingface.co/intfloat/e5-large-v2

  28. [28]

    Hugging Face. 2025. Qwen/Qwen2.5-Coder-3B. https://huggingface.co/Qwen/Qwen2.5-Coder-3B

  29. [29]

    Hugging Face. 2025. Qwen/Qwen3-Embedding-8B. https://huggingface.co/Qwen/Qwen3-Embedding-8B

  30. [30]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In EMNLP (Findings) (Findings of ACL, Vol. EMNLP 2020) . Association for Computational Linguistics, 1536–1547. doi:10.18653/V1/2020.FINDINGS-EMNLP.139

  31. [31]

    Luyu Gao, Aman Madaan, Shuyan Zhou, et al. 2023. PAL: Program-aided Language Models. In ICML (Proceedings of Machine Learning Research, Vol. 202). PMLR, 10764–10799

  32. [32]

    Ivan Gavran, Eva Darulova, and Rupak Majumdar. 2020. Interactive synthesis of temporal specifications from examples and natural language. Proc. ACM Program. Lang. 4, OOPSLA (2020), 201:1–201:26. doi:10.1145/3428269

  33. [33]

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024). doi:10.48550/ARXIV.2406.12793

  34. [34]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025). doi:10.48550/ARXIV.2501.12948

  35. [35]

    Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In ACL (1). Association for Computational Linguistics, 4524–4535. doi:10.18653/V1/P19-1444

  36. [36]

    Dick Hardt. 2012. The OAuth 2.0 Authorization Framework. RFC 6749 (2012), 1–76. doi:10.17487/RFC6749

  37. [37]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In ICLR. OpenReview.net

  38. [38]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024). doi:10.48550/ARXIV. 2410.21276

  39. [39]

    Nadim Kobeissi, Georgio Nicolas, and Mukesh Tiwari. 2020. Verifpal: Cryptographic Protocol Analysis for the Real World. In INDOCRYPT (Lecture Notes in Computer Science, Vol. 12578) . Springer, 151–202. doi:10.1007/978-3-030-65277- 7_8

  40. [40]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS. 1–16

  41. [41]

    Tingting Li, Ziming Zhao, Zhaoxuan Li, Xiaofei Yue, and Jiongchi Yu. 2026. Fair and Carbon-Aware LLM Routing for Web Services. In WWW. ACM, 9563–9571. doi:10.1145/3774904.3793001

  42. [42]

    Yujia Li, David Choi, Junyoung Chung, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097. doi:10.48550/ARXIV.2203.07814

  43. [43]

    Yixuan Li, Julian Parsert, and Elizabeth Polgreen. 2024. Guiding Enumerative Program Synthesis with Large Language Models. In CA V (2) (Lecture Notes in Computer Science, Vol. 14682). Springer, 280–301. doi:10.1007/978-3-031-65630-9_15

  44. [44]

    Zhaoxuan Li, Rui Zhang, and Pengchao Li. 2020. A Secure and Efficient Smart Contract Execution Scheme. In ICWS (Lecture Notes in Computer Science) . Springer, 17–32. doi:10.1007/978-3-030-59618-7_2

  45. [45]

    Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, and Michael D Ernst. 2017. Program synthesis from natural language using recurrent neural networks. University of Washington Department of Computer Science and Engineering, Seattle, W A, USA, Tech. Rep. UW-CSE-17-031 (2017), 1–12

  46. [46]

    Liu, Kevin Lin, John Hewitt, et al

    Nelson F. Liu, Kevin Lin, John Hewitt, et al. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. doi:10.1162/TACL_A_00638

  47. [47]

    Gavin Lowe. 1995. An Attack on the Needham-Schroeder Public-Key Authentication Protocol. Inf. Process. Lett. 56, 3 (1995), 131–133. doi:10.1016/0020-0190(95)00144-2

  48. [48]

    Siqi Lu, Hanjie Dong, Zhaoxuan Li, and Laurence T. Yang. 2024. Not Just Summing: The Identifier Leakage of Private-Join-and-Compute and its Improvement. IEEE Trans. Dependable Secur. Comput. 21, 6 (2024), 5143–5155. doi:10.1109/TDSC.2024.3371569

  49. [49]

    Siqi Lu, Zhaoxuan Li, Xuyang Miao, Qingdi Han, and Jianhua Zheng. 2023. PIWS: Private Intersection Weighted Sum Protocol for Privacy-Preserving Score-Based Voting With Perfect Ballot Secrecy. IEEE Trans. Comput. Soc. Syst. 10, 3 (2023), 1039–1056. doi:10.1109/TCSS.2022.3162869

  50. [51]

    Ziyu Mao, Jingyi Wang, Jun Sun, Shengchao Qin, and Jiawen Xiong. 2025. LLM-aided Automatic Modeling for Security Protocol Verification. In ICSE. doi:10.1109/ICSE55347.2025.00197 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE177. Publication date: July 2026. Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes FSE177:23

  51. [53]

    Meta. 2025. Llama 4. https://www.llama.com/models/llama-4/

  52. [54]

    Needham and Michael D

    Roger M. Needham and Michael D. Schroeder. 1978. Using Encryption for Authentication in Large Networks of Computers. Commun. ACM 21, 12 (1978), 993–999. doi:10.1145/359657.359659

  53. [55]

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, et al . 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In ICLR. OpenReview.net

  54. [56]

    OpenAI. 2025. GPT-5.1: A smarter, more conversational ChatGPT. https://openai.com/index/gpt-5-1/

  55. [57]

    Stanislas Polu and Ilya Sutskever. 2020. Generative Language Modeling for Automated Theorem Proving. CoRR abs/2009.03393 (2020)

  56. [58]

    Colin Raffel, Noam Shazeer, Adam Roberts, et al . 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.J. Mach. Learn. Res. 21 (2020), 140:1–140:67

  57. [59]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 3980–3990. doi:10.18653/V1/D19-1410

  58. [60]

    Raúl Rojas. 2015. A Tutorial Introduction to the Lambda Calculus. CoRR abs/1503.09060 (2015)

  59. [61]

    Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. CoRR abs/2507.06261 (2025). doi:10.48550/ARXIV.2507.06261

  60. [62]

    Haiming Wang, Ye Yuan, Zhengying Liu, et al. 2023. DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function. In ACL (1). Association for Computational Linguistics, 12632–12646. doi:10.18653/V1/2023.ACL-LONG.706

  61. [63]

    Jason Wei, Yi Tay, Rishi Bommasani, et al. 2022. Emergent Abilities of Large Language Models. Trans. Mach. Learn. Res. 2022 (2022)

  62. [64]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS

  63. [65]

    xAI. 2025. Grok 3. https://x.ai/grok

  64. [66]

    Swope, Alex Gu, et al

    Kaiyu Yang, Aidan M. Swope, Alex Gu, et al. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. In NeurIPS

  65. [67]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In NeurIPS

  66. [68]

    Tatu Ylönen and Chris Lonvick. 2006. The Secure Shell (SSH) Authentication Protocol. RFC 4252 (2006), 1–17. doi:10.17487/RFC4252

  67. [69]

    Xiaofei Yue, Fangming Zhao, Fulun Ye, Jiongchi Yu, Zhaoxuan Li, Tingting Li, Ziming Zhao, and Jianwei Yin. 2026. HeteroSim: Towards High-Fidelity Heterogeneous LLM Training Simulation on GPUs. In WWW. ACM, 5189–5197. doi:10.1145/3774904.3792254

  68. [70]

    Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2023. Multilingual Code Co-evolution using Large Language Models. In ESEC/SIGSOFT FSE. ACM, 695–707. doi:10.1145/3611643.3616350

  69. [71]

    Rui Zhang, Zhaoxuan Li, and Lijuan Zheng. 2021. Secure and Efficient Key Hierarchical Management and Collaborative Signature Schemes of Blockchain. In ICAIS (2) (Lecture Notes in Computer Science) . Springer, 332–345. doi:10.1007/978- 3-030-78612-0_27

  70. [72]

    Ziming Zhao, Zhaoxuan Li, Tingting Li, and Fan Zhang. 2025. CyberLLM: Enable Mapping CVE to Tactics and Techniques of Cyber Threats via LLM. In DASFAA (5) (Lecture Notes in Computer Science) . Springer, 473–488. doi:10. 1007/978-981-95-4155-3_33

  71. [73]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS

  72. [74]

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024). doi:10.48550/ARXIV.2406.11931 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. F...