CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

Bingzheng Wang; Haihui Fan; Hengyuan Liu; Hongbo Liu; Li Zhou; Qionglu Zhang; Rui Zhang; Xianhui Lu; Xiaoyan Gu; Zhaoxuan Li

arxiv: 2606.25561 · v1 · pith:M747V2TGnew · submitted 2026-06-24 · 💻 cs.CR

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

Zhaoxuan Li , Qionglu Zhang , Hengyuan Liu , Xiaoyan Gu , Xianhui Lu , Hongbo Liu , Bingzheng Wang , Haihui Fan

show 3 more authors

Ziming Zhao Rui Zhang Li Zhou

This is my paper

Pith reviewed 2026-06-25 20:49 UTC · model grok-4.3

classification 💻 cs.CR

keywords CrypFormBenchformal analysislarge language modelscryptographic schemessymbolic securitycomputational securitybenchmark evaluation

0 comments

The pith

Large language models have limited capability for formal analysis of cryptographic schemes, performing well on interpretation but struggling with generation, transformation, and correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrypFormBench, a new benchmark designed to evaluate the ability of large language models to assist in the formal analysis of cryptographic schemes. This benchmark covers five core capabilities: interpretation, generation, completion, transformation, and correction, using instances from both symbolic and computational security approaches. Evaluation of nine leading LLMs shows they leverage code knowledge for better results on interpretation and completion but have overall limited performance, especially in tasks requiring generation and correction of formal descriptions. The work also offers suggestions for improving outputs through prompting and fine-tuning techniques.

Core claim

CrypFormBench is introduced as a comprehensive benchmark with 700 instances from 677 schemes in 7 formal languages and 160 security properties to measure LLM performance on formal cryptographic analysis tasks. The evaluation demonstrates that LLMs' effectiveness remains limited, with particular difficulties in generation, transformation, and correction despite strengths in interpretation and completion.

What carries the argument

CrypFormBench, a benchmark jointly covering symbolic and computational security to evaluate five core LLM capabilities in formal analysis of cryptographic schemes.

If this is right

LLMs can be used to assist experts in interpreting and completing formal specifications of cryptographic schemes.
Few-shot prompting, Pass@K sampling, and lightweight fine-tuning can help mitigate issues with producing executable formal outputs.
The benchmark enables standardized tracking of improvements in LLM capabilities for this domain.
Overall performance limitations indicate that full automation of formal analysis is not yet achievable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved LLM performance on these tasks could reduce the expertise required for formal verification of crypto protocols.
The benchmark could serve as a foundation for developing specialized models or tools for security analysis.
Connections to other formal methods domains might reveal similar capability patterns in LLMs.

Load-bearing premise

The 700 instances in the benchmark form a representative sample of the labor-intensive manual formal analysis tasks performed by experts.

What would settle it

A follow-up evaluation using a new set of formal analysis instances outside the benchmark that yields markedly different performance results for the tested LLMs.

Figures

Figures reproduced from arXiv: 2606.25561 by Bingzheng Wang, Haihui Fan, Hengyuan Liu, Hongbo Liu, Li Zhou, Qionglu Zhang, Rui Zhang, Xianhui Lu, Xiaoyan Gu, Zhaoxuan Li, Ziming Zhao.

**Figure 2.** Figure 2: The workflow of C.F.B (Crypto Formal Benchmark). Once generated, these specifications can be analyzed by their tools under the adversaries such as Dolev-Yao [19]. This enables a practical pipeline: LLMs translate natural-language descriptions into tool-compatible formal code, which is checked for properties such as secrecy/authentication [51]. Q1. How does formal code generation differ from traditional cod… view at source ↗

**Figure 3.** Figure 3: The property and length distribution for various languages. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Concrete examples for five benchmark tasks in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Results of the generation experiment. Different colors represent different formal languages. Among [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Results of completion experiment. The last sub-figure shows ACC improvement of completion over [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Results of the correction experiment. The left two show the [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Results of transformation experiment from model and language perspectives. The source [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Interpretation results across models and languages. The left heatmap shows the similarities between [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Representative cases of LLM outputs across evaluation tasks, highlighting typical error patterns. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Execution efficiency (analysis time and file size) of successfully analyzed files. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

**Figure 12.** Figure 12: Overall ranks of LLMs under C1-C4. are usually comparable to human-curated scripts. However, redundant declarations, over-nested terms, or ill-structured rules can inflate size and cause timeouts even after compilation. Thus, efficiency is mainly governed by intrinsic tool/language difficulty, while analyzability remains the key bottleneck. It also reveals whether models learn verifier-friendly structure … view at source ↗

**Figure 13.** Figure 13: Effects of 𝐾-shot prompting strategy (𝐾 ∈ {0, 1, 2, 3}) in task and language perspectives. Transformation is omitted since it does not yield analyzable outputs for all 𝐾 in this experiment. our multi-signal interpretation score that combines logic similarity, annotation similarity, and tool-executability/verification outcomes (c.f., Eq. (3)). The natural-language references derived from formal specificat… view at source ↗

**Figure 14.** Figure 14: Effects of LoRA fine-tuning on Qwen2.5-Coder-3B from task- and language-level views. The timeout ratio is computed as #timeout/#analyzed, and the transformation is omitted as it has no analyzable outputs. Timeout/Analyzed ratio ACC A F1 A [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

**Figure 15.** Figure 15: Effects of PASS@K (𝐾 =1, 2, 3) mechanism on task- and language-levels. is computed on this analyzable set [15] [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗

**Figure 16.** Figure 16: Performance of LLMs on fine-grained (property-level) evaluation, where [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗

**Figure 17.** Figure 17: Performance of GPT-5.1 on language and task levels. [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗

read the original abstract

Manual formal analysis of cryptographic schemes is labor-intensive and requires substantial expertise. While model-checking tools (e.g., Scyther and Tamarin) and computational-security tools (e.g., CryptoVerif and EasyCrypt) improve the automation of security proofs, they still rely on experts to abstract schemes and write tool-specific formal descriptions. Large language models (LLMs) are a promising alternative, but their effectiveness in this domain remains unexplored due to the absence of standardized evaluation methodologies. To fill this gap, we introduce CrypFormBench (C.F.B for short), a comprehensive benchmark jointly covering symbolic and computational security to evaluate five core LLM capabilities: interpretation, generation, completion, transformation, and correction. It comprises 700 instances spanning 677 schemes, 7 mainstream formal verifier languages, and 160 security properties. The evaluation of 9 state-of-the-art LLMs reveals that most of them perform well on interpretation and completion, given their code-awareness advantages, but struggle with generation, transformation, and correction. Overall, their performance remains limited, with Claude-3.5 achieving the highest score at 48.7 out of 100. We further provide practical guidance, e.g., few-shot prompting, Pass@K sampling, and lightweight fine-tuning, to mitigate the executability bottleneck and improve tool-usable outputs. Taken together, our benchmark and analyses offer a grounded view of current progress and concrete directions toward reliable LLM-assisted formal cryptographic analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CrypFormBench is a new multi-language benchmark for LLM crypto formal tasks, but the reported performance gaps rest on unverified instance construction.

read the letter

The main point is that this paper introduces CrypFormBench to test LLMs across interpretation, generation, completion, transformation, and correction for both symbolic and computational crypto security. It covers seven verifier languages and 160 properties in 700 instances drawn from 677 schemes. The evaluation of nine models shows they handle interpretation and completion better than the other three tasks, with Claude-3.5 at the top but still below 50 out of 100. The authors also note some prompting and fine-tuning steps that can help produce more executable outputs.

What is new is the joint coverage of symbolic and computational sides plus the range of languages and capabilities in one benchmark. No earlier work is cited that tries this scope, so the construction itself fills a stated gap in standardized testing for this domain. The practical guidance section is a small plus because it moves beyond raw scores to concrete suggestions.

The soft spot is exactly the one flagged in the stress-test note. The abstract and available description give no information on how the instances were selected, how ground truth was validated by experts, or what sampling was used across scheme types and properties. Without those details, it is difficult to tell whether the low scores on generation and correction reflect real model limits or artifacts of how the tasks were written. If the full paper has a solid methods section on curation and inter-rater checks, that would strengthen the claims considerably; otherwise the central finding stays provisional.

This paper is mainly for researchers working on LLM-assisted formal methods or automated verification in security. Readers who build or use benchmarks in code-like domains would get value from the scope and the reported failure modes.

It deserves a serious referee. A benchmark with this breadth is worth community scrutiny even if the current evidence on representativeness needs more work.

Referee Report

2 major / 0 minor

Summary. The paper introduces CrypFormBench, a benchmark with 700 instances spanning 677 cryptographic schemes, 7 formal verifier languages, and 160 security properties, to evaluate LLMs on five core capabilities (interpretation, generation, completion, transformation, correction) for formal analysis of cryptographic schemes. It reports an evaluation of 9 state-of-the-art LLMs, finding stronger performance on interpretation and completion but struggles on generation, transformation, and correction, with Claude-3.5 achieving the highest overall score of 48.7/100, and provides guidance on prompting and fine-tuning to improve outputs.

Significance. If the benchmark construction and evaluation are shown to be rigorous and representative, the work would offer the first standardized methodology for assessing LLM assistance in labor-intensive cryptographic formal analysis tasks, bridging symbolic and computational security approaches and identifying concrete limitations and mitigation strategies.

major comments (2)

[Benchmark construction] Benchmark construction (likely §3): The manuscript provides no details on instance curation process, expert validation of ground truth, sampling strategy across scheme classes or security properties, or inter-rater agreement on task difficulty or correctness. This directly undermines the central claim that the observed performance gaps (e.g., struggles with generation/transformation/correction) reflect intrinsic model limitations rather than benchmark artifacts, as the representativeness of the 700 instances as a sample of expert tasks remains unverified.
[Evaluation methodology] Evaluation methodology (likely §4): No information is supplied on scoring rubrics for the five capabilities, controls for prompt sensitivity, inter-annotator agreement for automated or manual scoring, or how executability of outputs was assessed against the 7 verifier languages. Without these, the headline scores (including 48.7/100) cannot be independently verified or reproduced, making the performance comparison across models load-bearing but unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater transparency in benchmark construction and evaluation methodology. We agree these details are essential for supporting our claims and will revise the manuscript to address both points.

read point-by-point responses

Referee: [Benchmark construction] Benchmark construction (likely §3): The manuscript provides no details on instance curation process, expert validation of ground truth, sampling strategy across scheme classes or security properties, or inter-rater agreement on task difficulty or correctness. This directly undermines the central claim that the observed performance gaps (e.g., struggles with generation/transformation/correction) reflect intrinsic model limitations rather than benchmark artifacts, as the representativeness of the 700 instances as a sample of expert tasks remains unverified.

Authors: We acknowledge that the current manuscript omits explicit details on these aspects of benchmark construction. In the revised version, we will add a new subsection to §3 describing the curation process: instances were drawn from published cryptographic literature and public formal verification repositories; ground truth was validated by domain experts in cryptography and formal methods; sampling was stratified to ensure coverage across scheme classes (encryption, signatures, key exchange, protocols) and the 160 security properties; and inter-rater agreement was measured on a 10% subset using Cohen's kappa for both correctness and difficulty ratings. These additions will directly support the representativeness of the benchmark. revision: yes
Referee: [Evaluation methodology] Evaluation methodology (likely §4): No information is supplied on scoring rubrics for the five capabilities, controls for prompt sensitivity, inter-annotator agreement for automated or manual scoring, or how executability of outputs was assessed against the 7 verifier languages. Without these, the headline scores (including 48.7/100) cannot be independently verified or reproduced, making the performance comparison across models load-bearing but unsupported.

Authors: We agree the evaluation methodology section lacks necessary detail. The revised §4 will include: explicit scoring rubrics for each capability with examples of full, partial, and zero credit; controls for prompt sensitivity via fixed templates and temperature settings with results from multiple prompt variants; inter-annotator agreement statistics (Cohen's kappa) for the manual scoring component performed by two independent annotators; and the precise executability assessment procedure, including automated compilation and verification commands used for each of the 7 languages together with success/failure criteria. These changes will make the reported scores reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

full rationale

The paper constructs CrypFormBench and reports measured LLM performance scores on its 700 instances. No equations, fitted parameters, predictions, or derivations appear in the abstract or described content. The central results (e.g., Claude-3.5 at 48.7/100, relative strengths on interpretation vs. generation) are direct empirical measurements on externally defined tasks, not reductions to the authors' inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the benchmark or scores. Representativeness concerns affect validity but do not create circularity under the specified criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The headline performance numbers rest on the assumption that the chosen instances and task definitions capture the essential difficulties of expert formal analysis; the benchmark itself is the primary new entity introduced.

axioms (1)

domain assumption The 700 instances across 677 schemes and 160 properties are representative of real manual formal analysis work.
This premise is required for the benchmark scores to be taken as meaningful measures of LLM capability in the domain.

invented entities (1)

CrypFormBench no independent evidence
purpose: Standardized evaluation of LLMs on symbolic and computational cryptographic formal analysis tasks
New benchmark introduced by the paper to address the absence of evaluation methodologies.

pith-pipeline@v0.9.1-grok · 5831 in / 1482 out tokens · 36763 ms · 2026-06-25T20:49:36.433581+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 42 canonical work pages · 7 internal anchors

[1]

Nada AlMarwani and Mona T. Diab. 2021. Discrete Cosine Transform as Universal Sentence Encoder. In ACL/IJCNLP (2). Association for Computational Linguistics, 419–426. doi:10.18653/V1/2021.ACL-SHORT.53

work page doi:10.18653/v1/2021.acl-short.53 2021
[2]

Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Repre- sentations of Code. In ICLR (Poster). OpenReview.net

2019
[3]

Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

2024
[4]

Alessandro Armando, David A. Basin, Yohan Boichut, Yannick Chevalier, Luca Compagna, Jorge Cuéllar, Paul Hankes Drielsma, Pierre-Cyrille Héam, Olga Kouchnarenko, Jacopo Mantovani, Sebastian Mödersheim, David von Oheimb, Michaël Rusinowitch, Judson Santiago, Mathieu Turuani, Luca Viganò, and Laurent Vigneron. 2005. The AVISPA Tool for the Automated Validat...

work page doi:10.1007/11513988_27 2005
[5]

Matteo Avalle, Alfredo Pironti, and Riccardo Sisto. 2014. Formal verification of security protocol implementations: a survey. Formal Aspects Comput. 26, 1 (2014), 99–123. doi:10.1007/S00165-012-0269-9

work page doi:10.1007/s00165-012-0269-9 2014
[6]

Ayers, Dragomir Radev, and Jeremy Avigad

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad
[7]

CoRR abs/2302.12433 (2023)

ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. CoRR abs/2302.12433 (2023). doi:10.48550/ARXIV.2302.12433

work page doi:10.48550/arxiv.2302.12433 2023
[8]

Manuel Barbosa, Gilles Barthe, Karthik Bhargavan, Bruno Blanchet, Cas Cremers, Kevin Liao, and Bryan Parno. 2021. SoK: Computer-Aided Cryptography. In SP. IEEE, 777–795. doi:10.1109/SP40001.2021.00008

work page doi:10.1109/sp40001.2021.00008 2021
[9]

Gilles Barthe, Benjamin Grégoire, Sylvain Heraud, and Santiago Zanella-Béguelin. 2011. Computer-Aided Security Proofs for the Working Cryptographer. In CRYPTO (Lecture Notes in Computer Science, Vol. 6841) . Springer, 71–90. doi:10.1007/978-3-642-22792-9_5

work page doi:10.1007/978-3-642-22792-9_5 2011
[10]

Basin, Jannik Dreier, Lucca Hirschi, Sasa Radomirovic, Ralf Sasse, and Vincent Stettler

David A. Basin, Jannik Dreier, Lucca Hirschi, Sasa Radomirovic, Ralf Sasse, and Vincent Stettler. 2018. A Formal Analysis of 5G Authentication. In CCS. ACM, 1383–1396. doi:10.1145/3243734.3243846

work page doi:10.1145/3243734.3243846 2018
[11]

Yves Bertot and Pierre Castéran. 2004. Interactive Theorem Proving and Program Development - Coq’Art: The Calculus of Inductive Constructions. Springer

2004
[12]

Bruno Blanchet. 2001. An Efficient Cryptographic Protocol Verifier Based on Prolog Rules. In CSFW. IEEE Computer Society, 82–96. doi:10.1109/CSFW.2001.930138

work page doi:10.1109/csfw.2001.930138 2001
[13]

Bruno Blanchet. 2006. A Computationally Sound Mechanized Prover for Security Protocols. In S&P. IEEE Computer Society, 140–154. doi:10.1109/SP.2006.1

work page doi:10.1109/sp.2006.1 2006
[14]

Tom Brown, Benjamin Mann, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. 1877–1901

2020
[15]

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. CoRR abs/2303.12712 (2023). doi:10.48550/ARXIV.2303.12712

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712 2023
[16]

Mark Chen, Jerry Tworek, Heewoo Jun, et al . 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021)

Pith/arXiv arXiv 2021
[17]

Yongchao Chen, Rujul Gandhi, Yang Zhang, and Chuchu Fan. 2023. NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models. In EMNLP. Association for Computational Linguistics, 15880–15903. doi:10.18653/V1/2023.EMNLP-MAIN.985

work page doi:10.18653/v1/2023.emnlp-main.985 2023
[18]

Vincent Cheval, Charlie Jacomme, Steve Kremer, and Robert Künnemann. 2022. SAPIC+: protocol verifiers of the world, unite!. In USENIX Security Symposium. USENIX Association, 3935–3952

2022
[19]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2023. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 24 (2023), 240:1–240:113

2023
[20]

Véronique Cortier, Steve Kremer, and Bogdan Warinschi. 2011. A Survey of Symbolic Methods in Computational Analysis of Cryptographic Systems. J. Autom. Reason. 46, 3-4 (2011), 225–259. doi:10.1007/S10817-010-9187-9

work page doi:10.1007/s10817-010-9187-9 2011
[21]

Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models. In CA V (2) (Lecture Notes in Computer Science, Vol. 13965) . Springer, 383–396. doi:10.1007/978-3-031-37703-7_18

work page doi:10.1007/978-3-031-37703-7_18 2023
[22]

Cas J. F. Cremers. 2008. The Scyther Tool: Verification, Falsification, and Analysis of Security Protocols. InCA V (Lecture Notes in Computer Science, Vol. 5123) . Springer, 414–418. doi:10.1007/978-3-540-70545-1_38

work page doi:10.1007/978-3-540-70545-1_38 2008
[23]

Tim Dierks and Eric Rescorla. 2008. The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (2008), 1–104. doi:10.17487/RFC5246

work page doi:10.17487/rfc5246 2008
[24]

Santiago Escobar, Catherine Meadows, and José Meseguer. 2007. Maude-NPA: Cryptographic Protocol Analysis Modulo Equational Properties. In FOSAD (Lecture Notes in Computer Science, Vol. 5705) . Springer, 1–50. doi:10.1007/978-3-642- 03829-7_1

work page doi:10.1007/978-3-642- 2007
[25]

Eval4LLMs. 2025. The code and experiment results of CrypFormBench. [EB/OL]. https://github.com/Eval4LLMs/ CrypFormBench/tree/main

2025
[26]

Hugging Face. 2025. BAAI/bge-large-en-v1.5. https://huggingface.co/BAAI/bge-large-en-v1.5 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE177. Publication date: July 2026. FSE177:22 Z. Li, Q. Zhang, H. Liu, X. Gu, X. Lu, H. Liu, B. Wang, H. Fan, Z. Zhao, R. Zhang, and L. Zhou

2025
[27]

Hugging Face. 2025. intfloat/e5-large-v2. https://huggingface.co/intfloat/e5-large-v2

2025
[28]

Hugging Face. 2025. Qwen/Qwen2.5-Coder-3B. https://huggingface.co/Qwen/Qwen2.5-Coder-3B

2025
[29]

Hugging Face. 2025. Qwen/Qwen3-Embedding-8B. https://huggingface.co/Qwen/Qwen3-Embedding-8B

2025
[30]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In EMNLP (Findings) (Findings of ACL, Vol. EMNLP 2020) . Association for Computational Linguistics, 1536–1547. doi:10.18653/V1/2020.FINDINGS-EMNLP.139

work page doi:10.18653/v1/2020.findings-emnlp.139 2020
[31]

Luyu Gao, Aman Madaan, Shuyan Zhou, et al. 2023. PAL: Program-aided Language Models. In ICML (Proceedings of Machine Learning Research, Vol. 202). PMLR, 10764–10799

2023
[32]

Ivan Gavran, Eva Darulova, and Rupak Majumdar. 2020. Interactive synthesis of temporal specifications from examples and natural language. Proc. ACM Program. Lang. 4, OOPSLA (2020), 201:1–201:26. doi:10.1145/3428269

work page doi:10.1145/3428269 2020
[33]

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024). doi:10.48550/ARXIV.2406.12793

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12793 2024
[34]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025). doi:10.48550/ARXIV.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
[35]

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In ACL (1). Association for Computational Linguistics, 4524–4535. doi:10.18653/V1/P19-1444

work page doi:10.18653/v1/p19-1444 2019
[36]

Dick Hardt. 2012. The OAuth 2.0 Authorization Framework. RFC 6749 (2012), 1–76. doi:10.17487/RFC6749

work page doi:10.17487/rfc6749 2012
[37]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In ICLR. OpenReview.net

2021
[38]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024). doi:10.48550/ARXIV. 2410.21276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024
[39]

Nadim Kobeissi, Georgio Nicolas, and Mukesh Tiwari. 2020. Verifpal: Cryptographic Protocol Analysis for the Real World. In INDOCRYPT (Lecture Notes in Computer Science, Vol. 12578) . Springer, 151–202. doi:10.1007/978-3-030-65277- 7_8

work page doi:10.1007/978-3-030-65277- 2020
[40]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS. 1–16

2020
[41]

Tingting Li, Ziming Zhao, Zhaoxuan Li, Xiaofei Yue, and Jiongchi Yu. 2026. Fair and Carbon-Aware LLM Routing for Web Services. In WWW. ACM, 9563–9571. doi:10.1145/3774904.3793001

work page doi:10.1145/3774904.3793001 2026
[42]

Yujia Li, David Choi, Junyoung Chung, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097. doi:10.48550/ARXIV.2203.07814

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.07814 2022
[43]

Yixuan Li, Julian Parsert, and Elizabeth Polgreen. 2024. Guiding Enumerative Program Synthesis with Large Language Models. In CA V (2) (Lecture Notes in Computer Science, Vol. 14682). Springer, 280–301. doi:10.1007/978-3-031-65630-9_15

work page doi:10.1007/978-3-031-65630-9_15 2024
[44]

Zhaoxuan Li, Rui Zhang, and Pengchao Li. 2020. A Secure and Efficient Smart Contract Execution Scheme. In ICWS (Lecture Notes in Computer Science) . Springer, 17–32. doi:10.1007/978-3-030-59618-7_2

work page doi:10.1007/978-3-030-59618-7_2 2020
[45]

Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, and Michael D Ernst. 2017. Program synthesis from natural language using recurrent neural networks. University of Washington Department of Computer Science and Engineering, Seattle, W A, USA, Tech. Rep. UW-CSE-17-031 (2017), 1–12

2017
[46]

Liu, Kevin Lin, John Hewitt, et al

Nelson F. Liu, Kevin Lin, John Hewitt, et al. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. doi:10.1162/TACL_A_00638

work page doi:10.1162/tacl_a_00638 2024
[47]

Gavin Lowe. 1995. An Attack on the Needham-Schroeder Public-Key Authentication Protocol. Inf. Process. Lett. 56, 3 (1995), 131–133. doi:10.1016/0020-0190(95)00144-2

work page doi:10.1016/0020-0190(95)00144-2 1995
[48]

Siqi Lu, Hanjie Dong, Zhaoxuan Li, and Laurence T. Yang. 2024. Not Just Summing: The Identifier Leakage of Private-Join-and-Compute and its Improvement. IEEE Trans. Dependable Secur. Comput. 21, 6 (2024), 5143–5155. doi:10.1109/TDSC.2024.3371569

work page doi:10.1109/tdsc.2024.3371569 2024
[49]

Siqi Lu, Zhaoxuan Li, Xuyang Miao, Qingdi Han, and Jianhua Zheng. 2023. PIWS: Private Intersection Weighted Sum Protocol for Privacy-Preserving Score-Based Voting With Perfect Ballot Secrecy. IEEE Trans. Comput. Soc. Syst. 10, 3 (2023), 1039–1056. doi:10.1109/TCSS.2022.3162869

work page doi:10.1109/tcss.2022.3162869 2023
[51]

Ziyu Mao, Jingyi Wang, Jun Sun, Shengchao Qin, and Jiawen Xiong. 2025. LLM-aided Automatic Modeling for Security Protocol Verification. In ICSE. doi:10.1109/ICSE55347.2025.00197 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE177. Publication date: July 2026. Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes FSE177:23

work page doi:10.1109/icse55347.2025.00197 2025
[53]

Meta. 2025. Llama 4. https://www.llama.com/models/llama-4/

2025
[54]

Needham and Michael D

Roger M. Needham and Michael D. Schroeder. 1978. Using Encryption for Authentication in Large Networks of Computers. Commun. ACM 21, 12 (1978), 993–999. doi:10.1145/359657.359659

work page doi:10.1145/359657.359659 1978
[55]

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, et al . 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In ICLR. OpenReview.net

2023
[56]

OpenAI. 2025. GPT-5.1: A smarter, more conversational ChatGPT. https://openai.com/index/gpt-5-1/

2025
[57]

Stanislas Polu and Ilya Sutskever. 2020. Generative Language Modeling for Automated Theorem Proving. CoRR abs/2009.03393 (2020)

Pith/arXiv arXiv 2020
[58]

Colin Raffel, Noam Shazeer, Adam Roberts, et al . 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.J. Mach. Learn. Res. 21 (2020), 140:1–140:67

2020
[59]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 3980–3990. doi:10.18653/V1/D19-1410

work page doi:10.18653/v1/d19-1410 2019
[60]

Raúl Rojas. 2015. A Tutorial Introduction to the Lambda Calculus. CoRR abs/1503.09060 (2015)

Pith/arXiv arXiv 2015
[61]

Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. CoRR abs/2507.06261 (2025). doi:10.48550/ARXIV.2507.06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261 2025
[62]

Haiming Wang, Ye Yuan, Zhengying Liu, et al. 2023. DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function. In ACL (1). Association for Computational Linguistics, 12632–12646. doi:10.18653/V1/2023.ACL-LONG.706

work page doi:10.18653/v1/2023.acl-long.706 2023
[63]

Jason Wei, Yi Tay, Rishi Bommasani, et al. 2022. Emergent Abilities of Large Language Models. Trans. Mach. Learn. Res. 2022 (2022)

2022
[64]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS

2022
[65]

xAI. 2025. Grok 3. https://x.ai/grok

2025
[66]

Swope, Alex Gu, et al

Kaiyu Yang, Aidan M. Swope, Alex Gu, et al. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. In NeurIPS

2023
[67]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In NeurIPS

2023
[68]

Tatu Ylönen and Chris Lonvick. 2006. The Secure Shell (SSH) Authentication Protocol. RFC 4252 (2006), 1–17. doi:10.17487/RFC4252

work page doi:10.17487/rfc4252 2006
[69]

Xiaofei Yue, Fangming Zhao, Fulun Ye, Jiongchi Yu, Zhaoxuan Li, Tingting Li, Ziming Zhao, and Jianwei Yin. 2026. HeteroSim: Towards High-Fidelity Heterogeneous LLM Training Simulation on GPUs. In WWW. ACM, 5189–5197. doi:10.1145/3774904.3792254

work page doi:10.1145/3774904.3792254 2026
[70]

Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2023. Multilingual Code Co-evolution using Large Language Models. In ESEC/SIGSOFT FSE. ACM, 695–707. doi:10.1145/3611643.3616350

work page doi:10.1145/3611643.3616350 2023
[71]

Rui Zhang, Zhaoxuan Li, and Lijuan Zheng. 2021. Secure and Efficient Key Hierarchical Management and Collaborative Signature Schemes of Blockchain. In ICAIS (2) (Lecture Notes in Computer Science) . Springer, 332–345. doi:10.1007/978- 3-030-78612-0_27

work page doi:10.1007/978- 2021
[72]

Ziming Zhao, Zhaoxuan Li, Tingting Li, and Fan Zhang. 2025. CyberLLM: Enable Mapping CVE to Tactics and Techniques of Cyber Threats via LLM. In DASFAA (5) (Lecture Notes in Computer Science) . Springer, 473–488. doi:10. 1007/978-981-95-4155-3_33

2025
[73]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS

2023
[74]

Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024). doi:10.48550/ARXIV.2406.11931 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. F...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.11931 2024

[1] [1]

Nada AlMarwani and Mona T. Diab. 2021. Discrete Cosine Transform as Universal Sentence Encoder. In ACL/IJCNLP (2). Association for Computational Linguistics, 419–426. doi:10.18653/V1/2021.ACL-SHORT.53

work page doi:10.18653/v1/2021.acl-short.53 2021

[2] [2]

Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Repre- sentations of Code. In ICLR (Poster). OpenReview.net

2019

[3] [3]

Anthropic. 2024. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

2024

[4] [4]

Alessandro Armando, David A. Basin, Yohan Boichut, Yannick Chevalier, Luca Compagna, Jorge Cuéllar, Paul Hankes Drielsma, Pierre-Cyrille Héam, Olga Kouchnarenko, Jacopo Mantovani, Sebastian Mödersheim, David von Oheimb, Michaël Rusinowitch, Judson Santiago, Mathieu Turuani, Luca Viganò, and Laurent Vigneron. 2005. The AVISPA Tool for the Automated Validat...

work page doi:10.1007/11513988_27 2005

[5] [5]

Matteo Avalle, Alfredo Pironti, and Riccardo Sisto. 2014. Formal verification of security protocol implementations: a survey. Formal Aspects Comput. 26, 1 (2014), 99–123. doi:10.1007/S00165-012-0269-9

work page doi:10.1007/s00165-012-0269-9 2014

[6] [6]

Ayers, Dragomir Radev, and Jeremy Avigad

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad

[7] [7]

CoRR abs/2302.12433 (2023)

ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. CoRR abs/2302.12433 (2023). doi:10.48550/ARXIV.2302.12433

work page doi:10.48550/arxiv.2302.12433 2023

[8] [8]

Manuel Barbosa, Gilles Barthe, Karthik Bhargavan, Bruno Blanchet, Cas Cremers, Kevin Liao, and Bryan Parno. 2021. SoK: Computer-Aided Cryptography. In SP. IEEE, 777–795. doi:10.1109/SP40001.2021.00008

work page doi:10.1109/sp40001.2021.00008 2021

[9] [9]

Gilles Barthe, Benjamin Grégoire, Sylvain Heraud, and Santiago Zanella-Béguelin. 2011. Computer-Aided Security Proofs for the Working Cryptographer. In CRYPTO (Lecture Notes in Computer Science, Vol. 6841) . Springer, 71–90. doi:10.1007/978-3-642-22792-9_5

work page doi:10.1007/978-3-642-22792-9_5 2011

[10] [10]

Basin, Jannik Dreier, Lucca Hirschi, Sasa Radomirovic, Ralf Sasse, and Vincent Stettler

David A. Basin, Jannik Dreier, Lucca Hirschi, Sasa Radomirovic, Ralf Sasse, and Vincent Stettler. 2018. A Formal Analysis of 5G Authentication. In CCS. ACM, 1383–1396. doi:10.1145/3243734.3243846

work page doi:10.1145/3243734.3243846 2018

[11] [11]

Yves Bertot and Pierre Castéran. 2004. Interactive Theorem Proving and Program Development - Coq’Art: The Calculus of Inductive Constructions. Springer

2004

[12] [12]

Bruno Blanchet. 2001. An Efficient Cryptographic Protocol Verifier Based on Prolog Rules. In CSFW. IEEE Computer Society, 82–96. doi:10.1109/CSFW.2001.930138

work page doi:10.1109/csfw.2001.930138 2001

[13] [13]

Bruno Blanchet. 2006. A Computationally Sound Mechanized Prover for Security Protocols. In S&P. IEEE Computer Society, 140–154. doi:10.1109/SP.2006.1

work page doi:10.1109/sp.2006.1 2006

[14] [14]

Tom Brown, Benjamin Mann, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. 1877–1901

2020

[15] [15]

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. CoRR abs/2303.12712 (2023). doi:10.48550/ARXIV.2303.12712

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712 2023

[16] [16]

Mark Chen, Jerry Tworek, Heewoo Jun, et al . 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021)

Pith/arXiv arXiv 2021

[17] [17]

Yongchao Chen, Rujul Gandhi, Yang Zhang, and Chuchu Fan. 2023. NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models. In EMNLP. Association for Computational Linguistics, 15880–15903. doi:10.18653/V1/2023.EMNLP-MAIN.985

work page doi:10.18653/v1/2023.emnlp-main.985 2023

[18] [18]

Vincent Cheval, Charlie Jacomme, Steve Kremer, and Robert Künnemann. 2022. SAPIC+: protocol verifiers of the world, unite!. In USENIX Security Symposium. USENIX Association, 3935–3952

2022

[19] [19]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2023. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 24 (2023), 240:1–240:113

2023

[20] [20]

Véronique Cortier, Steve Kremer, and Bogdan Warinschi. 2011. A Survey of Symbolic Methods in Computational Analysis of Cryptographic Systems. J. Autom. Reason. 46, 3-4 (2011), 225–259. doi:10.1007/S10817-010-9187-9

work page doi:10.1007/s10817-010-9187-9 2011

[21] [21]

Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models. In CA V (2) (Lecture Notes in Computer Science, Vol. 13965) . Springer, 383–396. doi:10.1007/978-3-031-37703-7_18

work page doi:10.1007/978-3-031-37703-7_18 2023

[22] [22]

Cas J. F. Cremers. 2008. The Scyther Tool: Verification, Falsification, and Analysis of Security Protocols. InCA V (Lecture Notes in Computer Science, Vol. 5123) . Springer, 414–418. doi:10.1007/978-3-540-70545-1_38

work page doi:10.1007/978-3-540-70545-1_38 2008

[23] [23]

Tim Dierks and Eric Rescorla. 2008. The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (2008), 1–104. doi:10.17487/RFC5246

work page doi:10.17487/rfc5246 2008

[24] [24]

Santiago Escobar, Catherine Meadows, and José Meseguer. 2007. Maude-NPA: Cryptographic Protocol Analysis Modulo Equational Properties. In FOSAD (Lecture Notes in Computer Science, Vol. 5705) . Springer, 1–50. doi:10.1007/978-3-642- 03829-7_1

work page doi:10.1007/978-3-642- 2007

[25] [25]

Eval4LLMs. 2025. The code and experiment results of CrypFormBench. [EB/OL]. https://github.com/Eval4LLMs/ CrypFormBench/tree/main

2025

[26] [26]

Hugging Face. 2025. BAAI/bge-large-en-v1.5. https://huggingface.co/BAAI/bge-large-en-v1.5 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE177. Publication date: July 2026. FSE177:22 Z. Li, Q. Zhang, H. Liu, X. Gu, X. Lu, H. Liu, B. Wang, H. Fan, Z. Zhao, R. Zhang, and L. Zhou

2025

[27] [27]

Hugging Face. 2025. intfloat/e5-large-v2. https://huggingface.co/intfloat/e5-large-v2

2025

[28] [28]

Hugging Face. 2025. Qwen/Qwen2.5-Coder-3B. https://huggingface.co/Qwen/Qwen2.5-Coder-3B

2025

[29] [29]

Hugging Face. 2025. Qwen/Qwen3-Embedding-8B. https://huggingface.co/Qwen/Qwen3-Embedding-8B

2025

[30] [30]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In EMNLP (Findings) (Findings of ACL, Vol. EMNLP 2020) . Association for Computational Linguistics, 1536–1547. doi:10.18653/V1/2020.FINDINGS-EMNLP.139

work page doi:10.18653/v1/2020.findings-emnlp.139 2020

[31] [31]

Luyu Gao, Aman Madaan, Shuyan Zhou, et al. 2023. PAL: Program-aided Language Models. In ICML (Proceedings of Machine Learning Research, Vol. 202). PMLR, 10764–10799

2023

[32] [32]

Ivan Gavran, Eva Darulova, and Rupak Majumdar. 2020. Interactive synthesis of temporal specifications from examples and natural language. Proc. ACM Program. Lang. 4, OOPSLA (2020), 201:1–201:26. doi:10.1145/3428269

work page doi:10.1145/3428269 2020

[33] [33]

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024). doi:10.48550/ARXIV.2406.12793

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.12793 2024

[34] [34]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025). doi:10.48550/ARXIV.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025

[35] [35]

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In ACL (1). Association for Computational Linguistics, 4524–4535. doi:10.18653/V1/P19-1444

work page doi:10.18653/v1/p19-1444 2019

[36] [36]

Dick Hardt. 2012. The OAuth 2.0 Authorization Framework. RFC 6749 (2012), 1–76. doi:10.17487/RFC6749

work page doi:10.17487/rfc6749 2012

[37] [37]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. In ICLR. OpenReview.net

2021

[38] [38]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024). doi:10.48550/ARXIV. 2410.21276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024

[39] [39]

Nadim Kobeissi, Georgio Nicolas, and Mukesh Tiwari. 2020. Verifpal: Cryptographic Protocol Analysis for the Real World. In INDOCRYPT (Lecture Notes in Computer Science, Vol. 12578) . Springer, 151–202. doi:10.1007/978-3-030-65277- 7_8

work page doi:10.1007/978-3-030-65277- 2020

[40] [40]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS. 1–16

2020

[41] [41]

Tingting Li, Ziming Zhao, Zhaoxuan Li, Xiaofei Yue, and Jiongchi Yu. 2026. Fair and Carbon-Aware LLM Routing for Web Services. In WWW. ACM, 9563–9571. doi:10.1145/3774904.3793001

work page doi:10.1145/3774904.3793001 2026

[42] [42]

Yujia Li, David Choi, Junyoung Chung, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097. doi:10.48550/ARXIV.2203.07814

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.07814 2022

[43] [43]

Yixuan Li, Julian Parsert, and Elizabeth Polgreen. 2024. Guiding Enumerative Program Synthesis with Large Language Models. In CA V (2) (Lecture Notes in Computer Science, Vol. 14682). Springer, 280–301. doi:10.1007/978-3-031-65630-9_15

work page doi:10.1007/978-3-031-65630-9_15 2024

[44] [44]

Zhaoxuan Li, Rui Zhang, and Pengchao Li. 2020. A Secure and Efficient Smart Contract Execution Scheme. In ICWS (Lecture Notes in Computer Science) . Springer, 17–32. doi:10.1007/978-3-030-59618-7_2

work page doi:10.1007/978-3-030-59618-7_2 2020

[45] [45]

Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, and Michael D Ernst. 2017. Program synthesis from natural language using recurrent neural networks. University of Washington Department of Computer Science and Engineering, Seattle, W A, USA, Tech. Rep. UW-CSE-17-031 (2017), 1–12

2017

[46] [46]

Liu, Kevin Lin, John Hewitt, et al

Nelson F. Liu, Kevin Lin, John Hewitt, et al. 2024. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguistics 12 (2024), 157–173. doi:10.1162/TACL_A_00638

work page doi:10.1162/tacl_a_00638 2024

[47] [47]

Gavin Lowe. 1995. An Attack on the Needham-Schroeder Public-Key Authentication Protocol. Inf. Process. Lett. 56, 3 (1995), 131–133. doi:10.1016/0020-0190(95)00144-2

work page doi:10.1016/0020-0190(95)00144-2 1995

[48] [48]

Siqi Lu, Hanjie Dong, Zhaoxuan Li, and Laurence T. Yang. 2024. Not Just Summing: The Identifier Leakage of Private-Join-and-Compute and its Improvement. IEEE Trans. Dependable Secur. Comput. 21, 6 (2024), 5143–5155. doi:10.1109/TDSC.2024.3371569

work page doi:10.1109/tdsc.2024.3371569 2024

[49] [49]

Siqi Lu, Zhaoxuan Li, Xuyang Miao, Qingdi Han, and Jianhua Zheng. 2023. PIWS: Private Intersection Weighted Sum Protocol for Privacy-Preserving Score-Based Voting With Perfect Ballot Secrecy. IEEE Trans. Comput. Soc. Syst. 10, 3 (2023), 1039–1056. doi:10.1109/TCSS.2022.3162869

work page doi:10.1109/tcss.2022.3162869 2023

[50] [51]

Ziyu Mao, Jingyi Wang, Jun Sun, Shengchao Qin, and Jiawen Xiong. 2025. LLM-aided Automatic Modeling for Security Protocol Verification. In ICSE. doi:10.1109/ICSE55347.2025.00197 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE177. Publication date: July 2026. Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes FSE177:23

work page doi:10.1109/icse55347.2025.00197 2025

[51] [53]

Meta. 2025. Llama 4. https://www.llama.com/models/llama-4/

2025

[52] [54]

Needham and Michael D

Roger M. Needham and Michael D. Schroeder. 1978. Using Encryption for Authentication in Large Networks of Computers. Commun. ACM 21, 12 (1978), 993–999. doi:10.1145/359657.359659

work page doi:10.1145/359657.359659 1978

[53] [55]

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, et al . 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In ICLR. OpenReview.net

2023

[54] [56]

OpenAI. 2025. GPT-5.1: A smarter, more conversational ChatGPT. https://openai.com/index/gpt-5-1/

2025

[55] [57]

Stanislas Polu and Ilya Sutskever. 2020. Generative Language Modeling for Automated Theorem Proving. CoRR abs/2009.03393 (2020)

Pith/arXiv arXiv 2020

[56] [58]

Colin Raffel, Noam Shazeer, Adam Roberts, et al . 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.J. Mach. Learn. Res. 21 (2020), 140:1–140:67

2020

[57] [59]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 3980–3990. doi:10.18653/V1/D19-1410

work page doi:10.18653/v1/d19-1410 2019

[58] [60]

Raúl Rojas. 2015. A Tutorial Introduction to the Lambda Calculus. CoRR abs/1503.09060 (2015)

Pith/arXiv arXiv 2015

[59] [61]

Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. CoRR abs/2507.06261 (2025). doi:10.48550/ARXIV.2507.06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261 2025

[60] [62]

Haiming Wang, Ye Yuan, Zhengying Liu, et al. 2023. DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function. In ACL (1). Association for Computational Linguistics, 12632–12646. doi:10.18653/V1/2023.ACL-LONG.706

work page doi:10.18653/v1/2023.acl-long.706 2023

[61] [63]

Jason Wei, Yi Tay, Rishi Bommasani, et al. 2022. Emergent Abilities of Large Language Models. Trans. Mach. Learn. Res. 2022 (2022)

2022

[62] [64]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS

2022

[63] [65]

xAI. 2025. Grok 3. https://x.ai/grok

2025

[64] [66]

Swope, Alex Gu, et al

Kaiyu Yang, Aidan M. Swope, Alex Gu, et al. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. In NeurIPS

2023

[65] [67]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In NeurIPS

2023

[66] [68]

Tatu Ylönen and Chris Lonvick. 2006. The Secure Shell (SSH) Authentication Protocol. RFC 4252 (2006), 1–17. doi:10.17487/RFC4252

work page doi:10.17487/rfc4252 2006

[67] [69]

Xiaofei Yue, Fangming Zhao, Fulun Ye, Jiongchi Yu, Zhaoxuan Li, Tingting Li, Ziming Zhao, and Jianwei Yin. 2026. HeteroSim: Towards High-Fidelity Heterogeneous LLM Training Simulation on GPUs. In WWW. ACM, 5189–5197. doi:10.1145/3774904.3792254

work page doi:10.1145/3774904.3792254 2026

[68] [70]

Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2023. Multilingual Code Co-evolution using Large Language Models. In ESEC/SIGSOFT FSE. ACM, 695–707. doi:10.1145/3611643.3616350

work page doi:10.1145/3611643.3616350 2023

[69] [71]

Rui Zhang, Zhaoxuan Li, and Lijuan Zheng. 2021. Secure and Efficient Key Hierarchical Management and Collaborative Signature Schemes of Blockchain. In ICAIS (2) (Lecture Notes in Computer Science) . Springer, 332–345. doi:10.1007/978- 3-030-78612-0_27

work page doi:10.1007/978- 2021

[70] [72]

Ziming Zhao, Zhaoxuan Li, Tingting Li, and Fan Zhang. 2025. CyberLLM: Enable Mapping CVE to Tactics and Techniques of Cyber Threats via LLM. In DASFAA (5) (Lecture Notes in Computer Science) . Springer, 473–488. doi:10. 1007/978-981-95-4155-3_33

2025

[71] [73]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS

2023

[72] [74]

Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931 (2024). doi:10.48550/ARXIV.2406.11931 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. F...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.11931 2024