
arxiv: 2606.07341 · v1 · pith:NCKWF6BNnew · submitted 2026-06-05 · 💻 cs.CR

Empirical Evaluation of Large Language Models for Migration of Code Fragments to Post-Quantum Cryptography

Pith reviewed 2026-06-27 21:51 UTC · model grok-4.3

classification 💻 cs.CR
keywords post-quantum cryptography · large language models · code migration · fine-tuning · cryptographic code · functional correctness · synthetic dataset · Python

The pith

Fine-tuned GPT-4.1-mini migrates pre-quantum cryptographic code to post-quantum versions at 92.5 percent functional correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether large language models can convert fragments of pre-quantum cryptographic code into post-quantum equivalents while keeping the same behavior. The authors built a dataset of 800 paired Python examples spanning six cryptographic families plus multi-primitive cases, each checked by category-specific functional tests. Four models were compared: one in zero-shot mode and three after domain-specific fine-tuning. The fine-tuned GPT-4.1-mini produced the highest static similarity and passed dynamic tests 92.5 percent of the time, beating the zero-shot baseline by a wide margin. Tests on six open-source repositories confirmed usefulness for isolated modules yet exposed limits when dependencies span multiple files.
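To make the task concrete, the sketch below shows what one paired fragment and a category-specific check might look like for the hash family. It is illustrative only: the function names, the choice of MD5 as the source primitive, and the check itself are assumptions, not items taken from the authors' dataset. SHA3-256 is used as the target because the paper's text names it as the default for hash migrations.

```python
import hashlib


def fingerprint_legacy(data: bytes) -> str:
    """Hypothetical pre-migration fragment: hex digest computed with MD5."""
    return hashlib.md5(data).hexdigest()


def fingerprint_migrated(data: bytes) -> str:
    """Hypothetical migrated fragment: same interface, SHA3-256 underneath."""
    return hashlib.sha3_256(data).hexdigest()


def check_hash_fragment(fn, digest_hex_len: int) -> bool:
    """Hypothetical category-specific test for hash fragments.

    Checks the output type, the digest length, determinism, and that two
    different inputs do not map to the same digest.
    """
    first = fn(b"message")
    second = fn(b"message")
    other = fn(b"other message")
    return (
        isinstance(first, str)
        and len(first) == digest_hex_len
        and first == second
        and first != other
    )


if __name__ == "__main__":
    # MD5 yields 32 hex characters, SHA3-256 yields 64.
    print(check_hash_fragment(fingerprint_legacy, 32))
    print(check_hash_fragment(fingerprint_migrated, 64))
```

A check of this kind only confirms interface-level behavior. It says nothing about whether callers that store or compare the old 32-character digests still work, which is the sort of dependency the repository experiments probe.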

Core claim

Domain-specific fine-tuning of large language models enables reliable migration of pre-quantum cryptographic code fragments to post-quantum counterparts. On a synthetic dataset of 800 validated pairs the fine-tuned GPT-4.1-mini reached a mean static similarity of 0.9072 and 92.5 percent dynamic functional correctness, substantially outperforming zero-shot GPT-4.1. The same model generated useful migrations in localized modules of real repositories while revealing difficulties with complex cross-module dependencies.

What carries the argument

A reproducible experimental framework centered on a synthetic dataset of 800 paired Python code fragments, each validated by category-specific functional tests, that measures both static code similarity and dynamic functional correctness of model outputs.
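The dynamic half of that measurement can be sketched as a pass-rate computation over model outputs. This is a minimal illustration, not the authors' harness: `passes`, `dynamic_correctness`, `hash_test`, and the sample candidates are hypothetical names, and `exec` is used only for brevity. Untrusted model output should be executed in a sandboxed process.

```python
from typing import Callable, Dict, List


def passes(candidate_src: str, test: Callable[[Dict], None]) -> bool:
    """Execute one candidate fragment and run a functional test against it.

    Any exception (syntax error, runtime error, failed assertion) counts
    as a failed migration.
    """
    namespace: Dict = {}
    try:
        exec(candidate_src, namespace)  # assumption: caller provides isolation
        test(namespace)
        return True
    except Exception:
        return False


def dynamic_correctness(candidates: List[str], test: Callable[[Dict], None]) -> float:
    """Fraction of candidates that pass the functional test."""
    if not candidates:
        raise ValueError("no candidates to evaluate")
    return sum(passes(src, test) for src in candidates) / len(candidates)


def hash_test(ns: Dict) -> None:
    """Hypothetical category test: a 32-byte, deterministic digest."""
    digest = ns["digest"](b"abc")
    assert isinstance(digest, bytes) and len(digest) == 32
    assert digest == ns["digest"](b"abc")


# Three made-up model outputs: one correct, one wrong, one that does not parse.
CANDIDATES = [
    "import hashlib\ndef digest(data):\n    return hashlib.sha3_256(data).digest()\n",
    "def digest(data):\n    return data\n",
    "def digest(data) return data\n",
]

if __name__ == "__main__":
    print(f"{dynamic_correctness(CANDIDATES, hash_test):.3f}")
```

Treating every exception as a failure keeps the metric conservative: a fragment that does not parse is scored the same as one that returns a wrong value.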

If this is right

  • Domain-specific fine-tuning is essential for reliable cryptographic migration performance.
  • Fine-tuned LLMs can serve as practical components inside crypto-agile migration pipelines when paired with automated verification.
  • The method produces useful migrations inside localized cryptographic modules of open-source projects.
  • Larger projects with complex dependencies and cross-module interactions remain challenging for the current approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the dataset to additional languages such as C or Java would test whether the same fine-tuning gains appear outside Python.
  • Combining the LLM migration step with static dependency-graph analysis could reduce failures on multi-module codebases.
  • Measuring how often the generated post-quantum code passes the same test suite used for the original fragments on a broader set of libraries would give a clearer picture of generalization.
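The dependency-graph idea in the second bullet could start from something as small as the sketch below, which uses Python's standard `ast` module to find project modules that reach a cryptographic library directly or through other project modules. It is an editorial illustration under stated assumptions: the module names in `SOURCES`, the helper names, and the choice to track only top-level absolute imports are all hypothetical, and none of this is described in the paper.

```python
import ast
from typing import Dict, Set


def imported_modules(source: str) -> Set[str]:
    """Top-level names of absolute imports found in one source file."""
    found: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def crypto_dependents(sources: Dict[str, str], crypto_modules: Set[str]) -> Set[str]:
    """Project modules that depend on a crypto library, directly or transitively."""
    graph = {name: imported_modules(src) for name, src in sources.items()}
    affected = {name for name, deps in graph.items() if deps & crypto_modules}
    changed = True
    while changed:  # propagate until no new dependent module is found
        changed = False
        for name, deps in graph.items():
            if name not in affected and deps & affected:
                affected.add(name)
                changed = True
    return affected


# Hypothetical three-module project.
SOURCES = {
    "keys": "import hashlib\n",
    "session": "from keys import derive\n",
    "ui": "import json\n",
}

if __name__ == "__main__":
    print(sorted(crypto_dependents(SOURCES, {"hashlib"})))
```

The resulting set marks which files a migration tool would need to see together, so that a fragment is not rewritten in isolation from its callers.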

Load-bearing premise

The synthetic dataset of 800 paired code fragments, validated only through category-specific functional tests, sufficiently represents the structure, dependencies, and edge cases found in real-world cryptographic modules.

What would settle it

Running the fine-tuned GPT-4.1-mini on a large real-world cryptographic library and measuring whether dynamic functional correctness falls below 80 percent on modules with cross-file dependencies would test the central performance claim.

Figures

Figures reproduced from arXiv: 2606.07341 by Ana I. González-Tablas, Javier Pallarés de Bonrostro, María Isabel González Vasco.

Figure 1: Overall methodological pipeline of the study.

Figure 2: Dataset Generation Phase. The internal flow starts with the manual creation of migrations and continues with synthetic generation and validation of the resulting corpus.
Adjacent text: … encryption. Hash functions follow a similar logic: while SHA-2 remains viable with sufficiently long outputs, SHA-3 offers a more modern structure and stronger security margin [8]. Accordingly, SHA3-256 is used as the default target for hash…

Figure 4: Instruction-style prompt for CodeLlama models.

Figure 5: LLM Preparation Phase. The validated synthetic dataset is split into train and validation subsets. Only the train split is used for dataset adaptation and model-specific fine-tuning, while the validation split is reserved for the evaluation phase. GPT-3.5 Turbo, GPT-4.1 Mini, and CodeLlama-7B Instruct undergo fine-tuning, whereas GPT-4.1 is included as a zero-shot reference and bypasses the preparation…

Figure 6: LLM Evaluation Phase. The prepared LLMs are evaluated on the validation dataset using both static similarity metrics and automated functional tests. Their results are then compared through a multi-criteria analysis, leading to the selection of the best-performing LLM for the final real-world validation stage.
Adjacent text: Metrics. As already introduced, two complementary evaluation dimensions were considered. The first…

Figure 7: Static similarity distributions for pre- to post-quantum code migration. The horizontal axis represents…

Figure 8: Dynamic functional correctness across model configurations. Each subfigure shows the distribution of…

Figure 9: Similarity distribution with predominant cate…

Figure 11: Real-world Validation Phase. The best-performing LLM selected in the previous phase is applied to a set of selected open-source repositories. Repository-specific tests are then used to assess practical migration behavior under realistic dependency and modularity constraints, producing a structured summary of real-world migration outcomes.
Adjacent text: Real-world repository selection. Beyond controlled evaluation on …

Figure 12: Position of the proposed LLM-assisted migra…

Figure 13: GPT-4.1-Mini–assisted post-quantum migra…
Original abstract

The transition to post-quantum cryptography (PQC) requires not only replacing vulnerable cryptographic primitives, but also refactoring the surrounding software logic. While existing PQC migration frameworks provide organizational guidance, practical code-level remediation remains largely manual and error-prone. This paper evaluates whether large language models (LLMs) can be trained to assist in the migration of pre-quantum cryptographic code fragments to post-quantum counterparts while preserving functional correctness. To this end, we introduce a reproducible experimental framework built around a synthetic dataset of 800 paired Python code fragments covering six cryptographic families and combined multi-primitive cases. Each pair is validated through category-specific functional tests, enabling both dataset quality control and objective evaluation of model-generated migrations. Four models are assessed: GPT-4.1 in a zero-shot setting, and fine-tuned versions of GPT-3.5-turbo, GPT-4.1-mini, and CodeLlama-7B-Instruct. The results show that domain-specific fine-tuning is essential for reliable cryptographic migration. The fine-tuned GPT-4.1-mini model achieves the best overall performance, with a mean static similarity of 0.9072 and a dynamic functional correctness rate of 92.5%, substantially outperforming the zero-shot baseline. A complementary validation on six open-source repositories further shows that the approach can produce useful migrations in localized cryptographic modules, while also revealing limitations in larger projects with complex dependencies and cross-module interactions. These findings suggest that fine-tuned LLMs can serve as practical components in future crypto-agile migration pipelines, provided they are coupled with automated verification and dependency-aware validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a reproducible experimental framework and a synthetic dataset of 800 paired Python code fragments covering six cryptographic families plus multi-primitive cases. It evaluates zero-shot GPT-4.1 against fine-tuned GPT-3.5-turbo, GPT-4.1-mini, and CodeLlama-7B-Instruct on migrating pre-quantum cryptographic fragments to post-quantum counterparts, reporting that the fine-tuned GPT-4.1-mini achieves the highest mean static similarity (0.9072) and dynamic functional correctness (92.5%). A secondary check on six open-source repositories shows localized utility but degraded results on complex dependency structures.

Significance. If the central empirical claims hold, the work demonstrates that domain-specific fine-tuning enables LLMs to produce functionally correct PQC migrations at the fragment level and supplies a reusable test harness with both static similarity and dynamic execution metrics. The explicit real-repository validation and the paper's own acknowledgment of cross-module limitations are positive features that strengthen the contribution relative to purely synthetic studies.

major comments (2)
  1. [Dataset section] Synthetic 800-pair construction: the headline static similarity of 0.9072 and dynamic correctness of 92.5% rest on category-specific functional tests whose coverage of call-pattern changes, exception handling, and multi-primitive interactions is not independently quantified. The paper itself reports degraded performance on real repositories with cross-module dependencies, which indicates that the synthetic regime may systematically understate task difficulty.
  2. [Real-repository validation paragraph] The claim that the approach 'can produce useful migrations in localized cryptographic modules' is supported only by qualitative description. Without quantitative static and dynamic scores comparable to the synthetic 0.9072 / 92.5% figures, it is impossible to calibrate how far the primary results generalize to production cryptographic code.
minor comments (2)
  1. [Abstract] Model list: 'GPT-4.1' is used for the zero-shot baseline while 'GPT-4.1-mini' appears in the fine-tuned results; consistent naming and explicit parameter counts or API versions would improve clarity.
  2. [Evaluation metrics paragraph] The precise definition of 'static similarity' (e.g., which token or AST metric) is given only at a high level; adding the exact formula or library call would aid reproducibility.
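Since the metric is not pinned down in this summary, one plausible reading of 'static similarity' is a normalized sequence-matching ratio between the reference migration and the model output. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. That choice, the whitespace normalization, and the helper names are assumptions made for illustration and may differ from the metric the paper actually reports.

```python
import difflib
from typing import Iterable, Tuple


def _normalize(code: str) -> str:
    """Drop blank lines and trailing whitespace so layout noise is ignored."""
    lines = (line.rstrip() for line in code.strip().splitlines())
    return "\n".join(line for line in lines if line)


def static_similarity(reference: str, candidate: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, _normalize(reference), _normalize(candidate), autojunk=False
    )
    return matcher.ratio()


def mean_static_similarity(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average similarity over (reference, candidate) pairs."""
    scores = [static_similarity(ref, cand) for ref, cand in pairs]
    if not scores:
        raise ValueError("no pairs to score")
    return sum(scores) / len(scores)


# Hypothetical reference migration and two model outputs.
REFERENCE = "import hashlib\nh = hashlib.sha3_256(data).hexdigest()\n"
SAME_WITH_NOISE = "import hashlib\n\nh = hashlib.sha3_256(data).hexdigest()   \n"
NOT_MIGRATED = "import hashlib\nh = hashlib.md5(data).hexdigest()\n"

if __name__ == "__main__":
    print(static_similarity(REFERENCE, SAME_WITH_NOISE))
    print(static_similarity(REFERENCE, NOT_MIGRATED))
```

The second comparison shows why a static score cannot stand alone: an output that never migrated the primitive still shares most of its characters with the reference, so it scores well above zero.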

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our empirical evaluation. We address each major comment below, indicating planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Dataset section] Synthetic 800-pair construction: the headline static similarity of 0.9072 and dynamic correctness of 92.5% rest on category-specific functional tests whose coverage of call-pattern changes, exception handling, and multi-primitive interactions is not independently quantified. The paper itself reports degraded performance on real repositories with cross-module dependencies, which indicates that the synthetic regime may systematically understate task difficulty.

    Authors: We agree that independent quantification of test coverage (e.g., via statement or branch coverage) would improve transparency. The category-specific tests were constructed to exercise the primary functional requirements of each primitive family and multi-primitive combinations, including representative call patterns and exception paths. We did not compute aggregate coverage metrics across the 800 pairs. The manuscript already notes performance degradation on real repositories as a limitation of the synthetic regime. In revision we will expand the Dataset section with a more explicit description of the test cases and add a dedicated paragraph discussing how the synthetic setting may understate cross-module complexity. Revision: partial.

  2. Referee: [Real-repository validation paragraph] The claim that the approach 'can produce useful migrations in localized cryptographic modules' is supported only by qualitative description. Without quantitative static and dynamic scores comparable to the synthetic 0.9072 / 92.5% figures, it is impossible to calibrate how far the primary results generalize to production cryptographic code.

    Authors: We accept that the real-repository check is qualitative and cannot be directly calibrated against the synthetic metrics. This component was included to illustrate localized applicability rather than to provide a quantitative generalization study; constructing executable test harnesses and resolving full dependency graphs for the six repositories would have required substantial additional engineering outside the paper's scope. We will revise the paragraph to explicitly characterize the validation as qualitative, remove any implication of direct comparability, and reinforce the already-stated limitations regarding complex dependencies. Revision: yes.
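The statement coverage mentioned in the first response can be approximated with very little tooling. The sketch below records which lines of a fragment a set of test calls actually executes, using `sys.settrace` from the standard library. The helper `executed_lines` and the sample fragment `load_key` are hypothetical and are not part of the paper's test suite; a real study would more likely rely on a dedicated coverage tool.

```python
import sys
from typing import Callable, Set, Tuple


def executed_lines(func: Callable, *calls: Tuple) -> Set[int]:
    """Line offsets (relative to the `def` line) that the given calls execute."""
    code = func.__code__
    hit: Set[int] = set()

    def tracer(frame, event, arg):
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer  # keep tracing lines inside the target function
        return None  # ignore every other frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for args in calls:
            try:
                func(*args)
            except Exception:
                pass  # exception paths are part of what is being measured
    finally:
        sys.settrace(previous)
    return hit


# Hypothetical fragment with one exception path.
def load_key(raw: bytes) -> bytes:
    if len(raw) != 32:
        raise ValueError("expected a 32-byte key")
    return raw


if __name__ == "__main__":
    happy_only = executed_lines(load_key, (b"k" * 32,))
    with_error = executed_lines(load_key, (b"k" * 32,), (b"short",))
    print(sorted(happy_only), sorted(with_error))
```

A test that only feeds valid input never reaches the `raise` line, which is the gap the referee is pointing at when asking how well exception handling is exercised.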

Circularity Check

0 steps flagged

No circularity: purely empirical measurement on synthetic dataset

full rationale

The paper reports experimental results from fine-tuning and evaluating LLMs on a fixed synthetic dataset of 800 code pairs, using static similarity and dynamic functional tests as direct metrics. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present; performance numbers (0.9072 similarity, 92.5% correctness) are measured outputs, not reductions of prior fitted quantities. The study is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that functional tests on synthetic fragments are adequate proxies for real cryptographic correctness and on standard supervised fine-tuning assumptions; no new physical or mathematical entities are introduced.

axioms (1)
  • Domain assumption: Category-specific functional tests are sufficient to determine whether a migrated code fragment preserves functional correctness.
    The dynamic correctness rate of 92.5% is computed directly from these tests.

pith-pipeline@v0.9.1-grok · 5842 in / 1331 out tokens · 30408 ms · 2026-06-27T21:51:35.617548+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 6 canonical work pages

  1. Nadeem Ahmed, Lei Zhang, and Aryya Gangopadhyay. “A survey of post-quantum cryptography support in cryptographic libraries”. In: arXiv preprint arXiv:2508.16078 (2025)

  2. Gorjan Alagic et al. Status Report on the First Round of the Additional Digital Signature Schemes for the NIST Post-Quantum Cryptography Standardization Process. NIST Interagency/Internal Report (NISTIR) 8528. Gaithersburg, MD, USA: National Institute of Standards and Technology, Oct. 2024. doi: 10.6028/NIST.IR.8528. url: https://doi.org/10.60...

  3. Elaine Barker et al. Considerations for Achieving Cryptographic Agility: Strategies and Practices. NIST Cybersecurity White Paper NIST CSWP 39. National Institute of Standards and Technology, Dec. 2025. doi: 10.6028/NIST.CSWP.39. url: https://doi.org/10.6028/NIST.CSWP.39

  4. Ward Beullens et al. Post-Quantum Cryptography: Current State and Quantum Mitigation (v2). Tech. rep. European Union Agency for Cybersecurity (ENISA), 2021. doi: 10.2824/92307. url: https://www.enisa.europa.eu/sites/default/files/publications/ENISA%20Report%20-%20Post-Quantum%20Cryptography%20Current%20state%20and%20...

  5. Cybersecurity and Infrastructure Security Agency (CISA). Strategy for Migrating to Automated Post-Quantum Cryptography Discovery and Inventory Tools. Tech. rep. Accessed May 23, 2025. U.S. Department of Homeland Security, Sept. 2024. url: https://www.cisa.gov/sites/default/files/2024-09/Strategy-for-Migrating-to-Automated-PQC-Discovery-a...

  6. Tim Dettmers et al. “Qlora: Efficient finetuning of quantized llms”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 10088–10115

  7. Iria de-Dios-Flores et al. “Nos CorpusNOS-GL: Galician Macrocorpus for LLM training”. In: Nos CorpusNOS-GL: Galician Macrocorpus for LLM training (2024)

  8. Morris J Dworkin et al. “SHA-3 standard: Permutation-based hash and extendable-output functions”. In: (2015)

  9. European Commission and NIS Cooperation Group. Coordinated Implementation Roadmap for the Transition to Post-Quantum Cryptography. https://digital-strategy.ec.europa.eu/en/library/coordinated-implementation-roadmap-transition-post-quantum-cryptography. European Union policy roadmap on coordinated migration to post-quantum cryp...

  10. Tairan Fu et al. “Why Do Large Language Models (LLMs) Struggle to Count Letters?” In: arXiv preprint arXiv:2412.18626 (2024)

  11. Ayaka Harigai et al. “Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan”. In: Japanese Journal of Radiology 43.2 (2025), pp. 319–329

  12. Khondokar Fida Hasan et al. “A framework for migrating to post-quantum cryptography: Security dependency analysis and case studies”. In: IEEE Access 12 (2024), pp. 23427–23450

  13. Juyong Jiang et al. “A Survey on Large Language Models for Code Generation”. In: ACM Transactions on Software Engineering and Methodology 35.2 (Jan. 2026), pp. 1–72. issn: 1557-7392. doi: 10.1145/3747588. url: http://dx.doi.org/10.1145/3747588

  14. Marc Kaplan et al. “Breaking symmetric cryptosystems using quantum period finding”. In: Advances in Cryptology–CRYPTO 2016: 36th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 14-18, 2016, Proceedings, Part II 36. Springer. 2016, pp. 207–237

  15. Marie-Anne Lachaux et al. “Unsupervised translation of programming languages”. In: arXiv preprint arXiv:2006.03511 (2020)

  16. Nathalie Lang and Stefan Lucks. “On the post-quantum security of classical authenticated encryption schemes”. In: International Conference on Cryptology in Africa. Springer. 2023, pp. 79–104

  17. Zhihao Li et al. “CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection”. In: arXiv preprint arXiv:2508.11599 (2025)

  18. Zihao Li et al. “Quantifying multilingual performance of large language models across languages”. In: arXiv e-prints (2024), arXiv–2404

  19. Zijie Lin et al. “AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers”. In: arXiv preprint arXiv:2504.20115 (2025)

  20. Yiheng Liu et al. “Understanding llms: A comprehensive overview from training to inference”. In: Neurocomputing (2024), p. 129190

  21. Aman Madaan et al. “Self-refine: Iterative refinement with self-feedback”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 46534–46594

  22. Tarek Mahmud et al. “Automated Update of Android Deprecated API Usages with Large Language Models”. In: arXiv preprint arXiv:2411.04387 (2024)

  23. Utsav Maskey, Chencheng Zhu, and Usman Naseem. “Benchmarking large language models for cryptanalysis and mismatched-generalization”. In: arXiv preprint arXiv:2505.24621 (2025)

  24. Zohaib Masood and Miguel Vargas Martin. “Beyond Static Tools: Evaluating Large Language Models for Cryptographic Misuse Detection”. In: arXiv preprint arXiv:2411.09772 (2024)

  25. National Institute of Standards and Technology. FIPS 203: Module-Lattice-Based Key-Encapsulation Mechanism Standard. Federal Information Processing Standards Publication 203. National Institute of Standards and Technology, Aug. 2024. doi: 10.6028/NIST.FIPS.203. url: https://doi.org/10.6028/NIST.FIPS.203

  26. National Institute of Standards and Technology. FIPS 204: Module-Lattice-Based Digital Signature Standard. Federal Information Processing Standards Publication 204. National Institute of Standards and Technology, Aug. 2024. doi: 10.6028/NIST.FIPS.204. url: https://doi.org/10.6028/NIST.FIPS.204

  27. National Institute of Standards and Technology. FIPS 205: Stateless Hash-Based Digital Signature Standard. Federal Information Processing Standards Publication 205. National Institute of Standards and Technology, Aug. 2024. doi: 10.6028/NIST.FIPS.205. url: https://doi.org/10.6028/NIST.FIPS.205

  28. National Institute of Standards and Technology. NIST Announces First Four Quantum-Resistant Cryptographic Algorithms. https://www.nist.gov/news-events/news/2022/07/nist-announces-first-four-quantum-resistant-cryptographic-algorithms. Accessed May 30, 2025. July 2022

  29. National Institute of Standards and Technology (NIST). Migration to Post-Quantum Cryptography: Mappings to Risk Framework. Tech. rep. Draft White Paper. Available at https://www.nist.gov/news-events/news/2025/09/new-draft-white-paper-pqc-migration-mappings-risk-framework-docs. National Cybersecurity Center of Excellence (NCCoE), S...

  30. OpenAI. GPT-4.1 Overview. Accessed May 22, 2025. 2025. url: https://platform.openai.com/docs/models/gpt-4.1

  31. OpenAI. Introducing GPT-4.1 in the API. Accessed May 22, 2025. Apr. 2025. url: https://openai.com/index/gpt-4-1/

  32. OpenAI. Pricing - OpenAI API. Accessed May 22, 2025. 2025. url: https://platform.openai.com/docs/pricing/

  33. Javier Pallarés de Bonrostro and Ana Isabel González-Tablas. Cryptographic Migration Dataset: Pre-Quantum to Post-Quantum. Version V1. 2025. doi: 10.21950/7GK4MJ. url: https://doi.org/10.21950/7GK4MJ

  34. Post-Quantum Cryptography Coalition (PQCC). PQC Migration Roadmap. Tech. rep. Available at https://pqcc.org/wp-content/uploads/2025/05/PQC-Migration-Roadmap-PQCC-2.pdf. PQC Coalition, May 2025

  35. Robert Praas. Self-Reflection on Chain-of-Thought Reasoning in Large Language Models. 2023

  36. IBM Research. Cryptography Bill of Materials (CBOM). https://research.ibm.com/blog/crypto-bill-of-materials. Accessed January

  37. Nino Ricchizzi, Christian Schwinne, and Jan Pelzl. “Applied Post Quantum Cryptography: A Practical Approach for Generating Certificates in Industrial Environments”. In: arXiv preprint arXiv:2505.04333 (2025)

  38. Baptiste Roziere et al. “Code llama: Open foundation models for code”. In: arXiv preprint arXiv:2308.12950 (2023)

  39. Atsushi Shirafuji et al. “Refactoring programs using large language models with few-shot examples”. In: 2023 30th Asia-Pacific Software Engineering Conference (APSEC). IEEE. 2023, pp. 151–160

  40. Dimitrios Sikeridis et al. “ELCA: Introducing enterprise-level cryptographic agility for a post-quantum era”. In: Cryptology ePrint Archive (2023)

  41. Joseph Strauss et al. “Assessing and Enhancing Quantum Readiness in Mobile Apps”. In: arXiv preprint arXiv:2506.00790 (2025). Available at https://arxiv.org/abs/2506.00790

  42. Seyed Mohammad Taghavi Far and Farid Feyzi. “Large language models for software vulnerability detection: a guide for researchers on models, methods, techniques, datasets, and metrics”. In: International Journal of Information Security 24.2 (2025), p. 78

  43. Hugo Touvron et al. “Llama: Open and efficient foundation language models”. In: arXiv preprint arXiv:2302.13971 (2023)

  44. Meltem Sönmez Turan et al. Ascon-Based Lightweight Cryptography Standards for Constrained Devices: Authenticated Encryption, Hash, and Extendable Output Functions. Tech. rep. NIST SP 800-232 (Initial Public Draft). National Institute of Standards and Technology, Oct. 2024. url: https://csrc.nist.gov/pubs/sp/800/232/ipd

  45. Ashish Vaswani et al. “Attention is all you need”. In: Advances in Neural Information Processing Systems 30 (2017)

  46. Xingjiao Wu et al. “A survey of human-in-the-loop for machine learning”. In: Future Generation Computer Systems 135 (2022), pp. 364–381

  47. Jingfeng Yang et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. 2023. arXiv:2304.13712 [cs.CL]. url: https://arxiv.org/abs/2304.13712

  48. Quanjun Zhang et al. A Survey on Large Language Models for Software Engineering. 2024. arXiv:2312.15223 [cs.SE]. url: https://arxiv.org/abs/2312.15223

  49. Bingzhe Zhou et al. “Hybrid API migration: A marriage of small API mapping models and large language models”. In: Proceedings of the 14th Asia-Pacific Symposium on Internetware. 2023, pp. 12–21

  50. Celal Ziftci et al. “Migrating Code At Scale With LLMs At Google”. In: arXiv preprint arXiv:2504.09691 (2025)

Appendix A: Full dataset distribution

Table 8: Distribution of single cryptographic primitives across training and validation splits.

Primitive / Alg.    #Var.   Train   Val.
Hash functions
  BLAKE2b           7       7       0
  BLAKE2s           7       7       0
  MD5               10      9       1
  RIPEMD160         1       1       0
  SHA-1             ...