pith. machine review for the scientific record.

arxiv: 2604.03048 · v1 · submitted 2026-04-03 · 💻 cs.SE

Recognition: no theorem link

Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:28 UTC · model grok-4.3

classification 💻 cs.SE
keywords algorithm recognition · static code analysis · large language models · hybrid analysis · code comprehension · F1-score evaluation · prompting strategies · identifier obfuscation

The pith

Static code analysis filters paired with LLMs cut model calls by up to 97.5 percent while raising algorithm recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether lightweight static checks can pre-screen source code before an LLM attempts to name the algorithm inside it. By applying simple filter patterns that look for structural signatures, the method skips the LLM on obvious non-matches. Experiments show the hybrid pipeline needs far fewer LLM queries than a pure model approach and produces higher F1 scores. The same tests reveal that LLMs still succeed even when identifiers are replaced with meaningless strings. A reader would care because the technique offers a practical route to faster, cheaper automated code understanding for maintenance and comprehension tasks.
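
As a rough sketch of that control flow (not the authors' implementation), the hybrid step can be read as a per-algorithm static filter gating the model call; the filter_patterns dictionary and the llm_classify callable below are hypothetical placeholders.

    def classify_snippet(snippet, filter_patterns, llm_classify):
        """Hybrid pipeline sketch: invoke the LLM only for algorithms whose static filter fires.
        filter_patterns: {algorithm_name: compiled regex}; llm_classify(snippet, algorithm) -> bool.
        Both stand in for the paper's actual patterns and model calls."""
        labels = {}
        for algorithm, pattern in filter_patterns.items():
            if pattern.search(snippet) is None:
                labels[algorithm] = False  # cheap static rejection, no LLM call spent
            else:
                labels[algorithm] = llm_classify(snippet, algorithm)  # expensive model check
        return labels

Every snippet that no pattern matches never reaches the model, which is where the reported reduction in calls comes from; the accuracy gain then hinges on the filters rarely rejecting true implementations.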

Core claim

Combining LLMs with lightweight static analysis using different filter patterns reduces required LLM calls by 72.39-97.50 percent depending on the pattern chosen. The same combination raises F1-scores by up to 12 percentage points over the LLM-only baseline. In-context learning with two examples gives a practical trade-off of 75-77 percent F1 at modest extra cost, and the models continue to recognize most algorithms even after systematic identifier obfuscation.
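
A minimal sketch of the two-example in-context setup described above; the prompt wording, the 0-5 score scale, and the reading of "two examples" as one positive plus one negative snippet are illustrative assumptions, not the paper's actual prompt.

    def build_icl_prompt(algorithm, positive_example, negative_example, snippet):
        """Assemble an illustrative score prompt with two in-context examples."""
        return (
            f"Rate from 0 to 5 how likely it is that the following Java method implements {algorithm}.\n\n"
            f"Example that implements {algorithm} (score 5):\n{positive_example}\n\n"
            f"Example that does not implement {algorithm} (score 0):\n{negative_example}\n\n"
            f"Method to rate:\n{snippet}\n\n"
            f"Score:"
        )

A score prompt of this kind is also what lets the threshold in Figure 2 trade recall against precision: a prediction counts as positive only if the returned score meets the chosen cutoff.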

What carries the argument

Lightweight static analysis filter patterns that pre-screen code snippets and decide whether to invoke the LLM for algorithm classification.
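
The filter recipes quoted in the reference graph below distinguish recall-focused keyword patterns from prominent-feature patterns. For a matrix-transpose filter the contrast could look roughly like this; both regexes are illustrative stand-ins, not the published patterns.

    import re

    # Recall-focused keyword pattern: fire on any explicit or generic identifier that
    # transpose implementations tend to use (illustrative, not the paper's regex).
    recall_focused = re.compile(r"transposeMatrix|transpose|\brows\b|\bcols\b|\btemp\b", re.IGNORECASE)

    # Prominent-feature pattern: fire only on the characteristic index swap
    # x[j][i] = y[i][j], the structural signature of a transpose (again an illustrative stand-in).
    prominent_feature = re.compile(r"\[\s*(\w+)\s*\]\s*\[\s*(\w+)\s*\]\s*=\s*\w+\s*\[\s*\2\s*\]\s*\[\s*\1\s*\]")

The looser keyword pattern rejects almost nothing and so saves fewer calls; the prominent-feature pattern saves more calls but risks excluding unusual implementations, which is the trade-off behind the 72.39-97.50 percent range.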

If this is right

  • In-context learning with two examples balances accuracy and speed across the tested algorithms.
  • LLMs retain most of their recognition ability when all identifiers are replaced by meaningless tokens.
  • The hybrid method delivers both lower runtime and higher F1 than either component used alone.
  • Different filter patterns produce different trade-offs between call reduction and accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar pre-filters could reduce LLM usage in other static analysis tasks such as bug pattern detection.
  • The approach suggests a general design pattern of cheap structural checks before expensive model inference.
  • Developers might integrate these filters into IDEs to provide on-the-fly algorithm labels during browsing.

Load-bearing premise

The static filter patterns reliably exclude only non-matching code without missing any true algorithm implementations or biasing the test set.

What would settle it

A new test collection containing algorithm implementations that the chosen static filters incorrectly reject, measured by whether recall falls below the pure-LLM baseline.
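
A minimal sketch of that check, assuming a labelled test collection; the function and variable names are illustrative, not taken from the paper.

    def filter_recall(snippets, gold_labels, pattern):
        """Fraction of true implementations that survive the static filter.
        snippets: list of source strings; gold_labels: matching list of booleans."""
        true_implementations = [s for s, is_impl in zip(snippets, gold_labels) if is_impl]
        if not true_implementations:
            return 1.0
        kept = sum(1 for s in true_implementations if pattern.search(s) is not None)
        return kept / len(true_implementations)

Because the hybrid pipeline's recall is bounded above by the filter's recall times the LLM's recall on the surviving snippets, a filter recall already below the pure-LLM recall would settle the question against the hybrid setup.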

Figures

Figures reproduced from arXiv: 2604.03048 by David Schüler, Denis Neumüller, Matthias Tichy, and Sebastian Boll.

Figure 1. Average F1-score for Yes/No and Score prompting.
Figure 2. We can see that increasing the score threshold indeed …
Figure 3. Average F1-score of in-context learning with varying …
Figure 4. Average F1-scores for the Baseline, (4P+4N) as the …
Original abstract

Context: Since it is well-established that developers spend a substantial portion of their time understanding source code, the ability to automatically identify algorithms within source code presents a valuable opportunity. This capability can support program comprehension, facilitate maintenance, and enhance overall software quality. Objective: We empirically evaluate how combining LLMs with static code analysis can improve the automated recognition of algorithms, while also evaluating their standalone performance and dependence on identifier names. Method: We perform multiple experiments evaluating the combination of LLMs with static analysis using different filter patterns. We compare this combined approach against their standalone performance under various prompting strategies and investigate the impact of systematic identifier obfuscation on classification performance and runtime. Results: The combination of LLMs with lightweight static analysis performs surprisingly well, reducing required LLM calls by 72.39-97.50% depending on the filter pattern. This not only lowers runtime significantly but also improves F1-scores by up to 12 percentage points (pp) compared to the baseline. Regarding the different prompting strategies, in-context learning with two examples provides an effective trade-off between classification performance and runtime efficiency, achieving F1-scores of 75-77% with only a modest increase in inference time. Lastly, we find that LLMs are not solely dependent on name-information as they are still able to identify most algorithm implementations when identifiers are obfuscated. Conclusion: By combining LLMs with static analysis, we achieve substantial reductions in runtime while simultaneously improving F1-scores, underscoring the value of a hybrid approach.
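
The obfuscation experiment in the abstract replaces identifiers with meaningless names. A toy sketch of the idea follows; the paper's own tooling (per its references) rewrites Java sources via Spoon, so this regex-based version is only an approximation.

    import re
    from itertools import count

    JAVA_KEYWORDS = {"int", "double", "boolean", "void", "for", "while", "if", "else",
                     "return", "new", "class", "public", "private", "static"}

    def obfuscate_identifiers(code):
        """Replace every non-keyword identifier with a meaningless token v0, v1, ..."""
        mapping, counter = {}, count()

        def rename(match):
            name = match.group(0)
            if name in JAVA_KEYWORDS:
                return name
            if name not in mapping:
                mapping[name] = f"v{next(counter)}"
            return mapping[name]

        return re.sub(r"\b[A-Za-z_]\w*\b", rename, code)

    print(obfuscate_identifiers("int mid = (low + high) / 2;"))  # -> int v0 = (v1 + v2) / 2;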

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that combining LLMs with lightweight static code analysis for algorithm recognition reduces required LLM calls by 72.39-97.50% (depending on filter pattern) and improves F1-scores by up to 12 pp over LLM-only baselines, that two-example in-context learning offers an effective performance-runtime trade-off (F1 75-77%), and that LLMs remain effective even under systematic identifier obfuscation.

Significance. If the central results hold after addressing filter evaluation, the hybrid method offers a practical route to faster, more accurate algorithm detection tools that could support program comprehension, maintenance, and quality assurance in software engineering. The multi-strategy experiments (filter patterns, prompting variants, obfuscation) and direct runtime measurements provide concrete evidence of efficiency gains that would be valuable for IDE integration or large-scale codebases.

major comments (2)
  1. [Results] Results section (and abstract): The reported F1 gains (+12 pp) and LLM-call reductions (72-97%) are measured only on the post-filter subset; no recall figures for the static filter patterns are provided, nor is there a manual audit or false-negative count for rejected snippets. If the filters silently drop true algorithm instances, the headline metrics become conditional on near-perfect filter recall and may overstate the hybrid system's advantage on the full input distribution.
  2. [Method] Method section: The experimental description lacks sample sizes, number of distinct algorithms/code snippets, statistical significance tests, and explicit controls for selection bias introduced by the filters. These omissions make it impossible to judge whether the 75-77% F1 range and the obfuscation results are robust or sensitive to the particular corpus and filter thresholds chosen.
minor comments (2)
  1. [Abstract] Abstract and Results: The exact number of prompting strategies, filter patterns, and obfuscation levels tested should be stated numerically rather than described qualitatively as 'multiple'.
  2. [Results] Clarify whether the baseline LLM-only F1 scores were computed on the identical post-filter subset or on the full unfiltered set; the comparison must be apples-to-apples to support the 'improves F1' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and completeness on the points raised.

Point-by-point responses
  1. Referee: [Results] Results section (and abstract): The reported F1 gains (+12 pp) and LLM-call reductions (72-97%) are measured only on the post-filter subset; no recall figures for the static filter patterns are provided, nor is there a manual audit or false-negative count for rejected snippets. If the filters silently drop true algorithm instances, the headline metrics become conditional on near-perfect filter recall and may overstate the hybrid system's advantage on the full input distribution.

    Authors: We agree that the headline metrics are conditional on the post-filter subset and that recall of the static filters is an important missing piece for evaluating the hybrid approach on the full input distribution. In the revised manuscript we will add recall figures for each filter pattern, report the number of true-positive algorithm instances rejected by the filters, and include a manual audit of a random sample of rejected snippets to quantify false-negative rates. These additions will make the conditional nature of the results explicit and allow readers to assess the overall trade-off. revision: yes

  2. Referee: [Method] Method section: The experimental description lacks sample sizes, number of distinct algorithms/code snippets, statistical significance tests, and explicit controls for selection bias introduced by the filters. These omissions make it impossible to judge whether the 75-77% F1 range and the obfuscation results are robust or sensitive to the particular corpus and filter thresholds chosen.

    Authors: We acknowledge the need for greater experimental detail. The revised Method section will report the exact number of code snippets and distinct algorithms in the corpus, include statistical significance tests (e.g., McNemar’s test on paired per-snippet classification outcomes), and discuss selection bias by reporting results across multiple filter-threshold settings and on the unfiltered corpus where feasible. These changes will allow readers to evaluate robustness directly. revision: yes
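
A hedged sketch of such a paired check: an exact McNemar-style test over per-snippet correctness of two classifiers, computed with a binomial test on the discordant pairs. Names and structure are illustrative, not the authors' analysis code.

    from scipy.stats import binomtest

    def mcnemar_exact(correct_a, correct_b):
        """Exact McNemar test on paired correctness of two classifiers.
        correct_a, correct_b: equal-length boolean lists (was each prediction correct?)."""
        b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # A right, B wrong
        c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # B right, A wrong
        if b + c == 0:
            return 1.0  # no discordant pairs, no evidence of a difference
        return binomtest(b, n=b + c, p=0.5).pvalue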

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

full rationale

The paper reports results from direct experimental measurements comparing hybrid LLM+static-analysis pipelines against baselines on code datasets. No equations, parameter fitting, or derivations are present; performance metrics (F1, runtime, call reduction) are computed from observed outcomes rather than constructed from inputs. Self-citations, if any, are not load-bearing for the central claims, which rest on reproducible experimental comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study: it introduces no free parameters or invented entities, and its single load-bearing axiom is a domain assumption about the test set.

axioms (1)
  • domain assumption: the selected code snippets represent typical implementations of the target algorithms
    The performance claims depend on the test set being representative of real-world code.

pith-pipeline@v0.9.0 · 5582 in / 1303 out tokens · 73022 ms · 2026-05-13T19:28:17.222849+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 11 internal anchors

  1. [1]

    Since OpenAI’s API pricing is based on token count, we used the reduced dataset for GPT

Results: To choose a sensible baseline for further experiments we compared the Yes/No and the score-prompt. Since OpenAI’s API pricing is based on token count, we used the reduced dataset for GPT. Figure 1 shows the average F1-score over all algorithms in the dataset for each of the evaluated LLMs with the two prompting styles. When using the score promp...

  2. [2]

    However, the score prompt offers the added benefit of enabling adjustments to the recall-precision trade-off

Discussion: In our experiment, the binary (Y/N) and score prompts yield very similar F1-scores. However, the score prompt offers the added benefit of enabling adjustments to the recall-precision trade-off. For this reason, we select the score prompt as the baseline for all subsequent experiments. We also tested a finer 0–10 scale. GPT and Llama performe...

  3. [3]

We therefore defined the following two types of negative examples: (i) random negatives are methods that share no similarities with the seven algorithm types used in our dataset

Pre-study Regarding Negative Examples: We also wondered whether negative examples consisting of random code or code that shares conceptual or structural similarity with the algorithm — without actually implementing it — would lead to better results. We therefore defined the following two types of negative examples: (i) random negatives are methods that share...

  4. [4]

    For all models providing examples in the context increases performance compared to the baseline

Results: Figure 3 displays the F1-score achieved by the LLMs when using the different example combinations compared to the baseline. For all models providing examples in the context increases performance compared to the baseline. GPT improves from a baseline F1-score of 69% to 77% while Llama improves from 70% to 78% with the (4P+4N) combination. For GPT...

  5. [5]

    sweet-spot

Discussion: From the experiment, we conclude that in-context learning improves performance by 4–8 percentage points (pp). We also find that positive examples have a higher relative improvement in performance compared to negative examples. Providing more than two positive examples only marginally increases performance while at the same time linearly increas...

  6. [6]

    With CoT prompting GPT, Llama and Mixtral achieved F1-scores of 68%, 67% and 72% respectively

Results: Figure 4 displays the results for CoT compared to the baseline and the best performing in-context learning combination (4P+4N) from our previous experiments. With CoT prompting GPT, Llama and Mixtral achieved F1-scores of 68%, 67% and 72% respectively. Surprisingly the use of CoT prompting leads to a decrease in performance for the GPT and Llama m...

  7. [7]

    sweet-spot

Discussion: We find that CoT prompting is inferior when compared to ICL, both in terms of achieved F1-score as well as runtime. One possible explanation for this could be that our models are too small to take full advantage of CoT. This explanation is supported by the experiments of Wei et al. [54] who find that CoT prompting only shows its positive effec...

  8. [8]

    Creation recipe: First, we extracted all identifiers from the example implementations

Recall Focused: These patterns aim to maximize recall to avoid excluding any true positives. Creation recipe: First, we extracted all identifiers from the example implementations. Next we divided identifiers into explicit (e.g., transposeMatrix, transpose) and generic (e.g., rows, cols, temp, ...). We then defined regular expressions for each group and in...

  9. [9]

    Although there is no silver bullet that works for each algorithm, the following modifications proved effective in achieving this

Recall Focused Enhanced Precision: With these patterns, our goal was to increase precision considerably while maintaining high recall. Although there is no silver bullet that works for each algorithm, the following modifications proved effective in achieving this. Creation recipe: We removed overly generic keywords shared across multiple patterns and lower...

  10. [10]

    These patterns capture only the most important features that are characteristic for the implementations of a specific algorithm

Prominent Feature: The objective of the Prominent Feature patterns is to further enhance precision compared to the keyword-based patterns, while preserving high recall. These patterns capture only the most important features that are characteristic for the implementations of a specific algorithm. Creation recipe: We examined our set of example implementation...

  11. [11]

    sweet-spot

Neumüller et al.: To evaluate their DSL, Neumüller et al. [16] also published a set of algorithm search-patterns for BCEval. Unlike our Prominent Feature patterns — which are lightweight filter heuristics to use with LLMs — their patterns are standalone solutions targeting both high precision and high recall by themselves. As a result, these patterns ar...

  12. [12]

    Measuring program comprehension: a large-scale field study with professionals,

    X. Xia, L. Bao, D. Lo, Z. Xing, A. E. Hassan, and S. Li, “Measuring program comprehension: a large-scale field study with professionals,” inProceedings of the 40th International Conference on Software Engineering, ser. ICSE ’18. New York, NY , USA: Association for Computing Machinery, 2018, p. 584. [Online]. Available: https://doi.org/10.1145/3180155.3182538

  13. [13]

    I know what you did last summer: an investigation of how developers spend their time,

R. Minelli, A. Mocci, and M. Lanza, “I know what you did last summer: an investigation of how developers spend their time,” in Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ser. ICPC ’15. Florence, Italy: IEEE Press, 2015, pp. 25–35

  14. [14]

    Recovering Architectural Design Decisions,

    A. Shahbazian, Y . K. Lee, D. M. Le, Y . Brun, and N. Medvidovic, “Recovering Architectural Design Decisions,” inIEEE International Conference on Software Architecture, ICSA 2018, Seattle, WA, USA, April 30 - May 4, 2018. IEEE Computer Society, 2018, pp. 95–104. [Online]. Available: https://doi.org/10.1109/ICSA.2018.00019

  15. [15]

    Model-Driven Reverse Engineering Approaches: A systematic literature review,

    C. Raibulet, F. A. Fontana, and M. Zanoni, “Model-Driven Reverse Engineering Approaches: A systematic literature review,”IEEE Access, vol. 5, pp. 14 516–14 542, 2017. [Online]. Available: https://doi.org/10.1109/ACCESS.2017.2733518

  16. [16]

    Program concept recognition and transformation,

    W. Kozaczynski, J. Ning, and A. Engberts, “Program concept recognition and transformation,”IEEE Transactions on Software Engineering, vol. 18, no. 12, pp. 1065–1075, Dec. 1992

  17. [17]

    A Memory-Based Approach to Recognizing Programming Plans,

    A. Quilici, “A Memory-Based Approach to Recognizing Programming Plans,”Commun. ACM, vol. 37, no. 5, pp. 84–93, May 1994. [Online]. Available: https://doi.org/10.1145/175290.175301

  18. [18]

Using Attributed Flow Graph Parsing to Recognize Clichés in programs,

L. M. Wills, “Using Attributed Flow Graph Parsing to Recognize Clichés in programs,” in Graph Grammars and Their Application to Computer Science, 5th International Workshop, Williamsburg, VA, USA, November 13-18, 1994, Selected Papers, ser. Lecture Notes in Computer Science, J. E. Cuny, H. Ehrig, G. Engels, and G. Rozenberg, Eds., vol. 1073. Springer, 1994...

  19. [19]

Automatic algorithm recognition and replacement: a new approach to program optimization

R. Metzger and Z. Wen, Automatic algorithm recognition and replacement: a new approach to program optimization. MIT Press, 2000

  20. [20]

    Algorithm Recognition based on Demand-Driven Dataflow Analysis,

    C. Alias and D. Barthou, “Algorithm Recognition based on Demand-Driven Dataflow Analysis,” in10th Working Conference on Reverse Engineering (WCRE 2003), Victoria, Canada, Nov. 2003. [Online]. Available: https://ens-lyon.hal.science/ensl-01663748

  21. [21]

    Autonomous mental development for algorithm recognition,

    G. Zhu and X. Zhu, “Autonomous mental development for algorithm recognition,” inInternational Conference on Information Science and Technology, 2011, pp. 339–347

  22. [22]

    Beacon-and Schema-Based Method for Recognizing Algorithms from Students’ Source Code

    A. Taherkhani and L. Malmi, “Beacon-and Schema-Based Method for Recognizing Algorithms from Students’ Source Code.”Journal of Educational Data Mining, vol. 5, no. 2, pp. 69–101, 2013

  23. [23]

    Towards a Framework for Algorithm Recognition in Binary Code,

    F. Mesnard, E. Payet, and W. Vanhoof, “Towards a Framework for Algorithm Recognition in Binary Code,” inProceedings of the 18th International Symposium on Principles and Practice of Declarative Programming, ser. PPDP ’16. New York, NY , USA: Association for Computing Machinery, 2016, pp. 202–213. [Online]. Available: https://doi.org/10.1145/2967973.2968600

  24. [24]

    ARCC: Assistant for Repetitive Code Comprehension,

    W. Z. Nunez, V . J. Marin, and C. R. Rivero, “ARCC: Assistant for Repetitive Code Comprehension,” inProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY , USA: Association for Computing Machinery, 2017, pp. 999–1003. [Online]. Available: https://doi.org/10.1145/3106237.3122824

  25. [25]

    Automated Personalized Feedback in Introductory Java Programming MOOCs,

    V . J. Marin, T. Pereira, S. Sridharan, and C. R. Rivero, “Automated Personalized Feedback in Introductory Java Programming MOOCs,” in 2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017, pp. 1259–1270

  26. [26]

Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection,

T. Long, Y. Xie, X. Chen, W. Zhang, Q. Cao, and Y. Yu, “Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 5, pp. 5792–5799, Jun. 2022. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/20522

  27. [27]

Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition,

D. Neumüller, F. Sihler, R. Straub, and M. Tichy, “Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition,” in 2024 4th International Conference on Code Quality (ICCQ), 2024, pp. 1–18

  28. [28]

    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to Algorithms, 3rd Edition. MIT Press, 2009. [Online]. Available: http://mitpress.mit.edu/books/introduction-algorithms

  29. [29]

Providing Information About Implemented Algorithms Improves Program Comprehension: A Controlled Experiment,

D. Neumüller, A. Raschke, and M. Tichy, “Providing Information About Implemented Algorithms Improves Program Comprehension: A Controlled Experiment,” in Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, ser. EASE ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 383–393. [Online...

  30. [30]

Large Language Models for Software Engineering: A Systematic Literature Review,

X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large Language Models for Software Engineering: A Systematic Literature Review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024. [Online]. Available: https://doi.org/10.1145/3695988

  31. [31]

    A Survey on Large Language Models for Code Generation

    J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A Survey on Large Language Models for Code Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.00515

  32. [32]

    A survey of large language models for code: Evolution, benchmarking, and future trends

    Z. Zheng, K. Ning, Y . Wang, J. Zhang, D. Zheng, M. Ye, and J. Chen, “A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends,” 2024. [Online]. Available: https://arxiv.org/abs/2311.10372

  33. [33]

    Few-shot training LLMs for project-specific code-summarization,

    T. Ahmed and P. Devanbu, “Few-shot training LLMs for project-specific code-summarization,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. New York, NY , USA: Association for Computing Machinery,

  34. [34]

    Available: https://doi.org/10.1145/3551349.3559555

    [Online]. Available: https://doi.org/10.1145/3551349.3559555

  35. [35]

Understanding Code Semantics: An Evaluation of Transformer Models in Summarization,

D. Mondal, A. Lodha, A. Sahoo, and B. Kumari, “Understanding Code Semantics: An Evaluation of Transformer Models in Summarization,” 2023

  36. [36]

    Exploring and Unleashing the Power of Large Language Models in Automated Code Translation,

    Z. Yang, F. Liu, Z. Yu, J. W. Keung, J. Li, S. Liu, Y . Hong, X. Ma, Z. Jin, and G. Li, “Exploring and Unleashing the Power of Large Language Models in Automated Code Translation,”Proc. ACM Softw. Eng., vol. 1, no. FSE, Jul. 2024. [Online]. Available: https://doi.org/10.1145/3660778

  37. [37]

    Scalable, Validated Code Translation of Entire Projects using Large Language Models,

    H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, Validated Code Translation of Entire Projects using Large Language Models,”Proc. ACM Program. Lang., vol. 9, no. PLDI, Jun. 2025. [Online]. Available: https://doi.org/10.1145/3729315

  38. [38]

    Automated program repair in the era of large pre-trained language models

    C. S. Xia, Y . Wei, and L. Zhang, “Automated Program Repair in the Era of Large Pre-Trained Language Models,” inProceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23. Melbourne, Victoria, Australia: IEEE Press, 2023, pp. 1482–1494. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00129

  39. [39]

    Repair is nearly generation: multilingual program repair with LLMs,

    H. Joshi, J. C. Sanchez, S. Gulwani, V . Le, I. Radiˇcek, and G. Verbruggen, “Repair is nearly generation: multilingual program repair with LLMs,” inProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances ...

  40. [40]
  41. [41]

    Reporting guidelines for controlled experiments in software engineering,

    A. Jedlitschka and D. Pfahl, “Reporting guidelines for controlled experiments in software engineering,” in2005 International Symposium on Empirical Software Engineering, 2005., 2005, pp. 1–10

  42. [42]

How to Design and Report Experiments

A. Field and G. Hole, How to Design and Report Experiments. London, Thousand Oaks, New Delhi, Singapore, Washington DC: SAGE Publications, 2003

  43. [43]

    Evaluating Large Language Models Trained on Code,

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, Elizabeth et al.,...

  44. [44]

    Instruction-Following Evaluation for Large Language Models

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou, “Instruction-Following Evaluation for Large Language Models,” 2023. [Online]. Available: https://arxiv.org/abs/2311.07911

  45. [45]

    Comparison of AI Models: Intelligence, Performance & Price Analysis,

    Artificial Analysis, “Comparison of AI Models: Intelligence, Performance & Price Analysis,” Online, Jul. 2025, accessed: 2025-07-09. [Online]. Available: https://artificialanalysis.ai/models

  46. [46]

    GPT-4 Technical Report

    OpenAI and Josh Achiam and Steven Adler and Sandhini Agarwal and Lama Ahmad and Ilge Akkaya and Florencia Leoni Aleman and Diogo Almeida and Janko Altenschmidt and Sam Altman and Shyamal Anadkat and Red Avila and Igor Babuschkin and Suchir Balaji and others, “GPT-4 Technical Report,” 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

  47. [47]

    GPT-4o mini,

    OpenAI, “GPT-4o mini,” OpenAI Website, Jul. 2024, accessed: 2025-07-09. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o-mini

  48. [48]

    Introducing GPT-4.1 in the API,

    ——, “Introducing GPT-4.1 in the API,” OpenAI Website, Apr. 2025, accessed: 2025-07-09. [Online]. Available: https://openai.com/index/gpt-4-1/

  49. [49]

    The Llama 3 Herd of Models

    Aaron Grattafiori and Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Alex Vaughan and Amy Yang and Angela Fan and Anirudh Goyal and Anthony Hartshorn and others, “The Llama 3 Herd of Models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  50. [50]

    Cheaper, Better, Faster, Stronger,

    M. A. Team, “Cheaper, Better, Faster, Stronger,” Mistral AI Blog, Apr. 2024, accessed: 2025-07-09. [Online]. Available: https://mistral.ai/news/mixtral-8x22b

  51. [51]

    Open LLM Leaderboard,

    H. F. Team, “Open LLM Leaderboard,” Online, accessed: 2025-07-09. [Online]. Available: https: //huggingface.co/spaces/open-llm-leaderboard/open llm leaderboard

  52. [52]

    GPT-4o mini: advancing cost-efficient intelligence,

    OpenAI, “GPT-4o mini: advancing cost-efficient intelligence,” OpenAI Website, Jul. 2024, accessed: 2025-07-09. [Online]. Available: https: //openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  53. [53]

    LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample, “LLaMA: Open and Efficient Foundation Language Models,” 2023. [Online]. Available: http...

  54. [54]

    Mixtral of Experts,

    Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and others, “Mixtral of Experts,”

  55. [55]

    Mixtral of Experts

    [Online]. Available: https://arxiv.org/abs/2401.04088

  56. [56]

    Bigcloneeval: A clone detection tool evaluation framework with bigclonebench,

    J. Svajlenko and C. K. Roy, “Bigcloneeval: A clone detection tool evaluation framework with bigclonebench,” in2016 IEEE international conference on software maintenance and evolution (ICSME). IEEE, 2016, pp. 596–600

  57. [57]

A. S. E. Group. (2013) IJaDataset 2.0. [Online]. Available: http://web.archive.org/web/20161231055842/http://secold.org/projects/seclone

  58. [58]

    Convolutional Neural Networks over Tree Structures for Programming Language Processing,

    L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional Neural Networks over Tree Structures for Programming Language Processing,” inProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16. Phoenix, Arizona: AAAI Press, 2016, pp. 1287–1293

  59. [59]

    On Precision of Code Clone Detection Tools,

    F. Farmahinifarahani, V . Saini, D. Yang, H. Sajnani, and C. V . Lopes, “On Precision of Code Clone Detection Tools,” in2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Feb. 2019, pp. 84–94

  60. [61]

    Available: https://arxiv.org/abs/2006.15682

    [Online]. Available: https://arxiv.org/abs/2006.15682

  61. [62]

    Towards a Big Data Curated Benchmark of Inter-project Code Clones,

    J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, “Towards a Big Data Curated Benchmark of Inter-project Code Clones,” in2014 IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 476–480

  62. [63]

    A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    P. Sahoo, A. K. Singh, S. Saha, V . Jain, S. Mondal, and A. Chadha, “A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications,” 2025. [Online]. Available: https://arxiv.org/abs/2402.07927

  63. [64]

    Prompt Engineering in Large Language Models,

    G. Marvin, N. Hellen, D. Jjingo, and J. Nakatumba-Nabende, “Prompt Engineering in Large Language Models,” inData Intelligence and Cognitive Informatics, I. J. Jacob, S. Piramuthu, and P. Falkowski-Gilski, Eds. Singapore: Springer Nature Singapore, 2024, pp. 387–402

  64. [65]

    Unleashing the potential of prompt engineering for large language models,

B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering for large language models,” Patterns, vol. 6, no. 6, p. 101260, Jun. 2025. [Online]. Available: http://dx.doi.org/10.1016/j.patter.2025.101260

  65. [66]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

  66. [67]

    A Survey of Large Language Models,

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong, Y . Du, C. Yang, Y . Chen, Z. Chen, J. Jiang, R. Ren, Y . Li, X. Tang, Z. Liu, P. Liu, J.-Y . Nie, and J.-R. Wen, “A Survey of Large Language Models,” 2023

  67. [68]

    A Survey on In-context Learning

    Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, T. Liu, B. Chang, X. Sun, L. Li, and Z. Sui, “A Survey on In-context Learning,” 2024. [Online]. Available: https://arxiv.org/abs/2301.00234

  68. [69]

    Chain-of-thought prompting elicits reasoning in large language models,

    Wei Jason, Wang Xuezhi, Schuurmans Dale, Bosma Maarten, Ichter Brian, Xia Fei, Chi Ed H., Le, Quoc V ., and Zhou, Denny, “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Conference on Neural Information Processing Sys- tems, ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022

  69. [70]

    Large language models are zero-shot reasoners,

    Kojima Takeshi, Gu Shixiang Shane, Reid Machel, Matsuo Yutaka, and Iwasawa Yusuke, “Large language models are zero-shot reasoners,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022

  70. [71]

    Automatic chain of thought prompting in large language models,

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola, “Automatic Chain of Thought Prompting in Large Language Models,” 2022. [Online]. Available: https://arxiv.org/abs/2210.03493

  71. [72]

    A Learning Algorithm for Boltzmann Machines,

    D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,”Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0364021385800124

  72. [73]

    Controlling Linguistic Style Aspects in Neural Language Generation,

    J. Ficler and Y . Goldberg, “Controlling Linguistic Style Aspects in Neural Language Generation,” 2017. [Online]. Available: https://arxiv.org/abs/1707.02633

  73. [74]

    Hierarchical neural story generation.CoRR, abs/1805.04833, 2018

    A. Fan, M. Lewis, and Y . Dauphin, “Hierarchical Neural Story Generation,” 2018. [Online]. Available: https://arxiv.org/abs/1805.04833

  74. [75]

    The Curious Case of Neural Text Degeneration

    A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi, “The Curious Case of Neural Text Degeneration,” 2020. [Online]. Available: https://arxiv.org/abs/1904.09751

  75. [76]

    What’s going on with the Open LLM Leaderboard?

    C. Fourrier, N. Habib, J. Launay, and T. Wolf, “What’s going on with the Open LLM Leaderboard?” Hugging Face Blog, Jun. 2023, accessed: 2025-07-06. [Online]. Available: https://huggingface.co/blog/open-llm-leaderboard-mmlu

  76. [77]

    Exploring the Characteristics of Identifiers: A Large-Scale Empirical Study on 5,000 Open Source Projects,

    Zhang, Jingxuan and Liu, Siyuan and Luo, Junpeng and Liang, Jiahui and Huang, Zhiqiu, “Exploring the Characteristics of Identifiers: A Large-Scale Empirical Study on 5,000 Open Source Projects,”IEEE Access, vol. 8, pp. 140 607–140 620, 2020

  77. [78]

    Algorithm identification in programming assignments,

    P. Chourasia, G. Ramakrishnan, V . Apte, and S. Kumar, “Algorithm identification in programming assignments,” inProceedings of the 30th IEEE/ACM International Conference on Program Comprehension, ser. ICPC ’22. New York, NY , USA: Association for Computing Machinery, 2022, pp. 471–481. [Online]. Available: https://doi.org/10.1145/3524610.3527914

  78. [79]

    BigCloneBench Considered Harmful for Machine Learning,

    J. Krinke and C. Ragkhitwetsagul, “BigCloneBench Considered Harmful for Machine Learning,” in2022 IEEE 16th International Workshop on Software Clones (IWSC), 2022, pp. 1–7

  79. [80]

    Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,

    R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier, “Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,”Software: Practice and Experience, vol. 46, pp. 1155–1179, 2015. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01078532/document

  80. [81]

    Attention Is All You Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,”

Showing first 80 references.