pith. machine review for the scientific record.

arxiv: 2604.03048 · v1 · submitted 2026-04-03 · 💻 cs.SE

Recognition: no theorem link

Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:28 UTC · model grok-4.3

classification 💻 cs.SE
keywords algorithm recognition · static code analysis · large language models · hybrid analysis · code comprehension · F1-score evaluation · prompting strategies · identifier obfuscation

The pith

Static code analysis filters paired with LLMs cut model calls by up to 97.5 percent while raising algorithm recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether lightweight static checks can pre-screen source code before an LLM attempts to name the algorithm inside it. By applying simple filter patterns that look for structural signatures, the method skips the LLM on obvious non-matches. Experiments show the hybrid pipeline needs far fewer LLM queries than a pure model approach and produces higher F1 scores. The same tests reveal that LLMs still succeed even when identifiers are replaced with meaningless strings. A reader would care because the technique offers a practical route to faster, cheaper automated code understanding for maintenance and comprehension tasks.
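
As a rough sketch of that control flow (not the authors' implementation), the hybrid step can be read as a per-algorithm static filter gating the model call; the filter_patterns dictionary and the llm_classify callable below are hypothetical placeholders.

    def classify_snippet(snippet, filter_patterns, llm_classify):
        """Hybrid pipeline sketch: invoke the LLM only for algorithms whose static filter fires.
        filter_patterns: {algorithm_name: compiled regex}; llm_classify(snippet, algorithm) -> bool.
        Both stand in for the paper's actual patterns and model calls."""
        labels = {}
        for algorithm, pattern in filter_patterns.items():
            if pattern.search(snippet) is None:
                labels[algorithm] = False  # cheap static rejection, no LLM call spent
            else:
                labels[algorithm] = llm_classify(snippet, algorithm)  # expensive model check
        return labels

Every snippet that no pattern matches never reaches the model, which is where the reported reduction in calls comes from; the accuracy gain then hinges on the filters rarely rejecting true implementations.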

Core claim

Combining LLMs with lightweight static analysis using different filter patterns reduces required LLM calls by 72.39-97.50 percent depending on the pattern chosen. The same combination raises F1-scores by up to 12 percentage points over the LLM-only baseline. In-context learning with two examples gives a practical trade-off of 75-77 percent F1 at modest extra cost, and the models continue to recognize most algorithms even after systematic identifier obfuscation.
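
A minimal sketch of the two-example in-context setup described above; the prompt wording, the 0-5 score scale, and the reading of "two examples" as one positive plus one negative snippet are illustrative assumptions, not the paper's actual prompt.

    def build_icl_prompt(algorithm, positive_example, negative_example, snippet):
        """Assemble an illustrative score prompt with two in-context examples."""
        return (
            f"Rate from 0 to 5 how likely it is that the following Java method implements {algorithm}.\n\n"
            f"Example that implements {algorithm} (score 5):\n{positive_example}\n\n"
            f"Example that does not implement {algorithm} (score 0):\n{negative_example}\n\n"
            f"Method to rate:\n{snippet}\n\n"
            f"Score:"
        )

A score prompt of this kind is also what lets the threshold in Figure 2 trade recall against precision: a prediction counts as positive only if the returned score meets the chosen cutoff.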

What carries the argument

Lightweight static analysis filter patterns that pre-screen code snippets and decide whether to invoke the LLM for algorithm classification.
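
The filter recipes quoted in the reference graph below distinguish recall-focused keyword patterns from prominent-feature patterns. For a matrix-transpose filter the contrast could look roughly like this; both regexes are illustrative stand-ins, not the published patterns.

    import re

    # Recall-focused keyword pattern: fire on any explicit or generic identifier that
    # transpose implementations tend to use (illustrative, not the paper's regex).
    recall_focused = re.compile(r"transposeMatrix|transpose|\brows\b|\bcols\b|\btemp\b", re.IGNORECASE)

    # Prominent-feature pattern: fire only on the characteristic index swap
    # x[j][i] = y[i][j], the structural signature of a transpose (again an illustrative stand-in).
    prominent_feature = re.compile(r"\[\s*(\w+)\s*\]\s*\[\s*(\w+)\s*\]\s*=\s*\w+\s*\[\s*\2\s*\]\s*\[\s*\1\s*\]")

The looser keyword pattern rejects almost nothing and so saves fewer calls; the prominent-feature pattern saves more calls but risks excluding unusual implementations, which is the trade-off behind the 72.39-97.50 percent range.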

If this is right

  • In-context learning with two examples balances accuracy and speed across the tested algorithms.
  • LLMs retain most of their recognition ability when all identifiers are replaced by meaningless tokens.
  • The hybrid method delivers both lower runtime and higher F1 than either component used alone.
  • Different filter patterns produce different trade-offs between call reduction and accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar pre-filters could reduce LLM usage in other static analysis tasks such as bug pattern detection.
  • The approach suggests a general design pattern of cheap structural checks before expensive model inference.
  • Developers might integrate these filters into IDEs to provide on-the-fly algorithm labels during browsing.

Load-bearing premise

The static filter patterns reliably exclude only non-matching code without missing any true algorithm implementations or biasing the test set.

What would settle it

A new test collection containing algorithm implementations that the chosen static filters incorrectly reject, measured by whether recall falls below the pure-LLM baseline.
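
A minimal sketch of that check, assuming a labelled test collection; the function and variable names are illustrative, not taken from the paper.

    def filter_recall(snippets, gold_labels, pattern):
        """Fraction of true implementations that survive the static filter.
        snippets: list of source strings; gold_labels: matching list of booleans."""
        true_implementations = [s for s, is_impl in zip(snippets, gold_labels) if is_impl]
        if not true_implementations:
            return 1.0
        kept = sum(1 for s in true_implementations if pattern.search(s) is not None)
        return kept / len(true_implementations)

Because the hybrid pipeline's recall is bounded above by the filter's recall times the LLM's recall on the surviving snippets, a filter recall already below the pure-LLM recall would settle the question against the hybrid setup.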

Figures

Figures reproduced from arXiv: 2604.03048 by David Schüler, Denis Neumüller, Matthias Tichy, and Sebastian Boll.

Figure 1. Average F1-score for Yes/No and Score prompting.
Figure 2. We can see that increasing the score threshold indeed …
Figure 3. Average F1-score of in-context learning with varying …
Figure 4. Average F1-scores for the Baseline, (4P+4N) as the …
Original abstract

Context: Since it is well-established that developers spend a substantial portion of their time understanding source code, the ability to automatically identify algorithms within source code presents a valuable opportunity. This capability can support program comprehension, facilitate maintenance, and enhance overall software quality. Objective: We empirically evaluate how combining LLMs with static code analysis can improve the automated recognition of algorithms, while also evaluating their standalone performance and dependence on identifier names. Method: We perform multiple experiments evaluating the combination of LLMs with static analysis using different filter patterns. We compare this combined approach against their standalone performance under various prompting strategies and investigate the impact of systematic identifier obfuscation on classification performance and runtime. Results: The combination of LLMs with lightweight static analysis performs surprisingly well, reducing required LLM calls by 72.39-97.50% depending on the filter pattern. This not only lowers runtime significantly but also improves F1-scores by up to 12 percentage points (pp) compared to the baseline. Regarding the different prompting strategies, in-context learning with two examples provides an effective trade-off between classification performance and runtime efficiency, achieving F1-scores of 75-77% with only a modest increase in inference time. Lastly, we find that LLMs are not solely dependent on name-information as they are still able to identify most algorithm implementations when identifiers are obfuscated. Conclusion: By combining LLMs with static analysis, we achieve substantial reductions in runtime while simultaneously improving F1-scores, underscoring the value of a hybrid approach.
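
The obfuscation experiment in the abstract replaces identifiers with meaningless names. A toy sketch of the idea follows; the paper's own tooling (per its references) rewrites Java sources via Spoon, so this regex-based version is only an approximation.

    import re
    from itertools import count

    JAVA_KEYWORDS = {"int", "double", "boolean", "void", "for", "while", "if", "else",
                     "return", "new", "class", "public", "private", "static"}

    def obfuscate_identifiers(code):
        """Replace every non-keyword identifier with a meaningless token v0, v1, ..."""
        mapping, counter = {}, count()

        def rename(match):
            name = match.group(0)
            if name in JAVA_KEYWORDS:
                return name
            if name not in mapping:
                mapping[name] = f"v{next(counter)}"
            return mapping[name]

        return re.sub(r"\b[A-Za-z_]\w*\b", rename, code)

    print(obfuscate_identifiers("int mid = (low + high) / 2;"))  # -> int v0 = (v1 + v2) / 2;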

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that combining LLMs with lightweight static code analysis for algorithm recognition reduces required LLM calls by 72.39-97.50% (depending on filter pattern) and improves F1-scores by up to 12 pp over LLM-only baselines, that two-example in-context learning offers an effective performance-runtime trade-off (F1 75-77%), and that LLMs remain effective even under systematic identifier obfuscation.

Significance. If the central results hold after addressing filter evaluation, the hybrid method offers a practical route to faster, more accurate algorithm detection tools that could support program comprehension, maintenance, and quality assurance in software engineering. The multi-strategy experiments (filter patterns, prompting variants, obfuscation) and direct runtime measurements provide concrete evidence of efficiency gains that would be valuable for IDE integration or large-scale codebases.

major comments (2)
  1. [Results] Results section (and abstract): The reported F1 gains (+12 pp) and LLM-call reductions (72-97%) are measured only on the post-filter subset; no recall figures for the static filter patterns are provided, nor is there a manual audit or false-negative count for rejected snippets. If the filters silently drop true algorithm instances, the headline metrics become conditional on near-perfect filter recall and may overstate the hybrid system's advantage on the full input distribution.
  2. [Method] Method section: The experimental description lacks sample sizes, number of distinct algorithms/code snippets, statistical significance tests, and explicit controls for selection bias introduced by the filters. These omissions make it impossible to judge whether the 75-77% F1 range and the obfuscation results are robust or sensitive to the particular corpus and filter thresholds chosen.
minor comments (2)
  1. [Abstract] Abstract and Results: The exact number of prompting strategies, filter patterns, and obfuscation levels tested should be stated numerically rather than described qualitatively as 'multiple'.
  2. [Results] Clarify whether the baseline LLM-only F1 scores were computed on the identical post-filter subset or on the full unfiltered set; the comparison must be apples-to-apples to support the 'improves F1' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and completeness on the points raised.

Point-by-point responses
  1. Referee: [Results] Results section (and abstract): The reported F1 gains (+12 pp) and LLM-call reductions (72-97%) are measured only on the post-filter subset; no recall figures for the static filter patterns are provided, nor is there a manual audit or false-negative count for rejected snippets. If the filters silently drop true algorithm instances, the headline metrics become conditional on near-perfect filter recall and may overstate the hybrid system's advantage on the full input distribution.

    Authors: We agree that the headline metrics are conditional on the post-filter subset and that recall of the static filters is an important missing piece for evaluating the hybrid approach on the full input distribution. In the revised manuscript we will add recall figures for each filter pattern, report the number of true-positive algorithm instances rejected by the filters, and include a manual audit of a random sample of rejected snippets to quantify false-negative rates. These additions will make the conditional nature of the results explicit and allow readers to assess the overall trade-off. revision: yes

  2. Referee: [Method] Method section: The experimental description lacks sample sizes, number of distinct algorithms/code snippets, statistical significance tests, and explicit controls for selection bias introduced by the filters. These omissions make it impossible to judge whether the 75-77% F1 range and the obfuscation results are robust or sensitive to the particular corpus and filter thresholds chosen.

    Authors: We acknowledge the need for greater experimental detail. The revised Method section will report the exact number of code snippets and distinct algorithms in the corpus, include statistical significance tests (e.g., McNemar’s test on paired per-snippet classification outcomes), and discuss selection bias by reporting results across multiple filter-threshold settings and on the unfiltered corpus where feasible. These changes will allow readers to evaluate robustness directly. revision: yes
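
A hedged sketch of such a paired check: an exact McNemar-style test over per-snippet correctness of two classifiers, computed with a binomial test on the discordant pairs. Names and structure are illustrative, not the authors' analysis code.

    from scipy.stats import binomtest

    def mcnemar_exact(correct_a, correct_b):
        """Exact McNemar test on paired correctness of two classifiers.
        correct_a, correct_b: equal-length boolean lists (was each prediction correct?)."""
        b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)  # A right, B wrong
        c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)  # B right, A wrong
        if b + c == 0:
            return 1.0  # no discordant pairs, no evidence of a difference
        return binomtest(b, n=b + c, p=0.5).pvalue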

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

full rationale

The paper reports results from direct experimental measurements comparing hybrid LLM+static-analysis pipelines against baselines on code datasets. No equations, parameter fitting, or derivations are present; performance metrics (F1, runtime, call reduction) are computed from observed outcomes rather than constructed from inputs. Self-citations, if any, are not load-bearing for the central claims, which rest on reproducible experimental comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study: it introduces no free parameters or invented entities, and its single load-bearing axiom is a domain assumption about the test set.

axioms (1)
  • domain assumption: the selected code snippets represent typical implementations of the target algorithms
    The performance claims depend on the test set being representative of real-world code.

pith-pipeline@v0.9.0 · 5582 in / 1303 out tokens · 73022 ms · 2026-05-13T19:28:17.222849+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 11 internal anchors

  1. [1]

    Since OpenAI’s API pricing is based on token count, we used the reduced dataset for GPT

Results: To choose a sensible baseline for further experiments we compared the Yes/No and the score-prompt. Since OpenAI’s API pricing is based on token count, we used the reduced dataset for GPT. Figure 1 shows the average F1-score over all algorithms in the dataset for each of the evaluated LLMs with the two prompting styles. When using the score promp...

  2. [2]

    However, the score prompt offers the added benefit of enabling adjustments to the recall-precision trade-off

Discussion: In our experiment, the binary (Y/N) and score prompts yield very similar F1-scores. However, the score prompt offers the added benefit of enabling adjustments to the recall-precision trade-off. For this reason, we select the score prompt as the baseline for all subsequent experiments. We also tested a finer 0–10 scale. GPT and Llama performe...

  3. [3]

We therefore defined the following two types of negative examples: (i) random negatives are methods that share no similarities with the seven algorithm types used in our dataset

Pre-study Regarding Negative Examples: We also wondered whether negative examples consisting of random code or code that shares conceptual or structural similarity with the algorithm — without actually implementing it — would lead to better results. We therefore defined the following two types of negative examples: (i) random negatives are methods that share...

  4. [4]

    For all models providing examples in the context increases performance compared to the baseline

Results: Figure 3 displays the F1-score achieved by the LLMs when using the different example combinations compared to the baseline. For all models providing examples in the context increases performance compared to the baseline. GPT improves from a baseline F1-score of 69% to 77% while Llama improves from 70% to 78% with the (4P+4N) combination. For GPT...

  5. [5]

    sweet-spot

Discussion: From the experiment, we conclude that in-context learning improves performance by 4–8 percentage points (pp). We also find that positive examples have a higher relative improvement in performance compared to negative examples. Providing more than two positive examples only marginally increases performance while at the same time linearly increas...

  6. [6]

    With CoT prompting GPT, Llama and Mixtral achieved F1-scores of 68%, 67% and 72% respectively

Results: Figure 4 displays the results for CoT compared to the baseline and the best performing in-context learning combination (4P+4N) from our previous experiments. With CoT prompting GPT, Llama and Mixtral achieved F1-scores of 68%, 67% and 72% respectively. Surprisingly the use of CoT prompting leads to a decrease in performance for the GPT and Llama m...

  7. [7]

    sweet-spot

Discussion: We find that CoT prompting is inferior when compared to ICL, both in terms of achieved F1-score as well as runtime. One possible explanation for this could be that our models are too small to take full advantage of CoT. This explanation is supported by the experiments of Wei et al. [54] who find that CoT prompting only shows its positive effec...

  8. [8]

    Creation recipe: First, we extracted all identifiers from the example implementations

Recall Focused: These patterns aim to maximize recall to avoid excluding any true positives. Creation recipe: First, we extracted all identifiers from the example implementations. Next we divided identifiers into explicit (e.g., transposeMatrix, transpose) and generic (e.g., rows, cols, temp, ...). We then defined regular expressions for each group and in...

  9. [9]

    Although there is no silver bullet that works for each algorithm, the following modifications proved effective in achieving this

Recall Focused Enhanced Precision: With these patterns, our goal was to increase precision considerably while maintaining high recall. Although there is no silver bullet that works for each algorithm, the following modifications proved effective in achieving this. Creation recipe: We removed overly generic keywords shared across multiple patterns and lower...

  10. [10]

    These patterns capture only the most important features that are characteristic for the implementations of a specific algorithm

Prominent Feature: The objective of the Prominent Feature patterns is to further enhance precision compared to the keyword-based patterns, while preserving high recall. These patterns capture only the most important features that are characteristic for the implementations of a specific algorithm. Creation recipe: We examined our set of example implementation...

  11. [11]

    sweet-spot

Neumüller et al.: To evaluate their DSL, Neumüller et al. [16] also published a set of algorithm search-patterns for BCEval. Unlike our Prominent Feature patterns — which are lightweight filter heuristics to use with LLMs — their patterns are standalone solutions targeting both high precision and high recall by themselves. As a result, these patterns ar...

  12. [12]

    Measuring program comprehension: a large-scale field study with professionals,

    X. Xia, L. Bao, D. Lo, Z. Xing, A. E. Hassan, and S. Li, “Measuring program comprehension: a large-scale field study with professionals,” inProceedings of the 40th International Conference on Software Engineering, ser. ICSE ’18. New York, NY , USA: Association for Computing Machinery, 2018, p. 584. [Online]. Available: https://doi.org/10.1145/3180155.3182538

  13. [13]

    I know what you did last summer: an investigation of how developers spend their time,

R. Minelli, A. Mocci, and M. Lanza, “I know what you did last summer: an investigation of how developers spend their time,” in Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ser. ICPC ’15. Florence, Italy: IEEE Press, 2015, pp. 25–35

  14. [14]

    Recovering Architectural Design Decisions,

    A. Shahbazian, Y . K. Lee, D. M. Le, Y . Brun, and N. Medvidovic, “Recovering Architectural Design Decisions,” inIEEE International Conference on Software Architecture, ICSA 2018, Seattle, WA, USA, April 30 - May 4, 2018. IEEE Computer Society, 2018, pp. 95–104. [Online]. Available: https://doi.org/10.1109/ICSA.2018.00019

  15. [15]

    Model-Driven Reverse Engineering Approaches: A systematic literature review,

    C. Raibulet, F. A. Fontana, and M. Zanoni, “Model-Driven Reverse Engineering Approaches: A systematic literature review,”IEEE Access, vol. 5, pp. 14 516–14 542, 2017. [Online]. Available: https://doi.org/10.1109/ACCESS.2017.2733518

  16. [16]

    Program concept recognition and transformation,

    W. Kozaczynski, J. Ning, and A. Engberts, “Program concept recognition and transformation,”IEEE Transactions on Software Engineering, vol. 18, no. 12, pp. 1065–1075, Dec. 1992

  17. [17]

    A Memory-Based Approach to Recognizing Programming Plans,

    A. Quilici, “A Memory-Based Approach to Recognizing Programming Plans,”Commun. ACM, vol. 37, no. 5, pp. 84–93, May 1994. [Online]. Available: https://doi.org/10.1145/175290.175301

  18. [18]

Using Attributed Flow Graph Parsing to Recognize Clichés in programs,

L. M. Wills, “Using Attributed Flow Graph Parsing to Recognize Clichés in programs,” in Graph Grammars and Their Application to Computer Science, 5th International Workshop, Williamsburg, VA, USA, November 13-18, 1994, Selected Papers, ser. Lecture Notes in Computer Science, J. E. Cuny, H. Ehrig, G. Engels, and G. Rozenberg, Eds., vol. 1073. Springer, 1994...

  19. [19]

Automatic algorithm recognition and replacement: a new approach to program optimization

R. Metzger and Z. Wen, Automatic algorithm recognition and replacement: a new approach to program optimization. MIT Press, 2000

  20. [20]

    Algorithm Recognition based on Demand-Driven Dataflow Analysis,

    C. Alias and D. Barthou, “Algorithm Recognition based on Demand-Driven Dataflow Analysis,” in10th Working Conference on Reverse Engineering (WCRE 2003), Victoria, Canada, Nov. 2003. [Online]. Available: https://ens-lyon.hal.science/ensl-01663748

  21. [21]

    Autonomous mental development for algorithm recognition,

    G. Zhu and X. Zhu, “Autonomous mental development for algorithm recognition,” inInternational Conference on Information Science and Technology, 2011, pp. 339–347

  22. [22]

    Beacon-and Schema-Based Method for Recognizing Algorithms from Students’ Source Code

    A. Taherkhani and L. Malmi, “Beacon-and Schema-Based Method for Recognizing Algorithms from Students’ Source Code.”Journal of Educational Data Mining, vol. 5, no. 2, pp. 69–101, 2013

  23. [23]

    Towards a Framework for Algorithm Recognition in Binary Code,

    F. Mesnard, E. Payet, and W. Vanhoof, “Towards a Framework for Algorithm Recognition in Binary Code,” inProceedings of the 18th International Symposium on Principles and Practice of Declarative Programming, ser. PPDP ’16. New York, NY , USA: Association for Computing Machinery, 2016, pp. 202–213. [Online]. Available: https://doi.org/10.1145/2967973.2968600

  24. [24]

    ARCC: Assistant for Repetitive Code Comprehension,

    W. Z. Nunez, V . J. Marin, and C. R. Rivero, “ARCC: Assistant for Repetitive Code Comprehension,” inProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY , USA: Association for Computing Machinery, 2017, pp. 999–1003. [Online]. Available: https://doi.org/10.1145/3106237.3122824

  25. [25]

    Automated Personalized Feedback in Introductory Java Programming MOOCs,

    V . J. Marin, T. Pereira, S. Sridharan, and C. R. Rivero, “Automated Personalized Feedback in Introductory Java Programming MOOCs,” in 2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017, pp. 1259–1270

  26. [26]

Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection,

T. Long, Y. Xie, X. Chen, W. Zhang, Q. Cao, and Y. Yu, “Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 5, pp. 5792–5799, Jun. 2022. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/20522

  27. [27]

Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition,

D. Neumüller, F. Sihler, R. Straub, and M. Tichy, “Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition,” in 2024 4th International Conference on Code Quality (ICCQ), 2024, pp. 1–18

  28. [28]

    T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,Introduction to Algorithms, 3rd Edition. MIT Press, 2009. [Online]. Available: http://mitpress.mit.edu/books/introduction-algorithms

  29. [29]

Providing Information About Implemented Algorithms Improves Program Comprehension: A Controlled Experiment,

D. Neumüller, A. Raschke, and M. Tichy, “Providing Information About Implemented Algorithms Improves Program Comprehension: A Controlled Experiment,” in Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, ser. EASE ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 383–393. [Online...

  30. [30]

Large Language Models for Software Engineering: A Systematic Literature Review,

X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large Language Models for Software Engineering: A Systematic Literature Review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024. [Online]. Available: https://doi.org/10.1145/3695988

  31. [31]

    A Survey on Large Language Models for Code Generation

    J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A Survey on Large Language Models for Code Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.00515

  32. [32]

    A survey of large language models for code: Evolution, benchmarking, and future trends

    Z. Zheng, K. Ning, Y . Wang, J. Zhang, D. Zheng, M. Ye, and J. Chen, “A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends,” 2024. [Online]. Available: https://arxiv.org/abs/2311.10372

  33. [33]

    Few-shot training LLMs for project-specific code-summarization,

    T. Ahmed and P. Devanbu, “Few-shot training LLMs for project-specific code-summarization,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. New York, NY , USA: Association for Computing Machinery,

  34. [34]

    Available: https://doi.org/10.1145/3551349.3559555

    [Online]. Available: https://doi.org/10.1145/3551349.3559555

  35. [35]

Understanding Code Semantics: An Evaluation of Transformer Models in Summarization,

D. Mondal, A. Lodha, A. Sahoo, and B. Kumari, “Understanding Code Semantics: An Evaluation of Transformer Models in Summarization,” 2023

  36. [36]

    Exploring and Unleashing the Power of Large Language Models in Automated Code Translation,

    Z. Yang, F. Liu, Z. Yu, J. W. Keung, J. Li, S. Liu, Y . Hong, X. Ma, Z. Jin, and G. Li, “Exploring and Unleashing the Power of Large Language Models in Automated Code Translation,”Proc. ACM Softw. Eng., vol. 1, no. FSE, Jul. 2024. [Online]. Available: https://doi.org/10.1145/3660778

  37. [37]

    Scalable, Validated Code Translation of Entire Projects using Large Language Models,

    H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, Validated Code Translation of Entire Projects using Large Language Models,”Proc. ACM Program. Lang., vol. 9, no. PLDI, Jun. 2025. [Online]. Available: https://doi.org/10.1145/3729315

  38. [38]

    Automated program repair in the era of large pre-trained language models

    C. S. Xia, Y . Wei, and L. Zhang, “Automated Program Repair in the Era of Large Pre-Trained Language Models,” inProceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23. Melbourne, Victoria, Australia: IEEE Press, 2023, pp. 1482–1494. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00129

  39. [39]

    Repair is nearly generation: multilingual program repair with LLMs,

    H. Joshi, J. C. Sanchez, S. Gulwani, V . Le, I. Radiˇcek, and G. Verbruggen, “Repair is nearly generation: multilingual program repair with LLMs,” inProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances ...

  40. [40]
  41. [41]

    Reporting guidelines for controlled experiments in software engineering,

    A. Jedlitschka and D. Pfahl, “Reporting guidelines for controlled experiments in software engineering,” in2005 International Symposium on Empirical Software Engineering, 2005., 2005, pp. 1–10

  42. [42]

How to Design and Report Experiments

A. Field and G. Hole, How to Design and Report Experiments. London, Thousand Oaks, New Delhi, Singapore, Washington DC: SAGE Publications, 2003

  43. [43]

    Evaluating Large Language Models Trained on Code,

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, Elizabeth et al.,...

  44. [44]

    Instruction-Following Evaluation for Large Language Models

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou, “Instruction-Following Evaluation for Large Language Models,” 2023. [Online]. Available: https://arxiv.org/abs/2311.07911

  45. [45]

    Comparison of AI Models: Intelligence, Performance & Price Analysis,

    Artificial Analysis, “Comparison of AI Models: Intelligence, Performance & Price Analysis,” Online, Jul. 2025, accessed: 2025-07-09. [Online]. Available: https://artificialanalysis.ai/models

  46. [46]

    GPT-4 Technical Report

    OpenAI and Josh Achiam and Steven Adler and Sandhini Agarwal and Lama Ahmad and Ilge Akkaya and Florencia Leoni Aleman and Diogo Almeida and Janko Altenschmidt and Sam Altman and Shyamal Anadkat and Red Avila and Igor Babuschkin and Suchir Balaji and others, “GPT-4 Technical Report,” 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

  47. [47]

    GPT-4o mini,

    OpenAI, “GPT-4o mini,” OpenAI Website, Jul. 2024, accessed: 2025-07-09. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o-mini

  48. [48]

    Introducing GPT-4.1 in the API,

    ——, “Introducing GPT-4.1 in the API,” OpenAI Website, Apr. 2025, accessed: 2025-07-09. [Online]. Available: https://openai.com/index/gpt-4-1/

  49. [49]

    The Llama 3 Herd of Models

    Aaron Grattafiori and Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Alex Vaughan and Amy Yang and Angela Fan and Anirudh Goyal and Anthony Hartshorn and others, “The Llama 3 Herd of Models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  50. [50]

    Cheaper, Better, Faster, Stronger,

    M. A. Team, “Cheaper, Better, Faster, Stronger,” Mistral AI Blog, Apr. 2024, accessed: 2025-07-09. [Online]. Available: https://mistral.ai/news/mixtral-8x22b

  51. [51]

    Open LLM Leaderboard,

    H. F. Team, “Open LLM Leaderboard,” Online, accessed: 2025-07-09. [Online]. Available: https: //huggingface.co/spaces/open-llm-leaderboard/open llm leaderboard

  52. [52]

    GPT-4o mini: advancing cost-efficient intelligence,

    OpenAI, “GPT-4o mini: advancing cost-efficient intelligence,” OpenAI Website, Jul. 2024, accessed: 2025-07-09. [Online]. Available: https: //openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  53. [53]

    LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample, “LLaMA: Open and Efficient Foundation Language Models,” 2023. [Online]. Available: http...

  54. [54]

    Mixtral of Experts,

    Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and others, “Mixtral of Experts,”

  55. [55]

    Mixtral of Experts

    [Online]. Available: https://arxiv.org/abs/2401.04088

  56. [56]

    Bigcloneeval: A clone detection tool evaluation framework with bigclonebench,

    J. Svajlenko and C. K. Roy, “Bigcloneeval: A clone detection tool evaluation framework with bigclonebench,” in2016 IEEE international conference on software maintenance and evolution (ICSME). IEEE, 2016, pp. 596–600

  57. [57]

A. S. E. Group. (2013) IJaDataset 2.0. [Online]. Available: http://web.archive.org/web/20161231055842/http://secold.org/projects/seclone

  58. [58]

    Convolutional Neural Networks over Tree Structures for Programming Language Processing,

    L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional Neural Networks over Tree Structures for Programming Language Processing,” inProceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16. Phoenix, Arizona: AAAI Press, 2016, pp. 1287–1293

  59. [59]

    On Precision of Code Clone Detection Tools,

    F. Farmahinifarahani, V . Saini, D. Yang, H. Sajnani, and C. V . Lopes, “On Precision of Code Clone Detection Tools,” in2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Feb. 2019, pp. 84–94

  60. [61]

    Available: https://arxiv.org/abs/2006.15682

    [Online]. Available: https://arxiv.org/abs/2006.15682

  61. [62]

    Towards a Big Data Curated Benchmark of Inter-project Code Clones,

    J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, “Towards a Big Data Curated Benchmark of Inter-project Code Clones,” in2014 IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 476–480

  62. [63]

    A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    P. Sahoo, A. K. Singh, S. Saha, V . Jain, S. Mondal, and A. Chadha, “A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications,” 2025. [Online]. Available: https://arxiv.org/abs/2402.07927

  63. [64]

    Prompt Engineering in Large Language Models,

    G. Marvin, N. Hellen, D. Jjingo, and J. Nakatumba-Nabende, “Prompt Engineering in Large Language Models,” inData Intelligence and Cognitive Informatics, I. J. Jacob, S. Piramuthu, and P. Falkowski-Gilski, Eds. Singapore: Springer Nature Singapore, 2024, pp. 387–402

  64. [65]

    Unleashing the potential of prompt engineering for large language models,

B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering for large language models,” Patterns, vol. 6, no. 6, p. 101260, Jun. 2025. [Online]. Available: http://dx.doi.org/10.1016/j.patter.2025.101260

  65. [66]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

  66. [67]

    A Survey of Large Language Models,

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong, Y . Du, C. Yang, Y . Chen, Z. Chen, J. Jiang, R. Ren, Y . Li, X. Tang, Z. Liu, P. Liu, J.-Y . Nie, and J.-R. Wen, “A Survey of Large Language Models,” 2023

  67. [68]

    A Survey on In-context Learning

    Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, T. Liu, B. Chang, X. Sun, L. Li, and Z. Sui, “A Survey on In-context Learning,” 2024. [Online]. Available: https://arxiv.org/abs/2301.00234

  68. [69]

    Chain-of-thought prompting elicits reasoning in large language models,

    Wei Jason, Wang Xuezhi, Schuurmans Dale, Bosma Maarten, Ichter Brian, Xia Fei, Chi Ed H., Le, Quoc V ., and Zhou, Denny, “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Conference on Neural Information Processing Sys- tems, ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022

  69. [70]

    Large language models are zero-shot reasoners,

    Kojima Takeshi, Gu Shixiang Shane, Reid Machel, Matsuo Yutaka, and Iwasawa Yusuke, “Large language models are zero-shot reasoners,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY , USA: Curran Associates Inc., 2022

  70. [71]

    Automatic chain of thought prompting in large language models,

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola, “Automatic Chain of Thought Prompting in Large Language Models,” 2022. [Online]. Available: https://arxiv.org/abs/2210.03493

  71. [72]

    A Learning Algorithm for Boltzmann Machines,

    D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,”Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0364021385800124

  72. [73]

    Controlling Linguistic Style Aspects in Neural Language Generation,

    J. Ficler and Y . Goldberg, “Controlling Linguistic Style Aspects in Neural Language Generation,” 2017. [Online]. Available: https://arxiv.org/abs/1707.02633

  73. [74]

    Hierarchical neural story generation.CoRR, abs/1805.04833, 2018

    A. Fan, M. Lewis, and Y . Dauphin, “Hierarchical Neural Story Generation,” 2018. [Online]. Available: https://arxiv.org/abs/1805.04833

  74. [75]

    The Curious Case of Neural Text Degeneration

    A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi, “The Curious Case of Neural Text Degeneration,” 2020. [Online]. Available: https://arxiv.org/abs/1904.09751

  75. [76]

    What’s going on with the Open LLM Leaderboard?

    C. Fourrier, N. Habib, J. Launay, and T. Wolf, “What’s going on with the Open LLM Leaderboard?” Hugging Face Blog, Jun. 2023, accessed: 2025-07-06. [Online]. Available: https://huggingface.co/blog/open-llm-leaderboard-mmlu

  76. [77]

    Exploring the Characteristics of Identifiers: A Large-Scale Empirical Study on 5,000 Open Source Projects,

    Zhang, Jingxuan and Liu, Siyuan and Luo, Junpeng and Liang, Jiahui and Huang, Zhiqiu, “Exploring the Characteristics of Identifiers: A Large-Scale Empirical Study on 5,000 Open Source Projects,”IEEE Access, vol. 8, pp. 140 607–140 620, 2020

  77. [78]

    Algorithm identification in programming assignments,

    P. Chourasia, G. Ramakrishnan, V . Apte, and S. Kumar, “Algorithm identification in programming assignments,” inProceedings of the 30th IEEE/ACM International Conference on Program Comprehension, ser. ICPC ’22. New York, NY , USA: Association for Computing Machinery, 2022, pp. 471–481. [Online]. Available: https://doi.org/10.1145/3524610.3527914

  78. [79]

    BigCloneBench Considered Harmful for Machine Learning,

    J. Krinke and C. Ragkhitwetsagul, “BigCloneBench Considered Harmful for Machine Learning,” in2022 IEEE 16th International Workshop on Software Clones (IWSC), 2022, pp. 1–7

  79. [80]

    Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,

    R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier, “Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,”Software: Practice and Experience, vol. 46, pp. 1155–1179, 2015. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01078532/document

  80. [81]

    Attention Is All You Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,”

Showing first 80 references.