Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
Pith reviewed 2026-05-13 19:28 UTC · model grok-4.3
The pith
Static code analysis filters paired with LLMs cut model calls by up to 97.5 percent while raising algorithm recognition accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Combining LLMs with lightweight static analysis using different filter patterns reduces required LLM calls by 72.39-97.50 percent depending on the pattern chosen. The same combination raises F1-scores by up to 12 percentage points over the LLM-only baseline. In-context learning with two examples gives a practical trade-off of 75-77 percent F1 at modest extra cost, and the models continue to recognize most algorithms even after systematic identifier obfuscation.
What carries the argument
Lightweight static analysis filter patterns that pre-screen code snippets and decide whether to invoke the LLM for algorithm classification.
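A minimal sketch of this gating idea, with hypothetical filter patterns and a stubbed model call (nothing here is the paper's actual implementation):

```python
import re

# Hypothetical keyword filters per algorithm; the paper's actual patterns differ.
PATTERNS = {
    "bubble_sort": re.compile(r"\b(bubble|swap|sorted)\b", re.IGNORECASE),
    "binary_search": re.compile(r"\b(low|high|mid|binary)\b", re.IGNORECASE),
}

def classify(snippet, llm_call):
    """Pre-screen with cheap regex filters; invoke the LLM only on matches."""
    results = {}
    for algo, pattern in PATTERNS.items():
        if pattern.search(snippet):                   # cheap static check
            results[algo] = llm_call(algo, snippet)   # expensive model call
        else:
            results[algo] = False                     # rejected: no LLM call
    return results

# Count how many LLM calls the filter actually lets through.
calls = []
def fake_llm(algo, snippet):
    calls.append(algo)
    return True

out = classify("while low <= high: mid = (low + high) // 2", fake_llm)
# Only binary_search reaches the LLM; bubble_sort is filtered out for free.
```

Every snippet a filter rejects is a model call saved, which is where the reported 72.39-97.50 percent reductions come from.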
If this is right
- In-context learning with two examples balances accuracy and speed across the tested algorithms.
- LLMs retain most of their recognition ability when all identifiers are replaced by meaningless tokens.
- The hybrid method delivers both lower runtime and higher F1 than either component used alone.
- Different filter patterns produce different trade-offs between call reduction and accuracy.
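The obfuscation claim above can be illustrated with a toy identifier rewriter (a minimal sketch; the paper works on Java code, and this regex-based Python version is an assumption, not the authors' tooling):

```python
import keyword
import re

def obfuscate(code):
    """Replace every identifier with a meaningless token (v0, v1, ...),
    leaving language keywords intact. A toy version of the systematic
    identifier obfuscation used to probe name dependence."""
    mapping = {}
    def repl(match):
        name = match.group(0)
        if keyword.iskeyword(name):
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]
    return re.sub(r"[A-Za-z_]\w*", repl, code)

obfuscate("def transpose(matrix): return zip(*matrix)")
# -> 'def v0(v1): return v2(*v1)'
```

If a model still labels the rewritten snippet correctly, its judgment must rest on structure rather than on names like `transpose`.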
Where Pith is reading between the lines
- Similar pre-filters could reduce LLM usage in other static analysis tasks such as bug pattern detection.
- The approach suggests a general design pattern of cheap structural checks before expensive model inference.
- Developers might integrate these filters into IDEs to provide on-the-fly algorithm labels during browsing.
Load-bearing premise
The static filter patterns reliably exclude only non-matching code without missing any true algorithm implementations or biasing the test set.
What would settle it
A new test collection containing algorithm implementations that the chosen static filters incorrectly reject, measured by whether recall falls below the pure-LLM baseline.
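The proposed test reduces to measuring filter recall against labeled ground truth. A minimal sketch (snippet IDs and the failing filter are made up for illustration):

```python
def filter_recall(snippets, filter_fn, is_true_instance):
    """Fraction of true algorithm implementations the static filter lets
    through to the LLM. Anything below 1.0 means silently dropped true
    positives, which caps the hybrid pipeline's recall on the full corpus."""
    positives = [s for s in snippets if is_true_instance[s]]
    if not positives:
        return 1.0
    return sum(1 for s in positives if filter_fn(s)) / len(positives)

# Toy corpus: the filter misses one of the two real implementations.
truth = {"impl_a": True, "impl_b": True, "random_c": False}
recall = filter_recall(["impl_a", "impl_b", "random_c"],
                       lambda s: s == "impl_a", truth)
# recall == 0.5: the hybrid system can never exceed 50% recall here,
# regardless of how well the LLM performs on what survives the filter.
```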
Original abstract
Context: Since it is well-established that developers spend a substantial portion of their time understanding source code, the ability to automatically identify algorithms within source code presents a valuable opportunity. This capability can support program comprehension, facilitate maintenance, and enhance overall software quality. Objective: We empirically evaluate how combining LLMs with static code analysis can improve the automated recognition of algorithms, while also evaluating their standalone performance and dependence on identifier names. Method: We perform multiple experiments evaluating the combination of LLMs with static analysis using different filter patterns. We compare this combined approach against their standalone performance under various prompting strategies and investigate the impact of systematic identifier obfuscation on classification performance and runtime. Results: The combination of LLMs with lightweight static analysis performs surprisingly well, reducing required LLM calls by 72.39-97.50% depending on the filter pattern. This not only lowers runtime significantly but also improves F1-scores by up to 12 percentage points (pp) compared to the baseline. Regarding the different prompting strategies, in-context learning with two examples provides an effective trade-off between classification performance and runtime efficiency, achieving F1-scores of 75-77% with only a modest increase in inference time. Lastly, we find that LLMs are not solely dependent on name-information as they are still able to identify most algorithm implementations when identifiers are obfuscated. Conclusion: By combining LLMs with static analysis, we achieve substantial reductions in runtime while simultaneously improving F1-scores, underscoring the value of a hybrid approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that combining LLMs with lightweight static code analysis for algorithm recognition reduces required LLM calls by 72.39-97.50% (depending on filter pattern), improves F1-scores by up to 12 pp over LLM-only baselines, that two-example in-context learning offers an effective performance-runtime trade-off (F1 75-77%), and that LLMs remain effective even under systematic identifier obfuscation.
Significance. If the central results hold after addressing filter evaluation, the hybrid method offers a practical route to faster, more accurate algorithm detection tools that could support program comprehension, maintenance, and quality assurance in software engineering. The multi-strategy experiments (filter patterns, prompting variants, obfuscation) and direct runtime measurements provide concrete evidence of efficiency gains that would be valuable for IDE integration or large-scale codebases.
major comments (2)
- [Results] Results section (and abstract): The reported F1 gains (+12 pp) and LLM-call reductions (72-97%) are measured only on the post-filter subset; no recall figures for the static filter patterns are provided, nor is there a manual audit or false-negative count for rejected snippets. If the filters silently drop true algorithm instances, the headline metrics become conditional on near-perfect filter recall and may overstate the hybrid system's advantage on the full input distribution.
- [Method] Method section: The experimental description lacks sample sizes, number of distinct algorithms/code snippets, statistical significance tests, and explicit controls for selection bias introduced by the filters. These omissions make it impossible to judge whether the 75-77% F1 range and the obfuscation results are robust or sensitive to the particular corpus and filter thresholds chosen.
minor comments (2)
- [Abstract] Abstract and Results: The exact number of prompting strategies, filter patterns, and obfuscation levels tested should be stated numerically rather than described qualitatively as 'multiple'.
- [Results] Clarify whether the baseline LLM-only F1 scores were computed on the identical post-filter subset or on the full unfiltered set; the comparison must be apples-to-apples to support the 'improves F1' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and completeness on the points raised.
Point-by-point responses
Referee: [Results] Results section (and abstract): The reported F1 gains (+12 pp) and LLM-call reductions (72-97%) are measured only on the post-filter subset; no recall figures for the static filter patterns are provided, nor is there a manual audit or false-negative count for rejected snippets. If the filters silently drop true algorithm instances, the headline metrics become conditional on near-perfect filter recall and may overstate the hybrid system's advantage on the full input distribution.
Authors: We agree that the headline metrics are conditional on the post-filter subset and that recall of the static filters is an important missing piece for evaluating the hybrid approach on the full input distribution. In the revised manuscript we will add recall figures for each filter pattern, report the number of true-positive algorithm instances rejected by the filters, and include a manual audit of a random sample of rejected snippets to quantify false-negative rates. These additions will make the conditional nature of the results explicit and allow readers to assess the overall trade-off. revision: yes
Referee: [Method] Method section: The experimental description lacks sample sizes, number of distinct algorithms/code snippets, statistical significance tests, and explicit controls for selection bias introduced by the filters. These omissions make it impossible to judge whether the 75-77% F1 range and the obfuscation results are robust or sensitive to the particular corpus and filter thresholds chosen.
Authors: We acknowledge the need for greater experimental detail. The revised Method section will report the exact number of code snippets and distinct algorithms in the corpus, include statistical significance tests (e.g., McNemar’s test for paired F1 comparisons), and discuss selection bias by reporting results across multiple filter-threshold settings and on the unfiltered corpus where feasible. These changes will allow readers to evaluate robustness directly. revision: yes
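An exact McNemar test of the kind the authors propose needs only the two discordant counts. A minimal sketch (the counts below are invented for illustration):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pairs: b = snippets only
    classifier A got right, c = snippets only classifier B got right.
    Under H0 the two discordant outcomes are equally likely (p = 0.5)."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided: double the smaller tail

p = mcnemar_exact(2, 12)  # B wins 12 of 14 discordant snippets
# p is roughly 0.013, significant at the 5% level
```

The exact binomial form is preferable here because per-algorithm discordant counts are likely small, where the chi-square approximation is unreliable.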
Circularity Check
No significant circularity in empirical evaluation
Full rationale
The paper reports results from direct experimental measurements comparing hybrid LLM+static-analysis pipelines against baselines on code datasets. No equations, parameter fitting, or derivations are present; performance metrics (F1, runtime, call reduction) are computed from observed outcomes rather than constructed from inputs. Self-citations, if any, are not load-bearing for the central claims, which rest on reproducible experimental comparisons.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The selected code snippets represent typical implementations of the target algorithms.
Reference graph
Works this paper leans on
-
[1]
Since OpenAI’s API pricing is based on token count, we used the reduced dataset for GPT
Results: To choose a sensible baseline for further experiments we compared the Yes/No and the score-prompt. Since OpenAI’s API pricing is based on token count, we used the reduced dataset for GPT. Figure 1 shows the average F1-score over all algorithms in the dataset for each of the evaluated LLMs with the two prompting styles. When using the score promp...
-
[2]
Discussion: In our experiment, the binary (Y/N) and score prompts yield very similar F1-scores. However, the score prompt offers the added benefit of enabling adjustments to the recall-precision trade-off. For this reason, we select the score prompt as the baseline for all subsequent experiments. We also tested a finer 0–10 scale. GPT and Llama performe...
-
[3]
Pre-study Regarding Negative Examples: We also wondered whether negative examples consisting of random code or code that shares conceptual or structural similarity with the algorithm — without actually implementing it — would lead to better results. We therefore defined the following two types of negative examples: (i) random negatives are methods that share...
-
[4]
For all models providing examples in the context increases performance compared to the baseline
Results: Figure 3 displays the F1-score achieved by the LLMs when using the different example combinations compared to the baseline. For all models, providing examples in the context increases performance compared to the baseline. GPT improves from a baseline F1-score of 69% to 77% while Llama improves from 70% to 78% with the (4P+4N) combination. For GPT...
-
[5]
Discussion: From the experiment, we conclude that in-context learning improves performance by 4–8 percentage points (pp). We also find that positive examples have a higher relative improvement in performance compared to negative examples. Providing more than two positive examples only marginally increases performance while at the same time linearly increas...
-
[6]
With CoT prompting GPT, Llama and Mixtral achieved F1-scores of 68%, 67% and 72% respectively
Results: Figure 4 displays the results for CoT compared to the baseline and the best-performing in-context learning combination (4P+4N) from our previous experiments. With CoT prompting GPT, Llama and Mixtral achieved F1-scores of 68%, 67% and 72% respectively. Surprisingly, the use of CoT prompting leads to a decrease in performance for the GPT and Llama m...
-
[7]
Discussion: We find that CoT prompting is inferior when compared to ICL, both in terms of achieved F1-score as well as runtime. One possible explanation for this could be that our models are too small to take full advantage of CoT. This explanation is supported by the experiments of Wei et al. [54], who find that CoT prompting only shows its positive effec...
-
[8]
Creation recipe: First, we extracted all identifiers from the example implementations
Recall Focused: These patterns aim to maximize recall to avoid excluding any true positives. Creation recipe: First, we extracted all identifiers from the example implementations. Next we divided identifiers into explicit (e.g., transposeMatrix, transpose) and generic (e.g., rows, cols, temp, ...). We then defined regular expressions for each group and in...
-
[9]
Recall Focused Enhanced Precision: With these patterns, our goal was to increase precision considerably while maintaining high recall. Although there is no silver bullet that works for each algorithm, the following modifications proved effective in achieving this. Creation recipe: We removed overly generic keywords shared across multiple patterns and lower...
-
[10]
Prominent Feature: The objective of the Prominent Feature patterns is to further enhance precision compared to the keyword-based patterns, while preserving high recall. These patterns capture only the most important features that are characteristic for the implementations of a specific algorithm. Creation recipe: We examined our set of example implementation...
-
[11]
Neumüller et al.: To evaluate their DSL, Neumüller et al. [16] also published a set of algorithm search-patterns for BCEval. Unlike our Prominent Feature patterns — which are lightweight filter heuristics to use with LLMs — their patterns are standalone solutions targeting both high precision and high recall by themselves. As a result, these patterns ar...
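The pattern recipes excerpted above can be sketched with hypothetical identifier lists for a matrix-transpose algorithm (the regexes below are illustrative, not the paper's actual patterns):

```python
import re

# Hypothetical identifier groups, split as in the excerpts:
# explicit names vs. generic ones (illustrative only).
EXPLICIT = re.compile(r"\b(transposeMatrix|transpose)\b")
GENERIC = re.compile(r"\b(rows|cols|temp)\b")

def recall_focused(code):
    """Recall-focused filter: any explicit or generic hit passes the snippet
    on to the LLM, so true positives are almost never excluded."""
    return bool(EXPLICIT.search(code) or GENERIC.search(code))

def enhanced_precision(code):
    """Enhanced-precision variant: drop the overly generic keywords and
    require an explicit identifier, trading a little recall for precision."""
    return bool(EXPLICIT.search(code))

recall_focused("for (int i = 0; i < rows; i++)")      # True: generic hit
enhanced_precision("for (int i = 0; i < rows; i++)")  # False: filtered out
```

The two functions make the trade-off concrete: the recall-focused pattern forwards more snippets (more LLM calls, fewer missed implementations), while the precision-focused one forwards fewer.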
-
[12]
Measuring program comprehension: a large-scale field study with professionals,
X. Xia, L. Bao, D. Lo, Z. Xing, A. E. Hassan, and S. Li, “Measuring program comprehension: a large-scale field study with professionals,” in Proceedings of the 40th International Conference on Software Engineering, ser. ICSE ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 584. [Online]. Available: https://doi.org/10.1145/3180155.3182538
-
[13]
I know what you did last summer: an investigation of how developers spend their time,
R. Minelli, A. Mocci, and M. Lanza, “I know what you did last summer: an investigation of how developers spend their time,” in Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ser. ICPC ’15. Florence, Italy: IEEE Press, 2015, pp. 25–35
-
[14]
Recovering Architectural Design Decisions,
A. Shahbazian, Y. K. Lee, D. M. Le, Y. Brun, and N. Medvidovic, “Recovering Architectural Design Decisions,” in IEEE International Conference on Software Architecture, ICSA 2018, Seattle, WA, USA, April 30 - May 4, 2018. IEEE Computer Society, 2018, pp. 95–104. [Online]. Available: https://doi.org/10.1109/ICSA.2018.00019
-
[15]
Model-Driven Reverse Engineering Approaches: A systematic literature review,
C. Raibulet, F. A. Fontana, and M. Zanoni, “Model-Driven Reverse Engineering Approaches: A systematic literature review,” IEEE Access, vol. 5, pp. 14516–14542, 2017. [Online]. Available: https://doi.org/10.1109/ACCESS.2017.2733518
-
[16]
Program concept recognition and transformation,
W. Kozaczynski, J. Ning, and A. Engberts, “Program concept recognition and transformation,” IEEE Transactions on Software Engineering, vol. 18, no. 12, pp. 1065–1075, Dec. 1992
-
[17]
A Memory-Based Approach to Recognizing Programming Plans,
A. Quilici, “A Memory-Based Approach to Recognizing Programming Plans,” Commun. ACM, vol. 37, no. 5, pp. 84–93, May 1994. [Online]. Available: https://doi.org/10.1145/175290.175301
-
[18]
Using Attributed Flow Graph Parsing to Recognize Clichés in programs,
L. M. Wills, “Using Attributed Flow Graph Parsing to Recognize Clichés in programs,” in Graph Grammars and Their Application to Computer Science, 5th International Workshop, Williamsburg, VA, USA, November 13-18, 1994, Selected Papers, ser. Lecture Notes in Computer Science, J. E. Cuny, H. Ehrig, G. Engels, and G. Rozenberg, Eds., vol. 1073. Springer, 1994...
-
[19]
R. Metzger and Z. Wen, Automatic algorithm recognition and replacement: a new approach to program optimization. MIT Press, 2000
-
[20]
Algorithm Recognition based on Demand-Driven Dataflow Analysis,
C. Alias and D. Barthou, “Algorithm Recognition based on Demand-Driven Dataflow Analysis,” in 10th Working Conference on Reverse Engineering (WCRE 2003), Victoria, Canada, Nov. 2003. [Online]. Available: https://ens-lyon.hal.science/ensl-01663748
-
[21]
Autonomous mental development for algorithm recognition,
G. Zhu and X. Zhu, “Autonomous mental development for algorithm recognition,” in International Conference on Information Science and Technology, 2011, pp. 339–347
-
[22]
Beacon- and Schema-Based Method for Recognizing Algorithms from Students’ Source Code
A. Taherkhani and L. Malmi, “Beacon- and Schema-Based Method for Recognizing Algorithms from Students’ Source Code,” Journal of Educational Data Mining, vol. 5, no. 2, pp. 69–101, 2013
-
[23]
Towards a Framework for Algorithm Recognition in Binary Code,
F. Mesnard, E. Payet, and W. Vanhoof, “Towards a Framework for Algorithm Recognition in Binary Code,” in Proceedings of the 18th International Symposium on Principles and Practice of Declarative Programming, ser. PPDP ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 202–213. [Online]. Available: https://doi.org/10.1145/2967973.2968600
-
[24]
ARCC: Assistant for Repetitive Code Comprehension,
W. Z. Nunez, V. J. Marin, and C. R. Rivero, “ARCC: Assistant for Repetitive Code Comprehension,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: Association for Computing Machinery, 2017, pp. 999–1003. [Online]. Available: https://doi.org/10.1145/3106237.3122824
-
[25]
Automated Personalized Feedback in Introductory Java Programming MOOCs,
V. J. Marin, T. Pereira, S. Sridharan, and C. R. Rivero, “Automated Personalized Feedback in Introductory Java Programming MOOCs,” in 2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017, pp. 1259–1270
-
[26]
T. Long, Y. Xie, X. Chen, W. Zhang, Q. Cao, and Y. Yu, “Multi-View Graph Representation for Programming Language Processing: An Investigation into Algorithm Detection,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 5, pp. 5792–5799, Jun. 2022. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/20522
-
[27]
Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition,
D. Neumüller, F. Sihler, R. Straub, and M. Tichy, “Exploring the Effectiveness of Abstract Syntax Tree Patterns for Algorithm Recognition,” in 2024 4th International Conference on Code Quality (ICCQ), 2024, pp. 1–18
-
[28]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd Edition. MIT Press, 2009. [Online]. Available: http://mitpress.mit.edu/books/introduction-algorithms
-
[29]
D. Neumüller, A. Raschke, and M. Tichy, “Providing Information About Implemented Algorithms Improves Program Comprehension: A Controlled Experiment,” in Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, ser. EASE ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 383–393. [Online...
-
[30]
Large Language Models for Software Engineering: A Systematic Literature Review,
X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large Language Models for Software Engineering: A Systematic Literature Review,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, Dec. 2024. [Online]. Available: https://doi.org/10.1145/3695988
-
[31]
A Survey on Large Language Models for Code Generation
J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A Survey on Large Language Models for Code Generation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.00515
-
[32]
A survey of large language models for code: Evolution, benchmarking, and future trends
Z. Zheng, K. Ning, Y . Wang, J. Zhang, D. Zheng, M. Ye, and J. Chen, “A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends,” 2024. [Online]. Available: https://arxiv.org/abs/2311.10372
-
[33]
Few-shot training LLMs for project-specific code-summarization,
T. Ahmed and P. Devanbu, “Few-shot training LLMs for project-specific code-summarization,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’22. New York, NY, USA: Association for Computing Machinery,
-
[34]
[Online]. Available: https://doi.org/10.1145/3551349.3559555
-
[35]
Understanding Code Semantics: An Evaluation of Transformer Models in Summarization,
D. Mondal, A. Lodha, A. Sahoo, and B. Kumari, “Understanding Code Semantics: An Evaluation of Transformer Models in Summarization,” 2023
-
[36]
Exploring and Unleashing the Power of Large Language Models in Automated Code Translation,
Z. Yang, F. Liu, Z. Yu, J. W. Keung, J. Li, S. Liu, Y. Hong, X. Ma, Z. Jin, and G. Li, “Exploring and Unleashing the Power of Large Language Models in Automated Code Translation,” Proc. ACM Softw. Eng., vol. 1, no. FSE, Jul. 2024. [Online]. Available: https://doi.org/10.1145/3660778
-
[37]
Scalable, Validated Code Translation of Entire Projects using Large Language Models,
H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, Validated Code Translation of Entire Projects using Large Language Models,” Proc. ACM Program. Lang., vol. 9, no. PLDI, Jun. 2025. [Online]. Available: https://doi.org/10.1145/3729315
-
[38]
Automated program repair in the era of large pre-trained language models
C. S. Xia, Y. Wei, and L. Zhang, “Automated Program Repair in the Era of Large Pre-Trained Language Models,” in Proceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23. Melbourne, Victoria, Australia: IEEE Press, 2023, pp. 1482–1494. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00129
-
[39]
Repair is nearly generation: multilingual program repair with LLMs,
H. Joshi, J. C. Sanchez, S. Gulwani, V. Le, I. Radiček, and G. Verbruggen, “Repair is nearly generation: multilingual program repair with LLMs,” in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances ...
-
[40]
[Online]. Available: https://doi.org/10.1609/aaai.v37i4.25642
-
[41]
Reporting guidelines for controlled experiments in software engineering,
A. Jedlitschka and D. Pfahl, “Reporting guidelines for controlled experiments in software engineering,” in 2005 International Symposium on Empirical Software Engineering, 2005, pp. 1–10
-
[42]
A. Field and G. Hole, How to Design and Report Experiments. London, Thousand Oaks, New Delhi, Singapore, Washington DC: SAGE Publications, 2003
-
[43]
Evaluating Large Language Models Trained on Code,
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, Elizabeth et al.,...
-
[44]
Instruction-Following Evaluation for Large Language Models
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-Following Evaluation for Large Language Models,” 2023. [Online]. Available: https://arxiv.org/abs/2311.07911
-
[45]
Comparison of AI Models: Intelligence, Performance & Price Analysis,
Artificial Analysis, “Comparison of AI Models: Intelligence, Performance & Price Analysis,” Online, Jul. 2025, accessed: 2025-07-09. [Online]. Available: https://artificialanalysis.ai/models
-
[46]
OpenAI and Josh Achiam and Steven Adler and Sandhini Agarwal and Lama Ahmad and Ilge Akkaya and Florencia Leoni Aleman and Diogo Almeida and Janko Altenschmidt and Sam Altman and Shyamal Anadkat and Red Avila and Igor Babuschkin and Suchir Balaji and others, “GPT-4 Technical Report,” 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
-
[47]
OpenAI, “GPT-4o mini,” OpenAI Website, Jul. 2024, accessed: 2025-07-09. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o-mini
-
[48]
Introducing GPT-4.1 in the API,
——, “Introducing GPT-4.1 in the API,” OpenAI Website, Apr. 2025, accessed: 2025-07-09. [Online]. Available: https://openai.com/index/gpt-4-1/
-
[49]
Aaron Grattafiori and Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Alex Vaughan and Amy Yang and Angela Fan and Anirudh Goyal and Anthony Hartshorn and others, “The Llama 3 Herd of Models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783
-
[50]
Cheaper, Better, Faster, Stronger,
M. A. Team, “Cheaper, Better, Faster, Stronger,” Mistral AI Blog, Apr. 2024, accessed: 2025-07-09. [Online]. Available: https://mistral.ai/news/mixtral-8x22b
-
[51]
H. F. Team, “Open LLM Leaderboard,” Online, accessed: 2025-07-09. [Online]. Available: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
-
[52]
GPT-4o mini: advancing cost-efficient intelligence,
OpenAI, “GPT-4o mini: advancing cost-efficient intelligence,” OpenAI Website, Jul. 2024, accessed: 2025-07-09. [Online]. Available: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
-
[53]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample, “LLaMA: Open and Efficient Foundation Language Models,” 2023. [Online]. Available: http...
-
[54]
Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and others, “Mixtral of Experts,”
-
[55]
[Online]. Available: https://arxiv.org/abs/2401.04088
-
[56]
BigCloneEval: a clone detection tool evaluation framework with BigCloneBench,
J. Svajlenko and C. K. Roy, “BigCloneEval: a clone detection tool evaluation framework with BigCloneBench,” in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2016, pp. 596–600
- [57]
-
[58]
Convolutional Neural Networks over Tree Structures for Programming Language Processing,
L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional Neural Networks over Tree Structures for Programming Language Processing,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16. Phoenix, Arizona: AAAI Press, 2016, pp. 1287–1293
-
[59]
On Precision of Code Clone Detection Tools,
F. Farmahinifarahani, V. Saini, D. Yang, H. Sajnani, and C. V. Lopes, “On Precision of Code Clone Detection Tools,” in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Feb. 2019, pp. 84–94
-
[61]
[Online]. Available: https://arxiv.org/abs/2006.15682
-
[62]
Towards a Big Data Curated Benchmark of Inter-project Code Clones,
J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, “Towards a Big Data Curated Benchmark of Inter-project Code Clones,” in 2014 IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 476–480
-
[63]
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications,” 2025. [Online]. Available: https://arxiv.org/abs/2402.07927
-
[64]
Prompt Engineering in Large Language Models,
G. Marvin, N. Hellen, D. Jjingo, and J. Nakatumba-Nabende, “Prompt Engineering in Large Language Models,” in Data Intelligence and Cognitive Informatics, I. J. Jacob, S. Piramuthu, and P. Falkowski-Gilski, Eds. Singapore: Springer Nature Singapore, 2024, pp. 387–402
-
[65]
Unleashing the potential of prompt engineering for large language models,
B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering for large language models,” Patterns, vol. 6, no. 6, p. 101260, Jun. 2025. [Online]. Available: http://dx.doi.org/10.1016/j.patter.2025.101260
-
[66]
Language models are few-shot learners,
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...
-
[67]
A Survey of Large Language Models,
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen, “A Survey of Large Language Models,” 2023
-
[68]
A Survey on In-context Learning
Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, T. Liu, B. Chang, X. Sun, L. Li, and Z. Sui, “A Survey on In-context Learning,” 2024. [Online]. Available: https://arxiv.org/abs/2301.00234
-
[69]
Chain-of-thought prompting elicits reasoning in large language models,
Wei Jason, Wang Xuezhi, Schuurmans Dale, Bosma Maarten, Ichter Brian, Xia Fei, Chi Ed H., Le Quoc V., and Zhou Denny, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022
-
[70]
Large language models are zero-shot reasoners,
Kojima Takeshi, Gu Shixiang Shane, Reid Machel, Matsuo Yutaka, and Iwasawa Yusuke, “Large language models are zero-shot reasoners,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022
-
[71]
Automatic chain of thought prompting in large language models,
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola, “Automatic Chain of Thought Prompting in Large Language Models,” 2022. [Online]. Available: https://arxiv.org/abs/2210.03493
-
[72]
A Learning Algorithm for Boltzmann Machines,
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A Learning Algorithm for Boltzmann Machines,” Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0364021385800124
-
[73]
Controlling Linguistic Style Aspects in Neural Language Generation,
J. Ficler and Y. Goldberg, “Controlling Linguistic Style Aspects in Neural Language Generation,” 2017. [Online]. Available: https://arxiv.org/abs/1707.02633
-
[74]
Hierarchical Neural Story Generation,
A. Fan, M. Lewis, and Y. Dauphin, “Hierarchical Neural Story Generation,” 2018. [Online]. Available: https://arxiv.org/abs/1805.04833
-
[75]
The Curious Case of Neural Text Degeneration
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi, “The Curious Case of Neural Text Degeneration,” 2020. [Online]. Available: https://arxiv.org/abs/1904.09751
-
[76]
What’s going on with the Open LLM Leaderboard?
C. Fourrier, N. Habib, J. Launay, and T. Wolf, “What’s going on with the Open LLM Leaderboard?” Hugging Face Blog, Jun. 2023, accessed: 2025-07-06. [Online]. Available: https://huggingface.co/blog/open-llm-leaderboard-mmlu
-
[77]
Zhang, Jingxuan and Liu, Siyuan and Luo, Junpeng and Liang, Jiahui and Huang, Zhiqiu, “Exploring the Characteristics of Identifiers: A Large-Scale Empirical Study on 5,000 Open Source Projects,” IEEE Access, vol. 8, pp. 140607–140620, 2020
-
[78]
Algorithm identification in programming assignments,
P. Chourasia, G. Ramakrishnan, V. Apte, and S. Kumar, “Algorithm identification in programming assignments,” in Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, ser. ICPC ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 471–481. [Online]. Available: https://doi.org/10.1145/3524610.3527914
-
[79]
BigCloneBench Considered Harmful for Machine Learning,
J. Krinke and C. Ragkhitwetsagul, “BigCloneBench Considered Harmful for Machine Learning,” in 2022 IEEE 16th International Workshop on Software Clones (IWSC), 2022, pp. 1–7
-
[80]
Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,
R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier, “Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,” Software: Practice and Experience, vol. 46, pp. 1155–1179, 2015. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01078532/document
-
[81]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,”