OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

Emad Shihab; Musfiqur Rahman; SayedHassan Khatoonabadi

arxiv: 2504.15564 · v3 · submitted 2025-04-22 · 💻 cs.SE · cs.AI · cs.LG

Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

Pith reviewed 2026-05-22 19:16 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG

keywords Python code generation · LLM evaluation · class-level benchmarks · open source corpus · code metrics · functional correctness · semantic similarity · software engineering

The pith

A corpus of 324,843 real Python classes from open-source projects enables differentiation of LLM code generation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior class-level code generation benchmarks are either synthetic (100 classes) or, at 400 classes, too small to support robust LLM evaluation. OpenClassGen extracts 324,843 actual classes from 2,970 engineered projects, supplying each with a self-contained skeleton of signatures and docstrings plus 27 metrics on complexity, coupling, cohesion, and inheritance. On a curated executable subset of 300 classes equipped with test suites at 58 percent branch coverage, three LLMs achieve high semantic similarity yet only a 0.33 pass rate, with clear variance across models. The paper argues that this outcome, paired with the corpus scale and diversity, shows the resource supports meaningful comparison of model strengths that smaller datasets cannot provide. The full set is released to enable fine-tuning, retrieval-augmented generation, difficulty modelling, and failure analysis at realistic scale.

Core claim

OpenClassGen is a large-scale corpus of 324,843 Python classes extracted from 2,970 engineered open-source projects. Each entry supplies a human-written class together with its self-contained skeleton of class and method signatures plus docstrings, enriched by 27 static code metrics covering complexity, coupling, cohesion, and inheritance. Unlike earlier benchmarks, the skeletons require no repository-level context resolution. Evaluation of GPT-o4-mini, Claude-4-Sonnet, and Qwen-3-Coder (model names as given in the abstract) on a 300-class executable subset with test suites reaching 58 percent branch coverage yields a CodeBERTScore-F3 of 0.89 for semantic similarity but a functional pass rate of only 0.33, with substantial variance across models.

What carries the argument

The OpenClassGen corpus of self-contained class skeletons paired with 27 static code metrics on complexity, coupling, cohesion, and inheritance

If this is right

Fine-tuning LLMs on the full corpus becomes feasible at a scale previously unavailable for class-level tasks.
Retrieval-augmented generation can draw on the provided skeletons and metrics to supply relevant context without repository traversal.
Difficulty modelling can leverage the 27 metrics to predict generation hardness for individual classes.
Failure mode analysis gains statistical power from the volume and diversity of real classes and their associated test outcomes.
Empirical studies of LLM code generation can now examine correlations between static metrics and functional correctness at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future benchmarks could combine OpenClassGen skeletons with higher-coverage test suites to tighten the link between semantic scores and actual runtime success.
The metric set may allow automatic stratification of classes into difficulty tiers for progressive evaluation protocols.
Integration of the corpus with existing method-level datasets could produce hybrid benchmarks that test both isolated methods and full class implementations.
The variance observed suggests that model ranking on class generation may shift when moving from synthetic to real-world distributions.
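
The stratification idea above can be sketched in a few lines. This is an assumption-laden illustration rather than anything the paper specifies: it bins classes into tiers by the quantiles of a single metric (for example, a hypothetical per-class cyclomatic complexity value), and the function name `assign_tiers` is invented for this example.

```python
from bisect import bisect_right
from statistics import quantiles


def assign_tiers(values, n_tiers=3):
    """Map each metric value to a tier index in 0..n_tiers-1.

    Cut points are the empirical quantiles of `values`; a value equal to a
    cut point falls into the higher tier.
    """
    cuts = quantiles(values, n=n_tiers, method="inclusive")
    return [bisect_right(cuts, value) for value in values]


# Hypothetical complexity scores for nine classes.
complexity = [1, 2, 3, 4, 5, 6, 7, 8, 9]
tiers = assign_tiers(complexity)
```

A real protocol would likely combine several of the 27 metrics rather than rely on one, and heavily tied values can leave tiers uneven in size.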

Load-bearing premise

The curated executable subset of 300 classes together with test suites achieving 58 percent branch coverage is representative of the full corpus and sufficient to demonstrate differentiation of model performance.

What would settle it

A follow-up evaluation on a larger or different subset of the corpus that shows identical performance across all three models or that fails to predict results on the remaining classes would falsify the claim that the corpus enables meaningful differentiation.

Original abstract

Existing class-level code generation datasets are either synthetic (ClassEval: 100 classes) or insufficient in scale for modern training needs (RealClassEval: 400 classes), hindering robust evaluation and empirical analysis. We present OpenClassGen, a large-scale corpus of 324,843 Python classes extracted from 2,970 engineered open-source projects. Each entry pairs a human-written class with its corresponding skeleton, which comprises class and method signatures with associated docstrings, and is enriched with 27 static code metrics covering complexity, coupling, cohesion, and inheritance properties. Unlike prior benchmarks that require repository-level context resolution, OpenClassGen provides self-contained class skeletons that serve as complete generation specifications. We demonstrate the corpus's utility by evaluating three LLMs (GPT-o4-mini, Claude-4-Sonnet, Qwen-3-Coder) on a curated, executable subset of 300 classes, enriched with test suites achieving 58% branch coverage. Results show strong semantic similarity (CodeBERTScore-F3: 0.89) but moderate functional correctness (pass rate: 0.33), with substantial variance across models. This variance, along with diverse class characteristics, confirms that OpenClassGen enables meaningful differentiation of LLM capabilities. The dataset supports diverse use cases, including fine-tuning, retrieval-augmented generation, difficulty modelling, and failure mode analysis. The complete dataset and curation scripts are publicly available at https://zenodo.org/records/18409150.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper releases a much larger real-world Python class corpus than prior sets, which is the useful part, but the 300-class evaluation subset is not shown to match the full collection on the reported metrics.

The main thing here is a data release: 324k Python classes pulled from 2970 open-source projects, each with a skeleton, docstrings, and 27 static metrics on complexity, coupling, cohesion, and inheritance. That scale is new compared to the 100-class and 400-class sets they cite, and the self-contained format removes the need for full-repo context that complicates other benchmarks. Releasing the full corpus and curation scripts on Zenodo is the practical step that actually lets others use it.

Referee Report

1 major / 2 minor

Summary. The paper introduces OpenClassGen, a corpus of 324,843 real-world Python classes extracted from 2,970 open-source projects. Each class is provided with a self-contained skeleton (signatures and docstrings) and annotated with 27 static metrics on complexity, coupling, cohesion, and inheritance. The authors evaluate three LLMs on a curated executable subset of 300 classes equipped with test suites that achieve 58% branch coverage, reporting pass@1 = 0.33 and CodeBERTScore-F3 = 0.89 together with substantial inter-model variance, and conclude that the corpus enables meaningful differentiation of LLM capabilities for class-level code generation.

Significance. If the central claims hold, the work supplies a substantially larger and more realistic dataset than prior efforts such as ClassEval or RealClassEval, together with metric annotations and public curation scripts. This scale and the self-contained nature of the skeletons directly support downstream tasks including fine-tuning, retrieval-augmented generation, difficulty modeling, and failure-mode analysis.

major comments (1)

[Evaluation description (abstract and §4)] The selection criteria for the 300-class executable subset are not stated, and no distributional comparison (e.g., quantile plots, KS tests, or summary statistics) is provided between this subset and the full 324,843-class corpus on any of the 27 static metrics. Because the claim that observed performance variance demonstrates the corpus's utility for differentiating LLM capabilities rests on the subset being representative, this omission is load-bearing for the central empirical argument.
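
As an illustration of the kind of distributional check this comment asks for, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for one metric in plain Python. It is not taken from the paper; the function name `ks_statistic` and the sample values are hypothetical. It returns only the statistic D (the largest gap between the two empirical CDFs); a p-value would need an additional step, for instance `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The supremum of the CDF gap is always attained at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical metric values for the full corpus and the executable subset.
corpus_metric = [1, 2, 3, 4]
subset_metric = [3, 4, 5, 6]
d_value = ks_statistic(corpus_metric, subset_metric)
```

Running this once per static metric would give 27 D values; small values would support the representativeness premise, while large ones would indicate the subset is skewed on that metric.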

minor comments (2)

[Abstract] Model names appear as 'GPT-o4-mini' and 'Claude-4-Sonnet'; confirming the exact versions (e.g., GPT-4o-mini, Claude-3.5-Sonnet) would improve reproducibility.
[Evaluation] The 58% branch coverage is reported but not accompanied by any analysis of which code paths remain untested or how coverage correlates with pass rates; a short discussion or supplementary table would clarify the reliability of the functional-correctness metric.
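
One simple way to probe the second minor comment is to correlate per-class branch coverage with a per-class pass indicator. The sketch below is a hand-rolled Pearson correlation under stated assumptions: the function name `pearson` and the input lists are hypothetical, and the paper does not report per-class coverage in the material quoted here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: branch coverage per class and whether the generated
# class passed its test suite (1) or not (0).
coverage = [0.50, 0.55, 0.60, 0.70, 0.80]
passed = [1, 1, 0, 0, 0]
r = pearson(coverage, passed)
```

With a binary pass indicator this is the point-biserial correlation. A negative value would suggest that better-covered classes fail more often, that is, that low coverage may be inflating the pass rate.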

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the evaluation section. We address it directly below and will revise the manuscript to strengthen the empirical argument.

Referee: [Evaluation description (abstract and §4)] The selection criteria for the 300-class executable subset are not stated, and no distributional comparison (e.g., quantile plots, KS tests, or summary statistics) is provided between this subset and the full 324,843-class corpus on any of the 27 static metrics. Because the claim that observed performance variance demonstrates the corpus's utility for differentiating LLM capabilities rests on the subset being representative, this omission is load-bearing for the central empirical argument.

Authors: We agree that the selection criteria and distributional comparison were insufficiently detailed. The 300-class subset was curated by first identifying classes for which we could obtain or construct test suites meeting a minimum 50% branch coverage threshold (resulting in the reported 58% average), while sampling to preserve spread across complexity, coupling, and inheritance metrics. In the revision we will explicitly document this process in §4, add a table of summary statistics (means, medians, and inter-quartile ranges) for all 27 metrics comparing the subset to the full corpus, and include a brief discussion of the subset's scope and limitations. These additions will clarify that the observed inter-model variance is measured on a practically executable and diverse slice of the corpus, thereby supporting the claim of utility for capability differentiation.

revision: yes
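
The promised summary table could be produced along the following lines. This is a sketch under assumptions, not the authors' script: metrics are represented as plain dictionaries mapping a metric name to a list of per-class values, and the names `summarize`, `compare`, and the metric keys are hypothetical.

```python
from statistics import mean, median, quantiles


def summarize(values):
    """Mean, median, and inter-quartile range of one metric."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "median": median(values), "iqr": q3 - q1}


def compare(corpus, subset):
    """Side-by-side summaries for every metric present in both inputs."""
    return {
        metric: {
            "corpus": summarize(corpus[metric]),
            "subset": summarize(subset[metric]),
        }
        for metric in corpus
        if metric in subset
    }


# Hypothetical values for a single metric.
table = compare(
    {"num_methods": [1, 2, 3, 4, 5]},
    {"num_methods": [2, 3, 4]},
)
```

Each row of the resulting table can then be read directly: if subset and corpus summaries diverge sharply on a metric, results on the subset should not be generalized along that dimension.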

Circularity Check

0 steps flagged

No circularity: empirical data construction with no derivations or fitted predictions

The paper constructs a corpus by extracting classes from open-source projects, computes static metrics, curates an executable subset, and reports direct LLM evaluation results (pass@1, CodeBERTScore). No equations, parameter fitting, predictions, or self-citations are used to derive claims; the variance and differentiation statements follow immediately from the observed outputs on the curated subset. The work is self-contained as a dataset release plus benchmark run, with no reduction of results to earlier inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset construction paper. It relies on standard open-source mining and static analysis practices but introduces no free parameters fitted to a central claim, no domain axioms beyond ordinary software engineering assumptions, and no invented entities.

pith-pipeline@v0.9.0 · 5812 in / 1135 out tokens · 62526 ms · 2026-05-22T19:16:09.308892+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We present OpenClassGen, a large-scale corpus of 324,843 Python classes extracted from 2,970 engineered open-source projects. Each entry pairs a human-written class with its corresponding skeleton... enriched with 27 static code metrics covering complexity, coupling, cohesion, and inheritance properties.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Results show strong semantic similarity (CodeBERTScore-F3: 0.89) but moderate functional correctness (pass rate: 0.33), with substantial variance across models.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
cs.SE · 2026-04 · unverdicted · novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

[1] [1]

Docstring - Wikipedia — en.wikipedia.org

2006. Docstring - Wikipedia — en.wikipedia.org. https://en.wikipedia.org/wiki/Docstring. [Accessed 30 -01-2024]

work page 2006

[2] [2]

GitHub - tkaemming/django-subdomains: Subdomai n helpers for the Django framework, including subdomain-based URL routing

2010. GitHub - tkaemming/django-subdomains: Subdomai n helpers for the Django framework, including subdomain-based URL routing. — github.com. https://github.com/tkaemming/django-subdomains. [Acc essed 20-03-2025]

work page 2010

[3] [3]

ast — Abstract Syntax Trees — docs.python.org

2013. ast — Abstract Syntax Trees — docs.python.org. https://docs.python.org/3/library/ast.html. [Accesse d 28-02-2025]

work page 2013

[4] [4]

Understand: The Software Developer’s Multi-Tool — scitools.com

2024. Understand: The Software Developer’s Multi-Tool — scitools.com. https://scitools.com/. [Version 7.0, Build 1217, Accesse d 28-02-2025]

work page 2024

[5] [5]

LLM Leaderboard 2025 — vellum.ai

2025. LLM Leaderboard 2025 — vellum.ai. https://www.vellum.ai/llm-leaderboard. [Accessed 13-0 3-2025]

work page 2025

[6] [6]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad , Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sa m Altman, Shya- mal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Touﬁque Ahmed and Premkumar Devanbu. 2022. Multilingua l training for soft- ware engineering. In Proceedings of the 44th International Conference on Software Engineering. 1443–1455

work page 2022

[8] [8]

Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, an d Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 1–37

work page 2018

[9] [9]

Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xi aopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingy ue Shang, et al. 2022. Multi-lingual evaluation of code generation mo dels. arXiv preprint arXiv:2210.14868 (2022)

work page arXiv 2022

[10] [10]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bos ma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

work page Pith review Pith/arXiv arXiv 2021

[11] [11]

Anonymous Authors. 2025. Anonymous Github — anonymous .4open.science. https://anonymous.4open.science/r/class-level-bench mark-dataset-B132/. [Ac- cessed 23-03-2025]

work page 2025

[12] [12]

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vag eesh DC, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, and Shasha nk Shet. 2023. CodePlan: Repository-level Coding using LLMs and Planning .(2023). arXiv preprint cs.SE/2309.12499 (2023)

work page arXiv 2023

[13] [13]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henri que Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language model s trained on code. arXiv preprint arXiv:2107.03374 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Erik D Demaine, Shay Mozes, Benjamin Rossman, and Oren W eimann. 2009. An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algorithms (TALG) 6, 1 (2009), 1–19

work page 2009

[15] [15]

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually- crafted benchmark for evaluating llms on class-level code g eneration. arXiv preprint arXiv:2308.01861 (2023)

work page arXiv 2023

[16] [16]

Norman E Fenton and Martin Neil. 2000. Software metrics : roadmap. In Proceed- ings of the Conference on the Future of Software Engineering . 357–370

work page 2000

[17] [17]

Zi Gong, Yinpeng Guo, Pingyi Zhou, Cuiyun Gao, Yasheng W ang, and Zenglin Xu. 2022. MultiCoder: Multi-Programming-Lingual Pre-Tra ining for Low- Resource Code Completion. arXiv preprint arXiv:2212.09666 (2022)

[18]

Danielle Gonzalez, Thomas Zimmermann, and Nachiappan Nagappan. 2020. The state of the ml-universe: 10 years of artificial intelligence & machine learning software development on github. In Proceedings of the 17th International conference on mining software repositories. 431–442

[19]

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks are all you need. arXiv preprint arXiv:2306.11644 (2023)

[20]

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022)

[21]

Kai Hartung, Sambit Mallick, Sören Gröttrup, and Munir Georges. 2024. Evaluation Metrics in LLM Code Generation. In International Conference on Text, Speech, and Dialogue. Springer, 214–226

[22]

Junda He, Christoph Treude, and David Lo. 2024. LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead. ACM Transactions on Software Engineering and Methodology (2024)

EASE 2025, 17–20 June, 2025, Istanbul, Türkiye. Rahman et al.

[23]

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021)

[24]

Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131

[26]

Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403 (2022)

[27]

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019)

[28]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint arXiv:1808.09588 (2018)

[30]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515 (2024)

[31]

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning. PMLR, 18319–18345

[32]

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097

[34]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81

[35]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

[36]

Alan MacCormack, John Rusnak, and Carliss Y Baldwin. 2006. Exploring the structure of complex software designs: An empirical study of open source and proprietary code. Management Science 52, 7 (2006), 1015–1030

[37]

Alan MacCormack and Daniel J Sturtevant. 2016. Technical debt and system architecture: The impact of coupling on defect-related activity. Journal of Systems and Software 120 (2016), 170–182

[38]

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005 (2022)

[40]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318

[41]

Profir-Petru Pârțachi and Mahito Sugiyama. 2024. Bringing Structure to Naturalness: On the Naturalness of ASTs. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 378–379

[42]

Jan Pašek, Jakub Sido, Miloslav Konopík, and Ondřej Pražák. 2022. MQDD: Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain. arXiv preprint arXiv:2203.14093 (2022)

[43]

Mateusz Pawlik and Nikolaus Augsten. 2015. Efficient computation of the tree edit distance. ACM Transactions on Database Systems (TODS) 40, 1 (2015), 1–40

[44]

Mateusz Pawlik and Nikolaus Augsten. 2016. Tree edit distance: Robust and memory-efficient. Information Systems 56 (2016), 157–173

[45]

Musfiqur Rahman, SayedHassan Khatoonabadi, Ahmad Abdellatif, and Emad Shihab. 2024. Automatic detection of llm-generated code: A case study of claude 3 haiku. arXiv preprint arXiv:2409.01382 (2024)

[46]

Musfiqur Rahman, Dharani Palani, and Peter C Rigby. 2019. Natural software revisited. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 37–48

[47]

Sebastian Raschka, Joshua Patterson, and Corey Nolet. 2020. Machine learning in python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information 11, 4 (2020), 193

[48]

Iman Saberi, Fatemeh Fard, and Fuxiang Chen. 2023. Utilization of Pre-trained Language Model for Adapter-based Knowledge Transfer in Software Engineering. arXiv preprint arXiv:2307.08540 (2023)

[49]

Yewei Song, Saad Ezzini, Xunzhu Tang, Cedric Lothritz, Jacques Klein, Tegawendé Bissyandé, Andrey Boytsov, Ulrick Ble, and Anne Goujon. 2024. Enhancing Text-to-SQL translation for financial system design. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 252–262

[50]

Yewei Song, Cedric Lothritz, Daniel Tang, Tegawendé F Bissyandé, and Jacques Klein. 2024. Revisiting code similarity evaluation with abstract syntax tree edit distance. arXiv preprint arXiv:2404.08817 (2024)

[51]

Daniel Joseph Sturtevant. 2013. System design and the cost of architectural complexity. Ph.D. Dissertation. Massachusetts Institute of Technology

[52]

Sarvar Sultonov. 2023. IMPORTANCE OF PYTHON PROGRAMMING LANGUAGE IN MACHINE LEARNING. International Bulletin of Engineering and Technology 3, 9 (2023), 28–30

[53]

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023)

[54]

Yutao Yang, Jie Zhou, Xuanwen Ding, Tianyu Huai, Shunyu Liu, Qin Chen, Yuan Xie, and Liang He. 2025. Recent advances of foundation language models-based continual learning: A survey. Comput. Surveys 57, 5 (2025), 1–38

[55]

Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. 2025. An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities. ACM Transactions on Software Engineering and Methodology (2025)

[56]

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th international conference on mining software repositories. 476–486

[57]

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

[58]

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM journal on computing 18, 6 (1989), 1245–1262

[59]

Ziyin Zhang, Chaoyu Chen, Bingchang Liu, Cong Liao, Zi Gong, Hang Yu, Jianguo Li, and Rui Wang. 2023. Unifying the perspectives of nlp and software engineering: A survey on language models for code. arXiv preprint arXiv:2311.07989 (2023)
