
arXiv: 2504.15564 · v3 · submitted 2025-04-22 · cs.SE · cs.AI · cs.LG


OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

keywords: classes, class, openclassgen, corpus, generation, analysis, code, complete

Existing class-level code generation datasets are either synthetic (ClassEval: 100 classes) or insufficient in scale for modern training needs (RealClassEval: 400 classes), hindering robust evaluation and empirical analysis. We present OpenClassGen, a large-scale corpus of 324,843 Python classes extracted from 2,970 engineered open-source projects. Each entry pairs a human-written class with its corresponding skeleton, which comprises class and method signatures with associated docstrings, and is enriched with 27 static code metrics covering complexity, coupling, cohesion, and inheritance properties. Unlike prior benchmarks that require repository-level context resolution, OpenClassGen provides self-contained class skeletons that serve as complete generation specifications. We demonstrate the corpus's utility by evaluating three LLMs (GPT-o4-mini, Claude-4-Sonnet, Qwen-3-Coder) on a curated, executable subset of 300 classes, enriched with test suites achieving 58% branch coverage. Results show strong semantic similarity (CodeBERTScore-F3: 0.89) but moderate functional correctness (pass rate: 0.33), with substantial variance across models. This variance, along with diverse class characteristics, confirms that OpenClassGen enables meaningful differentiation of LLM capabilities. The dataset supports diverse use cases, including fine-tuning, retrieval-augmented generation, difficulty modelling, and failure mode analysis. The complete dataset and curation scripts are publicly available at https://zenodo.org/records/18409150.
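To make the notion of a class skeleton concrete, the following is a minimal sketch of how class and method signatures with their docstrings could be separated from method bodies using Python's standard `ast` module. It is an illustration only, not the authors' curation script: the function name `make_skeleton`, the choice to replace bodies with `pass`, and the `Counter` example class are all assumptions made here.

```python
import ast


class _BodyStripper(ast.NodeTransformer):
    """Replace every function body with its docstring (if any) plus `pass`."""

    def visit_FunctionDef(self, node):
        doc = ast.get_docstring(node, clean=False)
        body = []
        if doc is not None:
            # Keep the docstring as the first statement of the stub.
            body.append(ast.Expr(value=ast.Constant(value=doc)))
        body.append(ast.Pass())
        node.body = body
        return node

    # Async methods are handled the same way.
    visit_AsyncFunctionDef = visit_FunctionDef


def make_skeleton(source: str) -> str:
    """Hypothetical helper: return signatures and docstrings without bodies."""
    tree = _BodyStripper().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


# Illustrative input class (not taken from the corpus).
EXAMPLE = '''
class Counter:
    """Count things."""

    def __init__(self, start=0):
        """Create a counter."""
        self.value = start

    def increment(self, step=1):
        """Add step to the counter and return the new value."""
        self.value += step
        return self.value
'''

skeleton = make_skeleton(EXAMPLE)
print(skeleton)
```

In this sketch, class-level statements other than methods are left as they are; whether the published skeletons keep or drop such statements is not stated in the abstract.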



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
