pith. machine review for the scientific record. sign in

arxiv: 2502.06556 · v5 · submitted 2025-02-10 · 💻 cs.SE · cs.CL

Recognition: unknown

MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Authors on Pith no claims yet
classification 💻 cs.SE cs.CL
keywords multifiletestgenerationllmstestunitfrontiermulti-file-levelbenchmark
0
0 comments X
read the original abstract

Unit test generation has become a promising and important Large Language Model (LLM) use case. However, existing evaluation benchmarks for LLM unit test generation focus on function- or class-level code (single-file) rather than more practical and challenging multi-file-level codebases. To address such a limitation, we propose MultiFileTest, a multi-file-level benchmark for unit test generation covering Python, Java, and JavaScript. MultiFileTest features 20 moderate-sized and high-quality projects per language. We evaluate eleven frontier LLMs on MultiFileTest, and the results show that most frontier LLMs tested exhibit moderate performance on MultiFileTest, highlighting the difficulty of MultiFileTest. We also conduct a thorough error analysis, which shows that even advanced LLMs, such as Gemini-3.0-Pro, exhibit basic yet critical errors, including executability and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset is available at \href{https://github.com/YiboWANG214/ProjectTest}{MultiFileTest}.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.