TestEval : Benchmarking Large Language Models for Test Case Generation

Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma · 2025 · DOI 10.18653/v1/2025.findings-naacl.197

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open at publisher browse 2 citing papers

representative citing papers

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

cs.SE · 2026-05-21 · unverdicted · novelty 6.0

SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.

Program Structure-aware Language Models: Targeted Software Testing beyond Textual Semantics

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

GLMTest integrates code property graphs and GNNs with LLMs to steer test case generation toward targeted branches, raising branch accuracy from 27.4% to 50.2% on the TestGenEval benchmark.

citing papers explorer

Showing 2 of 2 citing papers.

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering? cs.SE · 2026-05-21 · unverdicted · none · ref 96
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
Program Structure-aware Language Models: Targeted Software Testing beyond Textual Semantics cs.SE · 2026-04-20 · unverdicted · none · ref 34
GLMTest integrates code property graphs and GNNs with LLMs to steer test case generation toward targeted branches, raising branch accuracy from 27.4% to 50.2% on the TestGenEval benchmark.

TestEval : Benchmarking Large Language Models for Test Case Generation

fields

years

verdicts

representative citing papers

citing papers explorer