SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '14) , pages =
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SE 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.