Unit test update through LLM-driven context collection and error-type-aware refinement. arXiv preprint arXiv:2509.24419
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.SE 3
years: 2026 3
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates mobile app UIs and builds LLM-guided oracles to validate both correct and faulty user-requested functionalities across six apps.
MuMuTestUp is a mutation-guided multi-agent framework for updating test cases in evolving software that strengthens assertions via surviving mutants, targets specific coverage gaps, and uses semantic search instead of exact matching.
citing papers explorer
- Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
  TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
- Automated Functional Testing for Malleable Mobile Application Driven from User Intent
  ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates mobile app UIs and builds LLM-guided oracles to validate both correct and faulty user-requested functionalities across six apps.
- MuMuTestUp: Mutation-based Multi-Agent Test Case Update
  MuMuTestUp is a mutation-guided multi-agent framework for updating test cases in evolving software that strengthens assertions via surviving mutants, targets specific coverage gaps, and uses semantic search instead of exact matching.