PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
SWE-QA creates a new repository-level code QA benchmark with 576 pairs and an agentic LLM framework, showing promise but open challenges for models handling complex codebases.
citing papers explorer
-
PlayCoder: Making LLM-Generated GUI Code Playable
PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.
-
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
-
SWE-QA: Can Language Models Answer Repository-level Code Questions?
SWE-QA creates a new repository-level code QA benchmark with 576 pairs and an agentic LLM framework, showing promise but open challenges for models handling complex codebases.