Recognition: unknown
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
read the original abstract
Modern software development demands code that is maintainable, testable, and scalable by organizing the implementation into modular components with iterative reuse of existing codes. We formalize this iterative, multi-turn paradigm as codeflow and introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow - implementing new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of 5,000+ competitive programming problems from Codeforces updated via an automated pipeline and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. Furthermore, a novel evaluation framework featured dual assessment protocol and structural metrics derived from dependency trees is introduced. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios. Furthermore, our in-depth analysis illustrates that model performance inversely correlates with dependency complexity. These findings not only highlight the critical challenges for supporting real-world workflows, but also establish CodeFlowBench as an essential tool for advancing code generation research.
This paper has not been read by Pith yet.
Forward citations
Cited by 2 Pith papers
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP is a Monte Carlo simulation-based online planner that improves the probability of agentic workflows completing successfully under explicit budget and deadline constraints compared to baselines on CodeFlow and Pro...
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP uses Monte Carlo simulations of workflow executions to dynamically allocate resources and replan, raising constrained completion probability over baselines on CodeFlow and ProofFlow.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.