
arXiv: 2510.14509 · v4 · submitted 2025-10-16 · 💻 cs.SE · cs.AI · cs.CL

E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

Pith reviewed 2026-05-18 06:43 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords End-to-End Software Development · LLM Benchmarking · Behavior-Driven Development · Automated Software Testing · Software Engineering · Large Language Models · Benchmark Construction

The pith

E2EDev benchmark shows LLMs and frameworks persistently struggle to produce software meeting real user needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents E2EDev, a benchmark for end-to-end software development that relies on Behavior-Driven Development to specify fine-grained user requirements along with executable test scenarios. These scenarios use Python step definitions to simulate actual user interactions and determine whether generated code fulfills the original needs. Evaluations across multiple E2ESD frameworks and LLM backbones reveal consistent failures on these tasks. Earlier benchmarks suffered from overly broad requirements and weak evaluation methods, so this approach supplies a clearer picture of current limitations. The results point to the necessity of developing more capable and efficient solutions for full software creation pipelines.

Core claim

By grounding evaluation in BDD principles, E2EDev supplies fine-grained requirements, multiple test scenarios with Python implementations for each, and a fully automated Behave-based testing pipeline; when applied to existing E2ESD frameworks and LLM backbones, the results demonstrate a persistent inability to generate software that satisfies user needs through realistic interaction tests.
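
To make the idea of scenarios backed by Python step implementations concrete, here is a minimal, standard-library-only sketch of how Given/When/Then lines can be matched to step functions. It is not Behave's API and not the benchmark's code: the `step` decorator, `run_scenario`, `FakeCalculator`, and the scenario text are all hypothetical, and a real E2EDev test would drive a browser against the generated application rather than an in-memory stand-in.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy registry of (pattern, function) pairs.
# Behave ships its own @given/@when/@then decorators; this only mimics the idea.
STEPS: List[Tuple[re.Pattern, Callable]] = []


def step(pattern: str) -> Callable:
    """Register a step function under a regular-expression pattern."""
    def decorator(func: Callable) -> Callable:
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator


class FakeCalculator:
    """In-memory stand-in for a generated web app (assumption for illustration)."""

    def __init__(self) -> None:
        self.display = ""

    def press(self, key: str) -> None:
        # "C" clears the display; any other key is appended.
        self.display = "" if key == "C" else self.display + key


@step(r'^Given the calculator is open$')
def open_calculator(ctx: Dict) -> None:
    ctx["app"] = FakeCalculator()


@step(r'^When the user presses "(.+)"$')
def press_key(ctx: Dict, key: str) -> None:
    ctx["app"].press(key)


@step(r'^Then the display shows "(.*)"$')
def check_display(ctx: Dict, expected: str) -> None:
    assert ctx["app"].display == expected, f"display is {ctx['app'].display!r}"


def run_scenario(lines: List[str]) -> bool:
    """Run each line against the registry; return False on a failed assertion."""
    ctx: Dict = {}
    for line in lines:
        text = line.strip()
        for pattern, func in STEPS:
            match = pattern.match(text)
            if match:
                try:
                    func(ctx, *match.groups())
                except AssertionError:
                    return False
                break
        else:
            # No registered pattern matched this line.
            raise LookupError(f"undefined step: {text}")
    return True


SCENARIO = [
    'Given the calculator is open',
    'When the user presses "7"',
    'When the user presses "C"',
    'Then the display shows ""',
]
passed = run_scenario(SCENARIO)
```

The point of the sketch is the separation of concerns: the scenario text states the requirement in user terms, while the step functions decide how that requirement is checked. Whether the check is faithful to the user's need depends entirely on the step code.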

What carries the argument

The E2EDev benchmark, built around BDD test scenarios with corresponding Python step implementations and an automated testing pipeline on the Behave framework, created via a Human-in-the-Loop Multi-Agent Annotation Framework to keep annotation effort low while maintaining quality.

If this is right

  • Existing E2ESD frameworks and LLM backbones are currently insufficient for reliable end-to-end development.
  • More effective and cost-efficient E2ESD solutions are required to overcome the observed limitations.
  • Fine-grained BDD-based testing provides a stricter and more reliable measure of capability than prior coarse-grained benchmarks.
  • Automated pipelines built on Behave can scale evaluation without manual intervention once the scenarios are defined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether decomposing tasks into smaller verified modules before LLM generation raises pass rates on the same BDD scenarios.
  • The benchmark setup suggests that combining LLM generation with formal specification tools might address gaps the current evaluations expose.
  • Extending the same BDD protocol to additional domains or languages would allow direct comparison of framework performance beyond the original test set.

Load-bearing premise

The BDD test scenarios and their Python step implementations accurately determine whether generated software meets user needs by mimicking real user interactions.

What would settle it

A single E2ESD framework or LLM backbone that passes a large majority of the E2EDev BDD tests across diverse requirements would directly contradict the reported persistent struggle; conversely, empirical evidence that software with high test-pass rates is still rejected by users in real deployments would invalidate the evaluation protocol.
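
One way to quantify "a large majority of tests" is to aggregate scenario outcomes per requirement. The sketch below is an assumption for illustration: the paper's own metric definitions are not reproduced on this page, so the names `strict_requirement_accuracy` and `soft_requirement_accuracy`, the input layout, and the example outcomes are all hypothetical.

```python
from typing import Dict, List


def strict_requirement_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of requirements for which every scenario passed (assumed definition)."""
    if not results:
        return 0.0
    satisfied = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return satisfied / len(results)


def soft_requirement_accuracy(results: Dict[str, List[bool]]) -> float:
    """Mean per-requirement fraction of passed scenarios (assumed definition)."""
    if not results:
        return 0.0
    fractions = [
        sum(outcomes) / len(outcomes) if outcomes else 0.0
        for outcomes in results.values()
    ]
    return sum(fractions) / len(fractions)


# Made-up outcomes: requirement id -> pass/fail of each of its scenarios.
example = {
    "REQ-001": [True, True, True],
    "REQ-002": [True, False],
    "REQ-003": [False, False],
    "REQ-004": [True],
}

strict = strict_requirement_accuracy(example)  # 2 of 4 requirements fully pass
soft = soft_requirement_accuracy(example)      # (1 + 0.5 + 0 + 1) / 4
```

Reporting both numbers separates "nearly right" software from software that misses a requirement entirely, which matters when deciding what pass rate would count as contradicting the reported struggle.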

Figures

Figures reproduced from arXiv: 2510.14509 by Chen Huang, Jingyao Liu, Wenqiang Lei, Yang Deng, Zhizhao Guan.

Figure 1: Overview of E2EDev: a dataset and BDD-based automated evaluation pipeline for E2ESD tasks. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2: Comparing existing work with ours, existing benchmarks use coarse-grained requirements with [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 3: HITL-MAA framework for semi-automated dataset construction. Given source code, the framework [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]
Figure 4: Soft Req. Acc. and Req. Acc. under three representative LLMs. Here, V-LLM refers to the Vanilla [PITH_FULL_IMAGE:figures/full_fig_p009_4.png]
Figure 5: Comparisons across agentic frameworks and [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]
Figure 6: Requirement-level error distribution across different frameworks. Pie segments show error propor [PITH_FULL_IMAGE:figures/full_fig_p011_6.png]
Figure 7: Error distribution of MetaGPT. Communication breakdowns within its multi-agent architecture significantly impair code consistency and requirement fidelity. As seen in [PITH_FULL_IMAGE:figures/full_fig_p012_7.png]
Figure 8: This figure illustrates the features of E2EDev, which encompasses seven common types of web [PITH_FULL_IMAGE:figures/full_fig_p026_8.png]
Figure 9: Number of communications in each phase of ChatDev across different backbone models. [PITH_FULL_IMAGE:figures/full_fig_p033_9.png]
Figure 10: BDD-based automated evaluation pipeline for E2ESD tasks. [PITH_FULL_IMAGE:figures/full_fig_p037_10.png]
Figure 11: Example directory structure for Behave testing with E2ESD. [PITH_FULL_IMAGE:figures/full_fig_p039_11.png]
Original abstract

The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces E2EDev, a benchmark for assessing large language models and frameworks in end-to-end software development (E2ESD) tasks. It is based on Behavior-Driven Development (BDD) principles, consisting of fine-grained user requirements, BDD test scenarios with corresponding Python step implementations, and an automated testing pipeline using the Behave framework. The benchmark is created using a proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA) to balance quality and annotation effort. Through evaluations, the paper finds that existing E2ESD frameworks and LLM backbones persistently struggle to solve the tasks effectively, calling for more advanced and cost-efficient solutions. The benchmark and code are publicly released on GitHub.

Significance. If the BDD-based evaluation protocol reliably measures whether generated software meets user needs by simulating real interactions, the benchmark would offer a substantial improvement over prior coarse-grained E2ESD benchmarks. The public availability of the benchmark and codebase supports reproducibility and community use. The findings underscore current limitations in the field, which could drive research towards better E2ESD systems. The use of HITL-MAA for efficient annotation is a positive aspect for benchmark construction.

major comments (2)
  1. [Evaluation Protocol] The central claim that current E2ESD frameworks persistently struggle to solve tasks rests on the BDD test scenarios and Python step implementations serving as a faithful proxy for whether generated software meets user needs via mimicking real user interactions. No section reports an independent validation (e.g., human judgment of requirement satisfaction correlated against test outcomes) to confirm the proxy holds. If the step definitions are under-specified or fail to exercise key runtime behaviors, low pass rates could reflect test weakness rather than framework inadequacy, directly affecting the strength of the main finding.
  2. [Results and Analysis] The results section reports evaluations across frameworks and LLM backbones but does not provide sufficient detail on exact pass rates per task, comparisons against human baselines or simpler non-LLM approaches, or controls for data splits and metric definitions. This makes it hard to determine whether the 'persistent struggle' conclusion is robust or sensitive to post-hoc analysis choices.
minor comments (2)
  1. [Abstract] The abstract mentions Figure 1 but does not summarize the scale of the benchmark (e.g., number of requirements or total scenarios), which would help readers quickly gauge its scope.
  2. [Benchmark Construction] Notation for the HITL-MAA framework components could be clarified with a small diagram or explicit pseudocode to distinguish agent roles from human oversight steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

  1. Referee: [Evaluation Protocol] The central claim that current E2ESD frameworks persistently struggle to solve tasks rests on the BDD test scenarios and Python step implementations serving as a faithful proxy for whether generated software meets user needs via mimicking real user interactions. No section reports an independent validation (e.g., human judgment of requirement satisfaction correlated against test outcomes) to confirm the proxy holds. If the step definitions are under-specified or fail to exercise key runtime behaviors, low pass rates could reflect test weakness rather than framework inadequacy, directly affecting the strength of the main finding.

    Authors: We appreciate the referee highlighting the importance of validating our evaluation proxy. The BDD scenarios and step implementations were developed through the HITL-MAA process specifically to align with fine-grained user requirements and exercise core runtime behaviors via the Behave framework. While the original manuscript did not include a separate human correlation study, we agree this would strengthen confidence in the proxy. In the revision we will add a dedicated subsection describing the scenario design rationale and report results from a small-scale human evaluation correlating test pass rates with expert judgments of requirement satisfaction. revision: yes

  2. Referee: [Results and Analysis] The results section reports evaluations across frameworks and LLM backbones but does not provide sufficient detail on exact pass rates per task, comparisons against human baselines or simpler non-LLM approaches, or controls for data splits and metric definitions. This makes it hard to determine whether the 'persistent struggle' conclusion is robust or sensitive to post-hoc analysis choices.

    Authors: We agree that greater granularity and transparency in the results would help readers assess robustness. We will expand the results section to include a table of exact per-task pass rates. We will also add comparisons against simpler non-LLM baselines (e.g., template-based generation) where they provide useful context, and we will explicitly state that the benchmark uses a fixed task set with no data splits. Metric definitions and any controls will be clarified in the revised text. These additions address the concern about sensitivity to analysis choices. revision: yes
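
The human evaluation discussed in the first response amounts to checking how well automated verdicts agree with expert verdicts on the same requirements. A minimal sketch of one common agreement statistic, Cohen's kappa for two binary raters, is given below. The choice of kappa, the function name `cohens_kappa`, and the example labels are assumptions made for illustration; the design of the authors' planned study is not described on this page.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters giving binary labels (1 = satisfied, 0 = not)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters give the same label.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement from each rater's marginal rate of positive labels.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for eight requirements: automated test verdict vs. expert verdict.
test_verdicts = [1, 1, 0, 0, 1, 0, 1, 0]
human_verdicts = [1, 1, 0, 1, 1, 0, 0, 0]

kappa = cohens_kappa(test_verdicts, human_verdicts)
```

Kappa corrects raw agreement for agreement expected by chance, which is useful when most requirements fail and two raters would agree often simply by both saying "not satisfied".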

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and evaluation


The paper introduces E2EDev as a new BDD-grounded benchmark for end-to-end software development tasks, built via the HITL-MAA annotation process and evaluated through an automated Behave pipeline on multiple LLM frameworks. No derivation chain, equations, fitted parameters presented as predictions, or load-bearing self-citations exist in the provided text. The central claim of persistent struggles by current frameworks rests on direct experimental outcomes from the benchmark rather than any reduction to its own inputs by construction. This is a standard self-contained empirical proposal with public artifacts, warranting no circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard software engineering assumptions about BDD reflecting user needs and automated tests being reliable proxies for functionality; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Behavior-Driven Development test scenarios accurately capture whether software meets user needs through real user interactions.
    Invoked in the abstract when describing how the benchmark evaluates generated software.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

    cs.AI 2026-05 unverdicted novelty 7.0

    WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...
