arxiv: 2604.27148 · v2 · submitted 2026-04-29 · 💻 cs.SE

CI-Repair-Bench: A Repository-Aware Benchmark for Automated Patch Validation via CI Workflows

Pith reviewed 2026-05-07 09:12 UTC · model grok-4.3

classification 💻 cs.SE
keywords continuous integration · program repair · benchmark · GitHub Actions · patch validation · automated repair · CI failures · repository-level repair

The pith

CI-Repair-Bench evaluates automated program repairs by requiring patches to pass full re-execution of original continuous integration workflows from real repositories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a benchmark of 567 CI failure instances drawn from 103 GitHub repositories to measure whether generated patches restore the original multi-stage validation process. Unlike earlier repair datasets that rely on unit tests or isolated code changes in fixed environments, this collection captures failures involving dependencies, configurations, and non-source artifacts. The authors supply a reference workflow that parses logs, localizes faults, and produces candidate fixes, then re-runs each patch through the unmodified CI pipeline. Results indicate higher success on localized, tool-enforced errors such as formatting, while environment and dependency failures remain harder for current large language models.

Core claim

CI-Repair-Bench supplies 567 repository-level CI failure cases across twelve error categories and requires every candidate patch to be judged correct only if it passes complete re-execution under the original GitHub Actions workflow; when a reference log-analysis and patch-generation method is applied, the strongest model reaches an 18.9 percent success rate, with performance varying sharply by error type.

What carries the argument

CI-Repair-Bench, a collection of real CI failure instances that enforces correctness by re-executing each patched repository against its unmodified original workflow.
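The acceptance rule described above can be sketched in a few lines, assuming each workflow re-execution is summarized as a mapping from job name to a conclusion string. The function name `patch_is_accepted` and the sample job names are hypothetical. The conclusion strings follow the vocabulary GitHub Actions reports for jobs, and treating `skipped` as non-blocking is an assumption of this sketch, not something the paper specifies.

```python
from typing import Mapping

# Conclusions that do not block acceptance (an assumption of this sketch).
NON_BLOCKING = {"success", "skipped"}


def patch_is_accepted(job_conclusions: Mapping[str, str]) -> bool:
    """Accept a patch only if the re-executed workflow has no blocking job.

    At least one job must have concluded with "success", so a run in which
    nothing executed (or everything was skipped) is not counted as a repair.
    """
    conclusions = list(job_conclusions.values())
    if "success" not in conclusions:
        return False
    return all(c in NON_BLOCKING for c in conclusions)


# Hypothetical re-execution results for two candidate patches.
lint_only_fix = {"lint": "success", "unit-tests": "failure", "build": "success"}
full_fix = {"lint": "success", "unit-tests": "success", "build": "success"}

print(patch_is_accepted(lint_only_fix))  # False: one stage still fails
print(patch_is_accepted(full_fix))       # True: every stage passes
```

The point of the sketch is that a patch fixing only the originally failing stage is still rejected if any other stage of the workflow fails.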

If this is right

  • Repair methods must address non-code artifacts and environment issues in addition to source changes.
  • Success metrics for automated repair should incorporate full workflow re-execution rather than test-suite passage alone.
  • Error-type breakdowns allow targeted improvement on categories such as formatting versus dependency resolution.
  • Current large-language-model approaches achieve limited but measurable effectiveness on localized CI failures.
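An error-type breakdown of the kind mentioned above could be computed as in the following sketch. It assumes each evaluated instance is recorded as an `(error_type, repaired)` pair. The function name, the two category labels, and the outcomes are made up for illustration and are not taken from the paper's results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_type(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of repaired instances for each error type."""
    totals: Dict[str, int] = defaultdict(int)
    repaired_counts: Dict[str, int] = defaultdict(int)
    for error_type, repaired in results:
        totals[error_type] += 1
        repaired_counts[error_type] += int(repaired)
    return {t: repaired_counts[t] / totals[t] for t in totals}


# Illustrative outcomes only; these are not benchmark data.
sample = [
    ("formatting", True), ("formatting", True), ("formatting", False),
    ("dependency", False), ("dependency", False),
    ("dependency", True), ("dependency", False),
]

for error_type, rate in success_rate_by_type(sample).items():
    print(f"{error_type}: {rate:.2f}")
```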

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could embed similar re-execution checks inside their own CI pipelines to validate proposed patches before merge.
  • The benchmark design could be extended to other continuous-integration systems to expose platform-specific repair difficulties.
  • Hybrid techniques that combine log analysis with simulated environment setup may raise success rates beyond the observed 18.9 percent.

Load-bearing premise

The 567 selected failures from 103 repositories stand in for the broader range of CI problems that arise in typical software projects.

What would settle it

If a new collection of CI failures, assembled from a larger or differently sampled set of repositories, produced markedly higher or lower repair success rates under the same patch-generation method, that would show the current benchmark selection is not representative.

Figures

Figures reproduced from arXiv: 2604.27148 by Md Nakhla Rafi, Rabeya Khatun Muna, Tse-Hsun (Peter) Chen.

Figure 1: Overview of CI-Repair-Bench. … failures and evaluating candidate repairs. (4) Failure-to-Success Commit Pair, consisting of a failing commit and its nearest subsequent passing commit on the same branch under the same CI workflow; (5) Ground-Truth Patch: a repair-relevant patch with the minimal set of changes necessary to resolve the CI failure, derived from the commit pair by retaining only causally related mod…
Figure 2: Example CI-Repair-Bench instance illustrating repository metadata, workflow configuration, CI failure signal, and the validated ground-truth patch. 4 Reference CI Repair Framework. We define a reference CI repair framework that operationalizes CI-Repair-Bench for evaluating automated CI repair under realistic validation conditions. Given a failing benchmark instance consisting of the repository state at t…
Figure 3: Overview of the LLM-based CI repair pipeline used for evaluation in …
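The text accompanying Figure 1 defines a failure-to-success commit pair as a failing commit and its nearest subsequent passing commit on the same branch under the same CI workflow. A minimal sketch of that selection rule follows. The `CIRun` record, its field names, and the sample history are hypothetical, and the paper may apply further filters that are not visible in the truncated excerpt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CIRun:
    sha: str        # commit identifier
    branch: str
    workflow: str   # workflow file the run belongs to
    order: int      # position in the branch history; larger means later
    passed: bool


def nearest_passing_successor(failing: CIRun, runs: List[CIRun]) -> Optional[CIRun]:
    """Return the earliest later run that passed on the same branch and workflow."""
    candidates = [
        r for r in runs
        if r.passed
        and r.branch == failing.branch
        and r.workflow == failing.workflow
        and r.order > failing.order
    ]
    return min(candidates, key=lambda r: r.order, default=None)


# Hypothetical run history.
runs = [
    CIRun("a1", "main", "ci.yml", 1, True),
    CIRun("b2", "main", "ci.yml", 2, False),   # the failing commit
    CIRun("c3", "main", "docs.yml", 3, True),  # different workflow, ignored
    CIRun("d4", "dev", "ci.yml", 4, True),     # different branch, ignored
    CIRun("e5", "main", "ci.yml", 5, True),    # nearest valid successor
    CIRun("f6", "main", "ci.yml", 6, True),
]

pair = nearest_passing_successor(runs[1], runs)
print(pair.sha if pair else None)  # e5
```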
Original abstract

Continuous Integration (CI) enforces repository-level correctness through multi-stage workflows and is central to modern software development, yet diagnosing and repairing CI failures remains challenging. Unlike traditional program repair, CI failures frequently involve non-code artifacts, environment and dependency issues, noisy execution logs, and workflow-level constraints. Existing program repair benchmarks fall short in this setting: they are largely test-centric, restrict repairs to source code, assume fixed execution environments, and evaluate under simplified CI workflows that do not reflect real repository-level validation. We introduce CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories and evaluates repair correctness exclusively through full CI re-execution under original workflows. Failures are categorized into 12 CI error types, enabling fine-grained, error-type-aware evaluation. To demonstrate benchmark usage, we include a reference CI repair workflow that analyzes CI logs to localize faults and generate candidate patches. Empirical results show that automated repair is most effective for localized, tool-enforced failures such as formatting and linting, while environment, dependency, and configuration-related failures remain challenging; the best-performing LLM achieves an 18.9% repair success rate. CI-Repair-Bench provides a realistic evaluation foundation for advancing research on CI-native automated program repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories, categorized into 12 error types. Repair correctness is evaluated exclusively through full re-execution of patches under the original workflows rather than simplified tests. A reference repair workflow using LLMs for log analysis, fault localization, and patch generation is provided, with empirical results showing the best LLM achieving an 18.9% success rate, performing better on localized issues like formatting/linting than on environment/dependency failures.

Significance. If the benchmark instances prove representative and the CI re-execution evaluation is shown to be robust, this work offers a realistic foundation for repository-level repair research that goes beyond test-centric benchmarks. The use of independent real-world CI workflows as ground truth, the error-type categorization, and the inclusion of a reference workflow are strengths that could enable more targeted advances on challenging CI issues.

major comments (3)
  1. Abstract and benchmark construction: The description of the 567 instances provides no details on selection criteria, filtering process, or analysis of potential selection bias across the 103 repositories; this directly affects the generalizability of the reported 18.9% success rate and the claim that the benchmark reflects typical real-world CI issues.
  2. Evaluation section (empirical results): The central claim that full CI re-execution under original workflows provides an unbiased and complete correctness signal is not supported by any reported controls for workflow non-determinism, such as multiple re-executions per patch, flakiness detection, or handling of cached artifacts/external dependencies; a single run per patch risks noise that could inflate or deflate the 18.9% figure.
  3. Results and discussion: No statistical tests (e.g., confidence intervals or significance testing) are mentioned around the 18.9% rate or the per-error-type breakdowns, which weakens the fine-grained claims about which error types are more amenable to automated repair.
minor comments (1)
  1. The abstract and introduction could more explicitly contrast the benchmark against prior program repair datasets (e.g., by citing specific limitations of test-centric benchmarks) to strengthen the motivation.
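The non-determinism control requested in major comment 2 could take the form of repeated re-executions per patch, with disagreement between runs flagged as flakiness. The sketch below is one possible rule and is not the paper's procedure. The function name and the three labels are hypothetical, and the number of repetitions is left to the caller.

```python
from typing import Sequence


def classify_patch(outcomes: Sequence[bool]) -> str:
    """Classify a patch from repeated CI re-executions.

    `outcomes` holds one boolean per re-execution (True means the whole
    workflow passed). Unanimous results are trusted; mixed results are
    reported as flaky instead of being counted as success or failure.
    """
    if not outcomes:
        raise ValueError("at least one re-execution is required")
    if all(outcomes):
        return "repaired"
    if not any(outcomes):
        return "not_repaired"
    return "flaky"


print(classify_patch([True, True, True]))     # repaired
print(classify_patch([False, False, False]))  # not_repaired
print(classify_patch([True, False, True]))    # flaky
```

Reporting flaky instances as their own class keeps them from silently inflating or deflating a headline success rate.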

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our paper. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract and benchmark construction: The description of the 567 instances provides no details on selection criteria, filtering process, or analysis of potential selection bias across the 103 repositories; this directly affects the generalizability of the reported 18.9% success rate and the claim that the benchmark reflects typical real-world CI issues.

    Authors: We agree that additional details on the benchmark construction are necessary to support claims of representativeness. While the manuscript describes the collection of CI failure instances from GitHub Actions workflows across 103 repositories and their categorization into 12 error types, we did not provide a dedicated analysis of selection criteria or potential biases. In the revised version, we will expand Section 3 to include explicit selection criteria (e.g., repositories with active CI, failure logs containing specific error patterns), filtering steps (such as excluding incomplete workflows or non-reproducible failures), and an analysis of repository characteristics (e.g., size, language distribution) to address generalizability concerns. This will better contextualize the 18.9% success rate. revision: yes

  2. Referee: Evaluation section (empirical results): The central claim that full CI re-execution under original workflows provides an unbiased and complete correctness signal is not supported by any reported controls for workflow non-determinism, such as multiple re-executions per patch, flakiness detection, or handling of cached artifacts/external dependencies; a single run per patch risks noise that could inflate or deflate the 18.9% figure.

    Authors: This point highlights an important limitation in our evaluation methodology. Our approach prioritizes realism by re-executing patches in the exact original CI environment and workflow, which we believe provides a more complete signal than simplified test suites. However, we did not implement controls such as repeated executions to detect flakiness or account for non-determinism from external dependencies and caches. We will revise the evaluation section to explicitly discuss this limitation, including potential impacts on the reported success rates, and suggest best practices for benchmark users, such as running evaluations multiple times where feasible. We maintain that for many deterministic error types (e.g., linting failures), the single-run evaluation is reliable, but acknowledge the risk for others. revision: partial

  3. Referee: Results and discussion: No statistical tests (e.g., confidence intervals or significance testing) are mentioned around the 18.9% rate or the per-error-type breakdowns, which weakens the fine-grained claims about which error types are more amenable to automated repair.

    Authors: We concur that the absence of statistical analysis limits the strength of our claims regarding differential performance across error types. In the revised manuscript, we will add confidence intervals (using the Wilson score interval for binomial proportions) for the overall 18.9% success rate and for each of the 12 error types. Additionally, we will include a brief discussion of the statistical significance of observed differences where appropriate, to better support the fine-grained analysis. revision: yes
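The Wilson score interval named in the third response can be computed as in the sketch below. The helper `wilson_interval` is hypothetical. The count of 107 successes out of 567 is an assumption obtained by rounding 18.9% of 567; the paper's actual numerator and denominator for the best-performing model are not given in this page.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    The default z is the two-sided 95% normal quantile.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts: 107 of 567 is roughly 18.9%; not taken from the paper.
low, high = wilson_interval(107, 567)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under those assumed counts the interval is roughly 15.9% to 22.3%. Per-error-type intervals would be wider, because each category contains far fewer than 567 instances.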

Circularity Check

0 steps flagged

No significant circularity; evaluation uses independent external CI re-executions

full rationale

The paper defines CI-Repair-Bench from real GitHub Actions logs and measures patch success solely by re-running the original workflows on the patched repositories. This ground truth is external to any repair algorithm, fitted parameters, or self-derived quantities. No equations, predictions, or self-citations reduce the reported 18.9% success rate or the benchmark construction to a tautology. The derivation chain (log analysis → patch generation → CI re-execution) remains self-contained against an independent oracle.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the representativeness of the collected failures and the validity of full CI re-execution as ground truth, with no free parameters or invented entities.

axioms (2)
  • domain assumption: The 567 CI failure instances selected from 103 repositories are representative of real-world CI issues.
    Invoked to support fine-grained evaluation across 12 error types.
  • domain assumption: Re-executing the original CI workflows provides a reliable and complete validation of patch correctness.
    This is the core evaluation criterion stated in the abstract.


