CI-Repair-Bench: A Repository-Aware Benchmark for Automated Patch Validation via CI Workflows
Pith reviewed 2026-05-07 09:12 UTC · model grok-4.3
The pith
CI-Repair-Bench evaluates automated program repairs by requiring patches to pass full re-execution of original continuous integration workflows from real repositories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CI-Repair-Bench supplies 567 repository-level CI failure cases across twelve error categories and judges a candidate patch correct only if it passes complete re-execution under the original GitHub Actions workflow. When a reference log-analysis and patch-generation method is applied, the strongest model reaches an 18.9 percent success rate, with performance varying sharply by error type.
What carries the argument
CI-Repair-Bench, a collection of real CI failure instances that enforces correctness by re-executing each patched repository against its unmodified original workflow.
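The acceptance rule described here can be illustrated with a minimal sketch. The `JobResult` record, the `patch_passes_ci` helper, and the decision to ignore skipped jobs are all assumptions made for illustration; the paper's actual harness is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    """Hypothetical record of one job from a re-executed workflow run."""
    name: str
    conclusion: str  # e.g. "success", "failure", "cancelled", "skipped"


def patch_passes_ci(jobs: list[JobResult]) -> bool:
    """Accept a patch only if every executed job in the workflow succeeded.

    Assumption: skipped jobs are ignored, and a run with no executed
    jobs counts as a failure because nothing was validated.
    """
    executed = [job for job in jobs if job.conclusion != "skipped"]
    return bool(executed) and all(job.conclusion == "success" for job in executed)


# Toy run: two jobs succeed and one is skipped by the workflow itself.
run = [
    JobResult("lint", "success"),
    JobResult("test", "success"),
    JobResult("docs", "skipped"),
]
print(patch_passes_ci(run))
```

The point of the all-jobs rule is that a patch which fixes the failing test but breaks a lint or build stage is still rejected, which is the difference from test-suite-only evaluation.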
If this is right
- Repair methods must address non-code artifacts and environment issues in addition to source changes.
- Success metrics for automated repair should incorporate full workflow re-execution rather than test-suite passage alone.
- Error-type breakdowns allow targeted improvement on categories such as formatting versus dependency resolution.
- Current large-language-model approaches achieve limited but measurable effectiveness on localized CI failures.
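The error-type breakdown mentioned in the list above amounts to grouping pass/fail outcomes by category. The sketch below assumes outcomes are available as `(error_type, passed)` pairs; the function name and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def success_rate_by_type(outcomes):
    """Map each error type to (passed, total, rate).

    `outcomes` is an iterable of (error_type, passed) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # error_type -> [passed, total]
    for error_type, passed in outcomes:
        counts[error_type][1] += 1
        if passed:
            counts[error_type][0] += 1
    return {t: (p, n, p / n) for t, (p, n) in counts.items()}


# Made-up toy outcomes, for illustration only.
toy_outcomes = [
    ("formatting", True),
    ("formatting", True),
    ("formatting", False),
    ("dependency", False),
    ("dependency", True),
    ("dependency", False),
    ("dependency", False),
]

for error_type, (passed, total, rate) in success_rate_by_type(toy_outcomes).items():
    print(f"{error_type}: {passed}/{total} = {rate:.1%}")
```

Reporting the raw counts next to each rate matters here, since some categories may contain few instances.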
Where Pith is reading between the lines
- Teams could embed similar re-execution checks inside their own CI pipelines to validate proposed patches before merge.
- The benchmark design could be extended to other continuous-integration systems to expose platform-specific repair difficulties.
- Hybrid techniques that combine log analysis with simulated environment setup may raise success rates beyond the observed 18.9 percent.
Load-bearing premise
The 567 selected failures from 103 repositories stand in for the broader range of CI problems that arise in typical software projects.
What would settle it
A new collection of CI failures, assembled from a larger or differently sampled set of repositories and evaluated with the same patch-generation method, would settle it: markedly higher or lower repair success rates would show that the current benchmark selection is not representative.
Original abstract
Continuous Integration (CI) enforces repository-level correctness through multi-stage workflows and is central to modern software development, yet diagnosing and repairing CI failures remains challenging. Unlike traditional program repair, CI failures frequently involve non-code artifacts, environment and dependency issues, noisy execution logs, and workflow-level constraints. Existing program repair benchmarks fall short in this setting: they are largely test-centric, restrict repairs to source code, assume fixed execution environments, and evaluate under simplified CI workflows that do not reflect real repository-level validation. We introduce CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories and evaluates repair correctness exclusively through full CI re-execution under original workflows. Failures are categorized into 12 CI error types, enabling fine-grained, error-type-aware evaluation. To demonstrate benchmark usage, we include a reference CI repair workflow that analyzes CI logs to localize faults and generate candidate patches. Empirical results show that automated repair is most effective for localized, tool-enforced failures such as formatting and linting, while environment, dependency, and configuration-related failures remain challenging; the best-performing LLM achieves an 18.9% repair success rate. CI-Repair-Bench provides a realistic evaluation foundation for advancing research on CI-native automated program repair.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CI-Repair-Bench, a benchmark for CI-verified, repository-level program repair constructed from real GitHub Actions executions. It contains 567 CI failure instances from 103 repositories, categorized into 12 error types. Repair correctness is evaluated exclusively through full re-execution of patches under the original workflows rather than simplified tests. A reference repair workflow using LLMs for log analysis, fault localization, and patch generation is provided, with empirical results showing the best LLM achieving an 18.9% success rate, performing better on localized issues like formatting/linting than on environment/dependency failures.
Significance. If the benchmark instances prove representative and the CI re-execution evaluation is shown to be robust, this work offers a realistic foundation for repository-level repair research that goes beyond test-centric benchmarks. The use of independent real-world CI workflows as ground truth, the error-type categorization, and the inclusion of a reference workflow are strengths that could enable more targeted advances on challenging CI issues.
major comments (3)
- Abstract and benchmark construction: The description of the 567 instances provides no details on selection criteria, filtering process, or analysis of potential selection bias across the 103 repositories; this directly affects the generalizability of the reported 18.9% success rate and the claim that the benchmark reflects typical real-world CI issues.
- Evaluation section (empirical results): The central claim that full CI re-execution under original workflows provides an unbiased and complete correctness signal is not supported by any reported controls for workflow non-determinism, such as multiple re-executions per patch, flakiness detection, or handling of cached artifacts/external dependencies; a single run per patch risks noise that could inflate or deflate the 18.9% figure.
- Results and discussion: No statistical tests (e.g., confidence intervals or significance testing) are mentioned around the 18.9% rate or the per-error-type breakdowns, which weakens the fine-grained claims about which error types are more amenable to automated repair.
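The non-determinism controls raised in the second major comment (multiple re-executions per patch, flakiness detection) could be sketched as follows. The `classify_patch` helper and its three-way verdict are assumptions for illustration; `run_ci` stands in for whatever triggers one full workflow re-execution and reports pass or fail.

```python
from typing import Callable


def classify_patch(run_ci: Callable[[], bool], repeats: int = 3) -> str:
    """Re-execute the CI check several times and classify the patch.

    Returns "pass" if every run passes, "fail" if every run fails,
    and "flaky" if the runs disagree.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    results = [run_ci() for _ in range(repeats)]
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    return "flaky"


# Deterministic stub standing in for three real workflow re-executions.
stub_outcomes = iter([True, False, True])
print(classify_patch(lambda: next(stub_outcomes)))  # flaky
```

Keeping "flaky" as a separate verdict, rather than folding it into pass or fail, lets a benchmark report how much of a success rate depends on unstable runs.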
minor comments (1)
- The abstract and introduction could more explicitly contrast the benchmark against prior program repair datasets (e.g., by citing specific limitations of test-centric benchmarks) to strengthen the motivation.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments on our paper. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.
- Referee: Abstract and benchmark construction: The description of the 567 instances provides no details on selection criteria, filtering process, or analysis of potential selection bias across the 103 repositories; this directly affects the generalizability of the reported 18.9% success rate and the claim that the benchmark reflects typical real-world CI issues.
  Authors: We agree that additional details on the benchmark construction are necessary to support claims of representativeness. While the manuscript describes the collection of CI failure instances from GitHub Actions workflows across 103 repositories and their categorization into 12 error types, we did not provide a dedicated analysis of selection criteria or potential biases. In the revised version, we will expand Section 3 to include explicit selection criteria (e.g., repositories with active CI, failure logs containing specific error patterns), filtering steps (such as excluding incomplete workflows or non-reproducible failures), and an analysis of repository characteristics (e.g., size, language distribution) to address generalizability concerns. This will better contextualize the 18.9% success rate.
  Revision: yes
- Referee: Evaluation section (empirical results): The central claim that full CI re-execution under original workflows provides an unbiased and complete correctness signal is not supported by any reported controls for workflow non-determinism, such as multiple re-executions per patch, flakiness detection, or handling of cached artifacts/external dependencies; a single run per patch risks noise that could inflate or deflate the 18.9% figure.
  Authors: This point highlights an important limitation in our evaluation methodology. Our approach prioritizes realism by re-executing patches in the exact original CI environment and workflow, which we believe provides a more complete signal than simplified test suites. However, we did not implement controls such as repeated executions to detect flakiness or account for non-determinism from external dependencies and caches. We will revise the evaluation section to explicitly discuss this limitation, including potential impacts on the reported success rates, and suggest best practices for benchmark users, such as running evaluations multiple times where feasible. We maintain that for many deterministic error types (e.g., linting failures), the single-run evaluation is reliable, but acknowledge the risk for others.
  Revision: partial
- Referee: Results and discussion: No statistical tests (e.g., confidence intervals or significance testing) are mentioned around the 18.9% rate or the per-error-type breakdowns, which weakens the fine-grained claims about which error types are more amenable to automated repair.
  Authors: We concur that the absence of statistical analysis limits the strength of our claims regarding differential performance across error types. In the revised manuscript, we will add confidence intervals (using the Wilson score interval for binomial proportions) for the overall 18.9% success rate and for each of the 12 error types. Additionally, we will include a brief discussion of the statistical significance of observed differences where appropriate, to better support the fine-grained analysis.
  Revision: yes
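The Wilson score interval that the authors propose can be computed directly from a success count and a sample size. The sketch below uses the standard formula with z = 1.96 for an approximate 95% interval. The count of 107 successes is an assumption: this page reports only the 18.9% rate and the 567 instances, and 107/567 is one count that rounds to that rate.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# ASSUMED_SUCCESSES is hypothetical: chosen only because 107/567 rounds to 18.9%.
ASSUMED_SUCCESSES = 107
BENCHMARK_SIZE = 567

low, high = wilson_interval(ASSUMED_SUCCESSES, BENCHMARK_SIZE)
print(f"95% Wilson interval: {low:.3f} to {high:.3f}")
```

The same function applies to each of the 12 error types. The Wilson interval is usually preferred over the simple normal approximation for small categories because it stays within [0, 1] and behaves sensibly when the observed rate is near zero.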
Circularity Check
No significant circularity; evaluation uses independent external CI re-executions
The paper defines CI-Repair-Bench from real GitHub Actions logs and measures patch success solely by re-running the original workflows on the patched repositories. This ground truth is external to any repair algorithm, fitted parameters, or self-derived quantities. No equations, predictions, or self-citations reduce the reported 18.9% success rate or the benchmark construction to a tautology. The derivation chain (log analysis → patch generation → CI re-execution) remains self-contained against an independent oracle.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The 567 CI failure instances selected from 103 repositories are representative of real-world CI issues.
- Domain assumption: Re-executing the original CI workflows provides a reliable and complete validation of patch correctness.