ComPass: Contrastive Learning for Automated Patch Correctness Assessment in Program Repair

Chunrong Fang; Haichuan Hu; Liang Xiao; Quanjun Zhang; Ye Shang; Zhenyu Chen

arxiv: 2602.07561 · v2 · submitted 2026-02-07 · 💻 cs.SE

Quanjun Zhang, Ye Shang, Haichuan Hu, Chunrong Fang, Zhenyu Chen, Liang Xiao

Pith reviewed 2026-05-16 06:14 UTC · model grok-4.3

classification 💻 cs.SE

keywords automated program repair · patch correctness assessment · contrastive learning · pre-trained language models · code transformations · patch overfitting · software maintenance · Defects4J

The pith

ComPass pre-trains language models with contrastive learning on transformed code snippets to assess whether automated repair patches are truly correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Automated program repair tools often generate patches that pass existing tests yet fail to fix the underlying bug correctly. ComPass addresses this overfitting problem by generating semantically equivalent code variants through transformation rules and using them to pre-train models via contrastive learning, which teaches the model to recognize features shared across structurally different but functionally identical snippets. The approach then fine-tunes the model jointly on patch representations and a binary classifier. This yields an accuracy of 88.35 percent on 2274 real-world patches drawn from Defects4J, exceeding prior methods that were constrained by limited labeled data and weaker training signals. A reader would care because reliable automatic verification could cut the manual effort required to validate and deploy repairs in large codebases.

Core claim

ComPass further pre-trains existing pre-trained language models (PLMs) with contrastive learning on pairs of semantic-preserving code snippets produced by code transformation rules, then integrates patch embeddings and fine-tunes the models jointly with a binary classifier to determine patch correctness. It reports 88.35 percent accuracy on 2274 Defects4J patches, outperforming the prior state-of-the-art APPT baseline.
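The final classification step can be pictured with a small sketch. The abstract does not say how the patch embeddings are integrated or what the classifier looks like, so everything below is an assumption made for illustration: the `integrate` function (concatenation plus element-wise difference), the single-layer sigmoid head, and the toy vectors are all hypothetical, and only the forward pass and loss are shown. In joint fine-tuning the same loss would also update the encoder that produces the embeddings.

```python
import math

def integrate(buggy_emb, patched_emb):
    """Hypothetical integration: both embeddings plus their difference."""
    diff = [p - b for b, p in zip(buggy_emb, patched_emb)]
    return buggy_emb + patched_emb + diff

def predict_correct(features, weights, bias):
    """Sigmoid probability that a patch is correct (assumed linear head)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(prob, label):
    """Loss for one patch; label is 1 for correct, 0 for overfitting."""
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(prob + eps)
             + (1 - label) * math.log(1 - prob + eps))

# Toy embeddings (made-up values, not produced by any real model).
buggy = [0.2, 0.4]
patched = [0.5, 0.1]

features = integrate(buggy, patched)
untrained_prob = predict_correct(features, [0.0] * len(features), 0.0)
untrained_loss = binary_cross_entropy(untrained_prob, 1)
```

With all-zero weights the head is uninformative, so `untrained_prob` is 0.5 and the loss equals ln 2, the usual starting point before any fine-tuning signal is applied.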

What carries the argument

Contrastive learning applied to code transformation rules that generate positive pairs of semantically equivalent but syntactically varied snippets, enabling the model to learn features invariant to structure yet indicative of patch correctness.
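A minimal sketch of the contrastive objective may help make this concrete. The abstract does not name the loss, so the InfoNCE form with in-batch negatives used here is an assumption, as are the function names, the temperature value, and the toy embeddings. Each anchor embedding is paired with the embedding of its transformed variant, and the variants of the other snippets in the batch act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] embed two variants of the same snippet;
    positives[j] with j != i serve as in-batch negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)

# Toy embeddings (hypothetical values, not taken from the paper).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # each variant close to its anchor
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # variants swapped between anchors

loss_aligned = info_nce(anchors, aligned)
loss_shuffled = info_nce(anchors, shuffled)
```

The loss is small when each snippet sits near its own transformed variant and large when it sits nearer to another snippet's variant, which is the pressure that pushes the encoder toward structure-invariant features.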

If this is right

Large unlabeled code corpora become usable for pre-training without requiring extensive manual labels for every patch.
Joint representation learning and classification improves the model's ability to reason about semantic equivalence in repair contexts.
The method scales assessment to thousands of patches while maintaining higher accuracy than previous PLM-based baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transformation-plus-contrastive pipeline could be tested on other code understanding tasks such as clone detection or semantic bug localization.
If the transformations prove robust across languages, the technique might transfer to non-Java repair datasets without retraining from scratch.
Combining the learned representations with test-execution feedback could further reduce remaining false positives in deployed repair tools.

Load-bearing premise

The code transformation rules produce pairs whose contrastive signals align with actual semantic correctness of patches rather than superficial syntactic traits.
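One way to probe this premise is differential testing: run a snippet and its transformed variant on the same inputs and look for disagreement. The sketch below is illustrative only. It uses Python functions rather than the Java programs in Defects4J, the three functions and the input generator are hypothetical, and the rewrites shown are examples of the general kind of rule (renaming, equivalent-expression rewriting), not the rules used by ComPass.

```python
import random

def count_positive(xs):
    """Reference snippet: count the strictly positive elements."""
    count = 0
    for x in xs:
        if x > 0:
            count += 1
    return count

def count_positive_variant(values):
    """Variant after variable renaming and equivalent-expression rewrites."""
    n = 0
    for v in values:
        if not (v <= 0):   # rewrite of `v > 0`
            n = n + 1      # rewrite of `n += 1`
    return n

def count_positive_broken(values):
    """A rewrite that looks harmless but changes behavior on zero."""
    n = 0
    for v in values:
        if v >= 0:         # not equivalent to `v > 0`
            n += 1
    return n

def differential_test(f, g, trials=1000, seed=0):
    """Return an input on which f and g disagree, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-3, 3) for _ in range(rng.randint(0, 6))]
        if f(xs) != g(xs):
            return xs
    return None

preserving = differential_test(count_positive, count_positive_variant)
violating = differential_test(count_positive, count_positive_broken)
```

A `None` result is evidence of equivalence on the sampled inputs, not a proof. A returned counterexample, on the other hand, shows that the pair would be a mislabeled positive if it were fed to contrastive pre-training.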

What would settle it

A collection of patches with independently verified ground-truth correctness where the model trained under the described procedure systematically misclassifies a substantial fraction would falsify the performance claim.

Original abstract

Automated program repair (APR) attempts to reduce manual debugging efforts and plays a vital role in software maintenance. Despite remarkable progress, APR is still limited in generating overfitting patches, i.e., patches passing available test suites but incorrect. This issue, known as patch overfitting, has become a key concern in the APR community, with numerous approaches proposed to address it. Very recent work proposes a pre-trained language model (PLM)-based automated patch correctness assessment (APCA) approach, indicating the potential of such PLMs in reasoning about patch correctness. Despite being promising, it is still far from perfect due to various limitations, such as the training paradigm and training dataset. In this paper, we present ComPass, a PLM-based APCA approach that leverages contrastive learning and data augmentation to address the technical limitations of prior work. Our work is inspired by the opportunity to integrate contrastive learning with recent PLMs in the field of patch correctness assessment, where large-scale labeled patches are difficult to obtain. ComPass utilizes code transformation rules to generate semantic-preserving code snippets for both unlabeled pre-training corpus and labeled fine-tuning patches. ComPass then pre-trains PLMs with contrastive learning, which captures code features with the same semantics but different structures. ComPass finally integrates representation embeddings of patch code snippets and fine-tunes PLMs with a binary classifier jointly to assess patch code correctness. Experimental results on 2274 real-world patches from Defects4J demonstrate that ComPass achieves an accuracy of 88.35%, significantly outperforming state-of-the-art baseline APPT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ComPass adds contrastive pre-training of PLMs, driven by code transformation rules, to judge patch correctness and reports 88.35% accuracy on Defects4J, but rule validation, ablations, and statistical tests are missing.

ComPass applies contrastive learning to pre-trained language models for assessing whether patches generated by automated program repair tools are correct or just overfitting. It claims 88.35% accuracy on 2274 real patches from Defects4J, beating the APPT baseline. The new element is the use of code transformation rules to create semantic-preserving snippets for contrastive pre-training on unlabeled data, followed by joint fine-tuning with a binary classifier. This directly targets the scarcity of labeled patches that constrained earlier PLM work in the area. The paper frames the overfitting problem clearly and shows how contrastive objectives can push embeddings to capture semantic equivalence across structural variants, which is a reasonable extension of recent PLM techniques in software engineering. The approach is straightforward and builds on standard contrastive losses without introducing circular derivations.

The main soft spot is the load-bearing assumption that the transformation rules reliably produce pairs whose only differences are syntactic. The abstract states they generate semantic-preserving snippets, but supplies no validation that this holds for the specific buggy and patched code in Defects4J, no list of the rules, and no checks for behavioral equivalence. If even a modest fraction of pairs alter observable behavior, the contrastive signal becomes noisy and the accuracy lift could be artifactual rather than evidence of better correctness reasoning. The experimental section is also thin: no ablation on the contrastive component, no statistical tests, and limited baseline description.

The citation pattern follows the usual APR and PLM references without obvious gaps or self-referential inflation. This paper is for researchers working on ML-based automated repair who need better filters for overfitting patches. A reader already familiar with PLM applications in SE would get value from the data-augmentation idea and could adapt the contrastive setup. It deserves peer review so the full methods, rule validation, and results tables can be examined for reproducibility.

Referee Report

2 major / 2 minor

Summary. The paper introduces ComPass, a PLM-based approach for automated patch correctness assessment (APCA) that uses contrastive learning on code snippets generated by transformation rules claimed to be semantic-preserving. It pre-trains PLMs to capture semantic features despite syntactic variation, then fine-tunes with a binary classifier on labeled patches. Evaluation on 2274 Defects4J patches reports 88.35% accuracy, outperforming the APPT baseline.

Significance. If the performance claims and underlying assumptions hold, ComPass would meaningfully advance APR by improving detection of overfitting patches, thereby reducing wasted verification effort. The contrastive pre-training strategy is a reasonable response to scarce labeled patch data and could generalize to other code reasoning tasks. The scale of the Defects4J evaluation is a positive feature.

major comments (2)

[§3.2] §3.2 (Code Transformation Rules): The central claim that the rules produce reliably semantic-preserving positive pairs for contrastive pre-training is load-bearing, yet the manuscript supplies no empirical validation (e.g., differential testing on Defects4J programs or equivalence checking) confirming that all rules preserve observable behavior. A modest fraction of non-preserving pairs would contaminate the learned embeddings and could explain the reported lift over APPT without reflecting genuine correctness reasoning.
[§4.3] §4.3 (Experimental Results): The reported 88.35% accuracy and outperformance over APPT lack supporting details such as ablation studies isolating contrastive pre-training from data augmentation, statistical significance tests, or confidence intervals. Without these, it is impossible to assess whether the improvement is robust or attributable to the proposed method.

minor comments (2)

[Abstract] The abstract and §4.1 would benefit from explicit quantification of the improvement over APPT (absolute and relative) together with the exact baseline accuracy for direct comparison.
[§3.3] Notation in the contrastive loss description (likely §3.3) should more clearly distinguish positive-pair and negative-pair sampling to avoid ambiguity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential impact. We address each major comment below and will revise the manuscript accordingly.

Referee: [§3.2] §3.2 (Code Transformation Rules): The central claim that the rules produce reliably semantic-preserving positive pairs for contrastive pre-training is load-bearing, yet the manuscript supplies no empirical validation (e.g., differential testing on Defects4J programs or equivalence checking) confirming that all rules preserve observable behavior. A modest fraction of non-preserving pairs would contaminate the learned embeddings and could explain the reported lift over APPT without reflecting genuine correctness reasoning.

Authors: We agree that explicit empirical validation of semantic preservation would strengthen the central assumption. The transformation rules follow standard practices from prior code augmentation literature (e.g., variable renaming, dead-code insertion, and equivalent expression rewrites), which are intended to preserve semantics by construction. However, we acknowledge the absence of direct validation in the current manuscript. In the revision we will add a dedicated subsection with results from differential testing on a random sample of 200 Defects4J programs, reporting the observed preservation rate and any identified exceptions.

revision: yes

Referee: [§4.3] §4.3 (Experimental Results): The reported 88.35% accuracy and outperformance over APPT lack supporting details such as ablation studies isolating contrastive pre-training from data augmentation, statistical significance tests, or confidence intervals. Without these, it is impossible to assess whether the improvement is robust or attributable to the proposed method.

Authors: We concur that the evaluation section would benefit from additional rigor. The current manuscript reports only end-to-end accuracy; we will expand §4.3 to include (1) ablation studies that separately disable contrastive pre-training and data augmentation, (2) McNemar’s test for statistical significance of the improvement over APPT, and (3) 95% confidence intervals computed via bootstrap resampling. These additions will clarify the contribution of each component.

revision: yes
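
The two statistical checks named in this response can be sketched briefly. This is a generic illustration under stated assumptions, not the authors' evaluation code: the exact (binomial) form of McNemar's test and the percentile bootstrap are one common choice each, the function names are hypothetical, and all counts below are synthetic rather than drawn from the paper.

```python
import math
import random

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: patches classifier A got right and classifier B got wrong.
    c: patches classifier A got wrong and classifier B got right.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic disagreement counts between two classifiers.
p_skewed = mcnemar_exact(30, 10)
p_balanced = mcnemar_exact(5, 5)

# Synthetic per-patch outcomes: 88 correct predictions out of 100.
outcomes = [1] * 88 + [0] * 12
ci_low, ci_high = bootstrap_accuracy_ci(outcomes)
```

McNemar's test only looks at the patches on which the two classifiers disagree, which is why it suits a paired comparison on a shared benchmark. The bootstrap interval, in turn, shows how much a headline accuracy could move under resampling of the evaluation set.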

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline with external evaluation

The paper describes a standard contrastive pre-training plus fine-tuning pipeline on PLMs. Code transformation rules generate positive pairs for contrastive loss, followed by binary classification on labeled patches. No equations, derivations, or fitted parameters are presented that reduce the reported 88.35% accuracy to a self-referential quantity or input by construction. Performance is measured on the external Defects4J benchmark (2274 patches) against an independent baseline (APPT). The approach relies on the empirical validity of the transformation rules and standard contrastive objectives rather than any self-definitional or self-citation load-bearing step. This is a normal non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, equations, and experimental sections unavailable, so ledger entries are inferred from high-level description and marked as provisional.

axioms (1)

domain assumption: Code transformation rules produce semantic-preserving variants of patches.
Invoked to generate training pairs for contrastive learning on both unlabeled and labeled data

pith-pipeline@v0.9.0 · 5599 in / 1285 out tokens · 39510 ms · 2026-05-16T06:14:28.771922+00:00 · methodology
