Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection

Jany-Gabriel Ispas; Sergiu Nisioi

arxiv: 2605.01596 · v1 · submitted 2026-05-02 · 💻 cs.CL


Pith reviewed 2026-05-09 13:56 UTC · model grok-4.3


keywords AI-generated code detection · fine-tuning · pre-trained code models · binary classification · model attribution · data augmentation · shared task


The pith

Fine-tuning pre-trained code models with targeted strategies detects AI-generated code at competitive levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that adapting four existing pre-trained code models through subtask-specific fine-tuning produces strong results on both binary detection of AI-generated code and attribution to the generating model. It begins with a simple TF-IDF baseline and then applies techniques such as leave-one-language-out validation, code augmentation, and ensembling to reach macro-F1 scores of 0.737 on binary detection and 0.422 on 11-class attribution. If these adaptations hold, existing models for code understanding can be repurposed for authenticity verification without requiring entirely new architectures. The practical value lies in reusable methods for establishing code provenance in shared evaluation settings. Both submissions rank in the top ten among participating systems.

Core claim

Starting from a TF-IDF and Logistic Regression baseline, we fine-tune CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+ with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A and 0.422 on Subtask-B.
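Chunked inference with trimmed-mean aggregation can be illustrated with a minimal sketch. The summary does not give the chunk size, stride, or trim fraction the authors used, so the values below are assumptions, and `score_chunk` is a hypothetical stand-in for a fine-tuned classifier that returns the probability that a chunk is AI-generated.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[str], chunk_size: int, stride: int) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 < stride <= chunk_size:
        raise ValueError("need chunk_size > 0 and 0 < stride <= chunk_size")
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
        start += stride
    return chunks


def trimmed_mean(values: Sequence[float], trim_fraction: float = 0.2) -> float:
    """Mean after dropping the lowest and highest trim_fraction of the values."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def score_document(
    tokens: Sequence[str],
    score_chunk: Callable[[List[str]], float],  # hypothetical model wrapper
    chunk_size: int = 512,       # assumed, not taken from the paper
    stride: int = 256,           # assumed, not taken from the paper
    trim_fraction: float = 0.2,  # assumed, not taken from the paper
) -> float:
    """Score every chunk, then aggregate the chunk scores with a trimmed mean."""
    chunk_scores = [score_chunk(c) for c in split_into_chunks(tokens, chunk_size, stride)]
    return trimmed_mean(chunk_scores, trim_fraction)
```

The trimmed mean discards the most extreme chunk scores before averaging, so a single unusual chunk (for example, a licence header or a block of generated boilerplate) has less influence on the file-level decision than it would under a plain mean.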

What carries the argument

Subtask-specific fine-tuning of pre-trained code models together with data augmentation, cross-validation, and ensembling

If this is right

Binary classification of human-written versus AI-generated code reaches a macro-F1 of 0.737.
Attribution among 11 possible generating models reaches a macro-F1 of 0.422.
These scores correspond to sixth place among 81 teams in the binary subtask and seventh place among 34 teams in the attribution subtask.
The fine-tuned models outperform the TF-IDF and logistic regression baseline on both subtasks.
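Both headline numbers are macro-F1 scores, meaning the per-class F1 values are averaged with equal weight regardless of class frequency. The sketch below shows that computation in plain Python; the labels in the usage note are illustrative and are not drawn from the task data.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("inputs must be non-empty")
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For `y_true = [0, 0, 1, 1]` and `y_pred = [0, 1, 1, 1]`, class 0 has F1 = 2/3 and class 1 has F1 = 4/5, giving a macro-F1 of about 0.733. Because each class counts equally, a model that does poorly on a few of the 11 attribution classes is penalised heavily, which is one reason the two subtask scores are not directly comparable.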

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Chunked inference with trimmed-mean aggregation may help handle variable-length code inputs in other detection settings.
Class-balanced loss combined with multi-seed ensembling offers a template for handling imbalanced classes in code-related classification tasks.
The performance gap between binary detection and fine-grained attribution suggests that identifying the exact source model remains harder than distinguishing human from machine code.
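As a rough illustration of the class-balanced loss mentioned above, one widely used formulation weights each class by the inverse of its "effective number of samples", (1 − β^n) / (1 − β), where n is the class count. The sketch assumes that formulation; the value of β and the class counts are hypothetical and are not reported in this summary.

```python
from typing import Dict, Mapping


def class_balanced_weights(class_counts: Mapping[str, int], beta: float = 0.999) -> Dict[str, float]:
    """Per-class loss weights proportional to (1 - beta) / (1 - beta ** n).

    Weights are rescaled so that they sum to the number of classes.
    beta = 0 gives uniform weights; beta close to 1 approaches inverse frequency.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    raw = {}
    for label, n in class_counts.items():
        if n <= 0:
            raise ValueError(f"class {label!r} must have a positive count")
        effective_num = (1.0 - beta ** n) / (1.0 - beta)
        raw[label] = 1.0 / effective_num
    scale = len(raw) / sum(raw.values())
    return {label: w * scale for label, w in raw.items()}


# Hypothetical counts for three generator classes (illustrative only).
weights = class_balanced_weights({"generator_a": 5000, "generator_b": 500, "generator_c": 50})
```

The resulting weights would typically multiply the per-example cross-entropy term for each class, so that rare generators contribute more to the gradient than their raw frequency would allow.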

Load-bearing premise

The described fine-tuning strategies and data-augmentation choices will continue to perform well on future, unseen code distributions and model families beyond the SemEval-2026 test sets.

What would settle it

Testing the fine-tuned models on code generated by a previously unseen AI model or in a programming language absent from the training distribution and finding macro-F1 scores well below 0.737 or 0.422 would indicate the strategies do not generalize as claimed.
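The leave-one-language-out protocol used for Subtask-A is a small-scale version of this test: each fold holds out every sample from one programming language and trains on the rest. A minimal sketch of the split logic follows; the language names in the usage note are placeholders, since this summary does not list the languages in the training data.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_language_out(
    languages: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_language, train_indices, validation_indices) for each language.

    languages[i] is the programming language of sample i.
    """
    for held_out in sorted(set(languages)):
        train_idx = [i for i, lang in enumerate(languages) if lang != held_out]
        val_idx = [i for i, lang in enumerate(languages) if lang == held_out]
        yield held_out, train_idx, val_idx


# Placeholder language tags for four samples (illustrative only).
sample_languages = ["python", "java", "python", "cpp"]
for held_out, train_idx, val_idx in leave_one_language_out(sample_languages):
    print(held_out, train_idx, val_idx)
```

Validation scores from such folds estimate how a detector behaves on a language it has never seen, which is closer to the generalization question raised here than a random split would be.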

Figures

Figures reproduced from arXiv: 2605.01596 by Jany-Gabriel Ispas, Sergiu Nisioi.

**Figure 1.** Subtask-A training data. (a) Language distri…

**Figure 3.** Code length distribution per class in Subtask…

Original abstract

This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
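The abstract mentions threshold calibration without describing the procedure. One common way to do it, sketched here as an assumption rather than as the authors' method, is to sweep candidate decision thresholds over a held-out calibration set and keep the one that maximises macro-F1. The probabilities and labels in the usage note are made up.

```python
from typing import List, Optional, Sequence, Tuple


def binary_macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Macro-F1 over the two classes 0 (human-written) and 1 (AI-generated)."""
    scores = []
    for positive in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2


def calibrate_threshold(
    probs: Sequence[float],
    labels: Sequence[int],
    candidates: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, macro_f1) for the best candidate threshold.

    A sample is predicted as class 1 when its probability is >= threshold.
    Ties are resolved in favour of the first (lowest) candidate.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best_threshold, best_score = 0.5, -1.0
    for threshold in candidates:
        preds: List[int] = [1 if p >= threshold else 0 for p in probs]
        score = binary_macro_f1(labels, preds)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up calibration data: the default 0.5 cut-off would misclassify two positives.
threshold, score = calibrate_threshold([0.1, 0.2, 0.3, 0.35, 0.6, 0.9], [0, 0, 1, 1, 1, 1])
```

The calibration set should be disjoint from the data used to report final scores; otherwise the chosen threshold is tuned to the evaluation data and the reported macro-F1 is optimistic.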

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward shared-task system paper that applies standard fine-tuning and ensembling to AI-generated code detection and lands top-ten rankings on both subtasks, with no new methods or analysis.


The main takeaway is that the authors took four off-the-shelf code models, added some practical engineering steps, and placed 6th out of 81 on binary detection and 7th out of 34 on model attribution in SemEval-2026 Task 13. That is the whole story.

They start from a TF-IDF baseline, then fine-tune CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+ separately for each subtask. For the binary case they use leave-one-language-out cross-validation, simple code augmentation, chunked inference with trimmed-mean pooling, and threshold tuning. For the 11-class attribution they add sandwich token packing, class-balanced loss, multi-seed ensembling, and test-time augmentation. The best runs reach macro-F1 of 0.737 and 0.422 respectively. Those numbers are new for this exact benchmark and the description of the pipeline is clear enough that someone could re-implement the main ideas without much trouble.

The work is honest about being a competition entry rather than a research contribution. No one is claiming a new algorithm or theoretical result. The soft spots are exactly what you would expect from a system description: there are no ablation tables, no error bars, no statistical tests, and no error analysis that would let a reader judge which of the listed tricks actually moved the needle. The paper also gives no evidence that the same choices would hold up on future model families or code distributions outside the SemEval test sets. Because the central claims are limited to “these steps produced these scores on this data,” the absence of deeper diagnostics is noticeable but not fatal for its intended venue.

This paper is for people who are entering the same shared task or who need a quick reference for what worked on this particular benchmark. It does not contain enough new insight or rigorous evidence to justify a full journal review, but it is solid enough for the SemEval proceedings or workshop track. I would send it to peer review rather than desk-reject it.

Referee Report

1 major / 0 minor

Summary. The manuscript describes the Archaeology team's submission to SemEval-2026 Task 13 on AI-generated code detection. It starts from a TF-IDF + Logistic Regression baseline and fine-tunes four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, CodeT5+) using task-specific techniques: for Subtask A (binary human vs. AI-generated classification) these include leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration; for Subtask B (11-class model attribution) they include sandwich token packing, class-balanced loss, multi-seed ensembling, and test-time augmentation. The best submissions achieve macro-F1 of 0.737 (6th/81 teams) on Subtask A and 0.422 (7th/34 teams) on Subtask B.
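Multi-seed ensembling with test-time augmentation can be sketched as soft voting: the class-probability vectors from every seed and every augmented view of a sample are averaged, and the class with the highest mean probability is chosen. The summary does not say how the authors combined predictions, so averaging is an assumption here, and the probability vectors in the usage note are invented.

```python
from typing import List, Sequence


def average_probabilities(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of class-probability vectors for a single sample.

    Each vector comes from one (seed, augmented view) combination.
    """
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    num_classes = len(prob_vectors[0])
    if any(len(v) != num_classes for v in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    n = len(prob_vectors)
    return [sum(v[c] for v in prob_vectors) / n for c in range(num_classes)]


def ensemble_predict(prob_vectors: Sequence[Sequence[float]]) -> int:
    """Index of the class with the highest averaged probability."""
    averaged = average_probabilities(prob_vectors)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Invented outputs from three seeds on one sample with three classes.
seed_outputs = [
    [0.6, 0.4, 0.0],  # seed 1 prefers class 0
    [0.2, 0.7, 0.1],  # seed 2 prefers class 1
    [0.3, 0.5, 0.2],  # seed 3 prefers class 1
]
predicted_class = ensemble_predict(seed_outputs)
```

Averaging probabilities rather than counting hard votes lets a confident model outweigh an uncertain one, and it reduces the variance that comes from any single random initialisation.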

Significance. If the reported rankings hold, the work supplies a competitive, reproducible system description that demonstrates the practical value of combining standard fine-tuning with targeted augmentation and ensembling for code-generation detection. The explicit enumeration of strategies (e.g., leave-one-language-out CV, class-balanced loss) offers a useful reference point for other participants and future benchmark work. However, the absence of ablation tables, error bars, or significance tests restricts the ability to isolate which components drive the observed performance, thereby limiting the paper's broader methodological contribution.

major comments (1)

[Results] Results section (and abstract): the macro-F1 scores 0.737 and 0.422 are stated without error bars, standard deviations across seeds, ablation tables, or statistical significance tests against the TF-IDF baseline or other submissions; this directly weakens verification of the central performance claims and the associated rankings.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the positive summary of our system description and the recommendation for minor revision. We address the concern regarding the presentation of results below.


Referee: [Results] Results section (and abstract): the macro-F1 scores 0.737 and 0.422 are stated without error bars, standard deviations across seeds, ablation tables, or statistical significance tests against the TF-IDF baseline or other submissions; this directly weakens verification of the central performance claims and the associated rankings.

Authors: We agree that the lack of error bars, standard deviations, ablation tables, and statistical significance tests limits the ability to fully verify and attribute the performance gains. As this is a system paper for SemEval-2026 Task 13, the focus was on documenting the reproducible pipeline that achieved the reported rankings within the shared-task timeline. In the revised version we will add: (i) standard deviations across the multiple seeds used for ensembling in both subtasks, (ii) a partial ablation table isolating the effects of code augmentation (Subtask A), sandwich packing, class-balanced loss, and test-time augmentation (Subtask B), and (iii) a statistical comparison (e.g., McNemar test on cross-validation folds) between our best model and the TF-IDF + Logistic Regression baseline. We will also explicitly note that official rankings are provided by the task organizers and that we lack access to other teams' raw predictions or per-run variances, precluding direct significance tests against competing submissions.

Revision: partial
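For reference, the McNemar comparison proposed in point (iii) only looks at the samples on which the two systems disagree. The sketch below implements the exact (binomial) variant with the standard library; it is one reasonable choice and not necessarily the variant the authors would use, and the predictions in the usage note are invented.

```python
from math import comb
from typing import Hashable, Sequence, Tuple


def mcnemar_exact(
    y_true: Sequence[Hashable],
    pred_a: Sequence[Hashable],
    pred_b: Sequence[Hashable],
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test for two classifiers on the same samples.

    Returns (b, c, p_value), where b counts samples that only classifier A
    gets right and c counts samples that only classifier B gets right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree on correctness
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented example: model A is right on all 10 samples, the baseline on only 2.
labels = [1] * 10
model_a = [1] * 10
baseline = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b, c, p_value = mcnemar_exact(labels, model_a, baseline)
```

In this toy case b = 8 and c = 0, so the two-sided p-value is 2 × (1/2)^8 ≈ 0.0078. The test needs paired per-sample predictions from both systems, which is why it can be run against the authors' own baseline but not against other teams' submissions.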

standing simulated objections not resolved

Statistical significance testing against other submissions' results, as we do not have access to their individual model outputs or variance estimates.

Circularity Check

0 steps flagged

No significant circularity


The paper is a purely empirical report on fine-tuning and ensembling strategies for a shared-task benchmark. It starts from a TF-IDF baseline, applies standard fine-tuning, augmentation, and aggregation techniques, and reports macro-F1 scores on held-out test data. No equations, first-principles derivations, or self-referential definitions appear; the central claims are measured performance numbers that do not reduce to fitted inputs or prior self-citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical entities; the work rests entirely on empirical training of existing pre-trained models.



