Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection
Pith reviewed 2026-05-09 13:56 UTC · model grok-4.3
The pith
Fine-tuning pre-trained code models with targeted strategies detects AI-generated code at competitive levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a TF-IDF and Logistic Regression baseline, we fine-tune CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+ with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A and 0.422 on Subtask-B.
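To make the Subtask-A inference step concrete, here is a minimal sketch of chunked inference with trimmed-mean aggregation followed by a calibrated threshold. The function names, the chunk size, the stride, the trim fraction, and the `score_chunk` callable (standing in for a fine-tuned model that returns a probability of "AI-generated") are all illustrative assumptions; the paper's actual settings are not given in this summary.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[str], chunk_size: int, stride: int) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens over the input, stepping by `stride`."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    if len(tokens) <= chunk_size:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # this window reached the end
            break
        start += stride
    return chunks


def trimmed_mean(values: Sequence[float], trim_fraction: float = 0.1) -> float:
    """Drop the lowest and highest `trim_fraction` of values, then average the rest."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    # Fall back to the plain mean if trimming would remove everything.
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(kept) / len(kept)


def predict_is_ai(
    tokens: Sequence[str],
    score_chunk: Callable[[List[str]], float],  # hypothetical model wrapper
    chunk_size: int = 512,      # assumed value
    stride: int = 256,          # assumed value
    trim_fraction: float = 0.1, # assumed value
    threshold: float = 0.5,     # would be calibrated on held-out data
) -> bool:
    """Score every chunk, aggregate with a trimmed mean, and apply the threshold."""
    scores = [score_chunk(chunk) for chunk in split_into_chunks(tokens, chunk_size, stride)]
    return trimmed_mean(scores, trim_fraction) >= threshold
```

The trimmed mean is less sensitive than a plain mean to a few outlier chunks, such as a license header or a block of boilerplate imports, which is one plausible reason to prefer it for long files.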
What carries the argument
Subtask-specific fine-tuning of pre-trained code models together with data augmentation, cross-validation, and ensembling
If this is right
- Binary classification of human-written versus AI-generated code reaches a macro-F1 of 0.737.
- Attribution among 11 possible generating models reaches a macro-F1 of 0.422.
- These scores correspond to sixth place among 81 teams in the binary subtask and seventh place among 34 teams in the attribution subtask.
- The fine-tuned models outperform the TF-IDF and logistic regression baseline on both subtasks.
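Both headline numbers are macro-F1 scores, so it helps to recall how that metric is computed: F1 is calculated per class and then averaged with equal weight, regardless of class frequency. The sketch below is a plain-Python illustration with made-up labels, not the task's official scorer.

```python
from collections import Counter
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example (labels are illustrative only).
y_true = ["human", "human", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally, a rarely seen generator in the 11-class attribution subtask pulls the score down as much as a frequent one, which is part of why 0.422 and 0.737 are not directly comparable.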
Where Pith is reading between the lines
- Chunked inference with trimmed-mean aggregation may help handle variable-length code inputs in other detection settings.
- Class-balanced loss combined with multi-seed ensembling offers a template for handling imbalanced classes in code-related classification tasks.
- The performance gap between binary detection and fine-grained attribution suggests that identifying the exact source model remains harder than distinguishing human from machine code.
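As an illustration of the class-balanced loss mentioned above, the sketch below uses the common "effective number of samples" weighting, where class c with n_c training examples gets weight (1 - beta) / (1 - beta^n_c). Whether the paper uses exactly this form, and with what beta, is an assumption; the class counts in the example are invented.

```python
import math
from typing import List, Sequence


def class_balanced_weights(counts: Sequence[int], beta: float = 0.999) -> List[float]:
    """Per-class weights (1 - beta) / (1 - beta**n), rescaled to sum to the class count."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    if any(n <= 0 for n in counts):
        raise ValueError("every class needs at least one sample")
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


def weighted_cross_entropy(logits: Sequence[float], target: int, weights: Sequence[float]) -> float:
    """Cross-entropy for one example, multiplied by the weight of its true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return weights[target] * (log_z - logits[target])


# Invented counts: one rare generator and one frequent generator.
weights = class_balanced_weights([10, 1000])
```

With beta = 0 every class gets weight 1 (no rebalancing), and as beta approaches 1 the weights approach inverse class frequency, so beta interpolates between the two.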
Load-bearing premise
The described fine-tuning strategies and data-augmentation choices will continue to perform well on future, unseen code distributions and model families beyond the SemEval-2026 test sets.
What would settle it
Testing the fine-tuned models on code generated by a previously unseen AI model or in a programming language absent from the training distribution and finding macro-F1 scores well below 0.737 or 0.422 would indicate the strategies do not generalize as claimed.
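The language-shift half of this test mirrors the leave-one-language-out cross-validation the system already uses for Subtask-A. A minimal sketch of that split follows; the function name and the language tags are illustrative, and the actual fold construction in the paper may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_language_out(languages: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_language, train_indices, validation_indices) for each language."""
    for held_out in sorted(set(languages)):
        train = [i for i, lang in enumerate(languages) if lang != held_out]
        val = [i for i, lang in enumerate(languages) if lang == held_out]
        yield held_out, train, val


# One language tag per training sample (toy data).
sample_languages = ["python", "java", "python", "cpp"]
folds = list(leave_one_language_out(sample_languages))
```

Each fold validates on a language the model never saw during training, so the averaged validation score is a rough proxy for performance on an unseen language rather than on an unseen generator.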
Original abstract
This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the Archaeology team's submission to SemEval-2026 Task 13 on AI-generated code detection. It starts from a TF-IDF + Logistic Regression baseline and fine-tunes four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, CodeT5+) using task-specific techniques: for Subtask A (binary human vs. AI-generated classification) these include leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration; for Subtask B (11-class model attribution) they include sandwich token packing, class-balanced loss, multi-seed ensembling, and test-time augmentation. The best submissions achieve macro-F1 of 0.737 (6th/81 teams) on Subtask A and 0.422 (7th/34 teams) on Subtask B.
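For readers unfamiliar with multi-seed ensembling and test-time augmentation, one common realization is to average the class-probability vectors produced by several seeds and several augmented views of the same input, then take the argmax. The sketch below assumes probability averaging; the paper may instead average logits or use voting, and the numbers are invented.

```python
from typing import List, Sequence, Tuple


def ensemble_predict(prob_vectors: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Average probability vectors (one per seed/augmented view) and return (argmax, mean)."""
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    num_classes = len(prob_vectors[0])
    if any(len(p) != num_classes for p in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    n = len(prob_vectors)
    mean = [sum(p[j] for p in prob_vectors) / n for j in range(num_classes)]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean


# Three hypothetical runs (seeds or augmented views) on a two-class problem.
label, mean_probs = ensemble_predict([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```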
Significance. If the reported rankings hold, the work supplies a competitive, reproducible system description that demonstrates the practical value of combining standard fine-tuning with targeted augmentation and ensembling for code-generation detection. The explicit enumeration of strategies (e.g., leave-one-language-out CV, class-balanced loss) offers a useful reference point for other participants and future benchmark work. However, the absence of ablation tables, error bars, or significance tests restricts the ability to isolate which components drive the observed performance, thereby limiting the paper's broader methodological contribution.
major comments (1)
- [Results] Results section (and abstract): the macro-F1 scores 0.737 and 0.422 are stated without error bars, standard deviations across seeds, ablation tables, or statistical significance tests against the TF-IDF baseline or other submissions; this directly weakens verification of the central performance claims and the associated rankings.
Simulated Author's Rebuttal
We thank the referee for the positive summary of our system description and the recommendation for minor revision. We address the concern regarding the presentation of results below.
Point-by-point responses
- Referee: [Results] Results section (and abstract): the macro-F1 scores 0.737 and 0.422 are stated without error bars, standard deviations across seeds, ablation tables, or statistical significance tests against the TF-IDF baseline or other submissions; this directly weakens verification of the central performance claims and the associated rankings.
  Authors: We agree that the lack of error bars, standard deviations, ablation tables, and statistical significance tests limits the ability to fully verify and attribute the performance gains. As this is a system paper for SemEval-2026 Task 13, the focus was on documenting the reproducible pipeline that achieved the reported rankings within the shared-task timeline. In the revised version we will add: (i) standard deviations across the multiple seeds used for ensembling in both subtasks, (ii) a partial ablation table isolating the effects of code augmentation (Subtask A), sandwich packing, class-balanced loss, and test-time augmentation (Subtask B), and (iii) a statistical comparison (e.g., McNemar test on cross-validation folds) between our best model and the TF-IDF + Logistic Regression baseline. We will also explicitly note that official rankings are provided by the task organizers and that we lack access to other teams' raw predictions or per-run variances, precluding direct significance tests against competing submissions.
  Revision: partial
- Statistical significance testing against other submissions' results, as we do not have access to their individual model outputs or variance estimates.
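The McNemar comparison proposed in the rebuttal can be sketched as follows. This is the exact (binomial) two-sided variant computed on paired predictions from two systems over the same examples; the function name and the toy data are illustrative, and the authors may choose the chi-squared approximation instead.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[int, int, float]:
    """Return (b, c, p_value), where b counts examples only system A gets right
    and c counts examples only system B gets right."""
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two systems never disagree on correctness
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: system A is always right, system B misses the first five examples.
truth = [1] * 10
system_a = [1] * 10
system_b = [0] * 5 + [1] * 5
b, c, p_value = mcnemar_exact(truth, system_a, system_b)
```

The test needs per-example predictions from both systems, which is why it is feasible against the authors' own TF-IDF baseline but not against other teams' submissions.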
Circularity Check
No significant circularity
The paper is a purely empirical report on fine-tuning and ensembling strategies for a shared-task benchmark. It starts from a TF-IDF baseline, applies standard fine-tuning, augmentation, and aggregation techniques, and reports macro-F1 scores on held-out test data. No equations, first-principles derivations, or self-referential definitions appear; the central claims are measured performance numbers that do not reduce to fitted inputs or prior self-citations by construction.