arxiv: 2605.01596 · v1 · submitted 2026-05-02 · 💻 cs.CL

Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection

Pith reviewed 2026-05-09 13:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords AI-generated code detection · fine-tuning · pre-trained code models · binary classification · model attribution · data augmentation · shared task

The pith

Fine-tuning pre-trained code models with targeted strategies detects AI-generated code at competitive levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that adapting four existing pre-trained code models through subtask-specific fine-tuning produces strong results on both binary detection of AI-generated code and attribution to the generating model. It begins with a simple TF-IDF and Logistic Regression baseline and then applies techniques such as leave-one-language-out validation, code augmentation, and ensembling to reach macro-F1 scores of 0.737 on binary detection and 0.422 on 11-class attribution. If these adaptations hold up, existing code-understanding models can be repurposed for authenticity verification without entirely new architectures. The practical value lies in concrete, reusable methods for establishing code provenance in a shared evaluation setting. The submissions rank sixth of 81 teams on the binary subtask and seventh of 34 teams on the attribution subtask.

Core claim

Starting from a TF-IDF and Logistic Regression baseline, we fine-tune CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+ with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A and 0.422 on Subtask-B.
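
To make the leave-one-language-out idea concrete, here is a minimal sketch of how such folds could be built: each programming language is held out once for validation while the model trains on all the others. The function name `leave_one_language_out` and the toy language labels are hypothetical; the paper's actual splitting code and language set are not given on this page.

```python
from collections import defaultdict


def leave_one_language_out(languages):
    """Yield (held_out_language, train_indices, val_indices) per language.

    `languages` holds one language label per code sample. This is an
    illustrative helper, not the authors' implementation.
    """
    by_language = defaultdict(list)
    for index, language in enumerate(languages):
        by_language[language].append(index)

    for held_out in sorted(by_language):
        val_indices = by_language[held_out]
        # Every sample written in a different language goes to training.
        train_indices = sorted(
            index
            for language, indices in by_language.items()
            if language != held_out
            for index in indices
        )
        yield held_out, train_indices, val_indices


# Hypothetical toy data: one language label per sample.
sample_languages = ["python", "java", "python", "cpp", "java", "cpp"]
for held_out, train_idx, val_idx in leave_one_language_out(sample_languages):
    print(held_out, train_idx, val_idx)
```

Validating on a language the model never saw during training gives a rough estimate of cross-language generalization, which is presumably why this scheme suits a multi-language detection task.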

What carries the argument

Subtask-specific fine-tuning of pre-trained code models together with data augmentation, cross-validation, and ensembling

If this is right

  • Binary classification of human-written versus AI-generated code reaches a macro-F1 of 0.737.
  • Attribution among 11 possible generating models reaches a macro-F1 of 0.422.
  • These scores correspond to sixth place among 81 teams in the binary subtask and seventh place among 34 teams in the attribution subtask.
  • The fine-tuned models outperform the TF-IDF and logistic regression baseline on both subtasks.
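
Because both headline numbers are macro-F1 scores, it may help to see how the metric is computed: F1 is calculated per class and then averaged with equal weight, so rare classes count as much as frequent ones. The sketch below is a plain-Python illustration under that standard definition; the official task scorer may differ in details such as how absent classes are handled.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy example: three human-written samples (0) and one AI-generated sample (1).
print(macro_f1([0, 0, 0, 1], [0, 0, 1, 1]))
```

In the toy example the per-class F1 values are 0.8 and 2/3, so the macro average sits below the 0.75 accuracy, illustrating how the metric penalizes errors on the minority class.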

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chunked inference with trimmed-mean aggregation may help handle variable-length code inputs in other detection settings.
  • Class-balanced loss combined with multi-seed ensembling offers a template for handling imbalanced classes in code-related classification tasks.
  • The performance gap between binary detection and fine-grained attribution suggests that identifying the exact source model remains harder than distinguishing human from machine code.
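
As a rough illustration of chunked inference with trimmed-mean aggregation, the sketch below splits a long token sequence into overlapping windows, scores each window, and averages the scores after discarding the most extreme ones. The chunk size, stride, trim fraction, and the `score_chunk` callable are all assumptions made for illustration; the page does not state the values or the model interface the authors used.

```python
def split_into_chunks(tokens, chunk_size, stride):
    """Split a token list into overlapping windows of at most chunk_size."""
    if len(tokens) <= chunk_size:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def trimmed_mean(values, trim_fraction=0.2):
    """Mean after dropping trim_fraction of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    # Fall back to the plain mean if trimming would remove everything.
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(kept) / len(kept)


def score_document(tokens, score_chunk, chunk_size=4, stride=2, trim_fraction=0.2):
    """Aggregate per-chunk scores; score_chunk is a hypothetical classifier
    returning the probability that a chunk is AI-generated."""
    chunks = split_into_chunks(tokens, chunk_size, stride)
    return trimmed_mean([score_chunk(chunk) for chunk in chunks], trim_fraction)


# One outlier chunk score (0.1) is discarded before averaging.
print(trimmed_mean([0.9, 0.8, 0.85, 0.1, 0.95]))
```

Trimming makes the document-level score less sensitive to a single atypical chunk, such as a block of boilerplate imports, than a plain mean would be.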

Load-bearing premise

The described fine-tuning strategies and data-augmentation choices will continue to perform well on future, unseen code distributions and model families beyond the SemEval-2026 test sets.

What would settle it

Testing the fine-tuned models on code generated by a previously unseen AI model or in a programming language absent from the training distribution and finding macro-F1 scores well below 0.737 or 0.422 would indicate the strategies do not generalize as claimed.

Figures

Figures reproduced from arXiv: 2605.01596 by Jany-Gabriel Ispas, Sergiu Nisioi.

Figure 1: Subtask-A training data. (a) Language distri…
Figure 3: Code length distribution per class in Subtask…

Original abstract

This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
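
For readers unfamiliar with class-balanced loss, one common formulation (Cui et al., 2019) weights each class by the inverse of its "effective number" of samples, (1 - β) / (1 - β^n), where n is the class count. The sketch below computes such weights; whether the authors used exactly this formulation, and with which β, is an assumption, and the class names in the example are hypothetical.

```python
from collections import Counter


def class_balanced_weights(labels, beta=0.999):
    """Per-class weights from the effective number of samples.

    weight_c is proportional to (1 - beta) / (1 - beta ** n_c), then
    rescaled so the weights sum to the number of classes. The value of
    beta here is illustrative, not taken from the paper.
    """
    counts = Counter(labels)
    raw = {
        label: (1.0 - beta) / (1.0 - beta ** count)
        for label, count in counts.items()
    }
    scale = len(raw) / sum(raw.values())
    return {label: weight * scale for label, weight in raw.items()}


# Hypothetical imbalanced label set: a frequent and a rare generator.
toy_labels = ["generator_a"] * 900 + ["generator_b"] * 100
print(class_balanced_weights(toy_labels))
```

The resulting weights would typically multiply the per-class terms of a cross-entropy loss, so that errors on rare generators contribute more to the gradient than errors on frequent ones.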

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript describes the Archaeology team's submission to SemEval-2026 Task 13 on AI-generated code detection. It starts from a TF-IDF + Logistic Regression baseline and fine-tunes four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, CodeT5+) using task-specific techniques: for Subtask A (binary human vs. AI-generated classification) these include leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration; for Subtask B (11-class model attribution) they include sandwich token packing, class-balanced loss, multi-seed ensembling, and test-time augmentation. The best submissions achieve macro-F1 of 0.737 (6th/81 teams) on Subtask A and 0.422 (7th/34 teams) on Subtask B.

Significance. If the reported rankings hold, the work supplies a competitive, reproducible system description that demonstrates the practical value of combining standard fine-tuning with targeted augmentation and ensembling for code-generation detection. The explicit enumeration of strategies (e.g., leave-one-language-out CV, class-balanced loss) offers a useful reference point for other participants and future benchmark work. However, the absence of ablation tables, error bars, or significance tests restricts the ability to isolate which components drive the observed performance, thereby limiting the paper's broader methodological contribution.

major comments (1)
  1. [Results] Results section (and abstract): the macro-F1 scores 0.737 and 0.422 are stated without error bars, standard deviations across seeds, ablation tables, or statistical significance tests against the TF-IDF baseline or other submissions; this directly weakens verification of the central performance claims and the associated rankings.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the positive summary of our system description and the recommendation for minor revision. We address the concern regarding the presentation of results below.

point-by-point responses
  1. Referee: [Results] Results section (and abstract): the macro-F1 scores 0.737 and 0.422 are stated without error bars, standard deviations across seeds, ablation tables, or statistical significance tests against the TF-IDF baseline or other submissions; this directly weakens verification of the central performance claims and the associated rankings.

    Authors: We agree that the lack of error bars, standard deviations, ablation tables, and statistical significance tests limits the ability to fully verify and attribute the performance gains. As this is a system paper for SemEval-2026 Task 13, the focus was on documenting the reproducible pipeline that achieved the reported rankings within the shared-task timeline. In the revised version we will add: (i) standard deviations across the multiple seeds used for ensembling in both subtasks, (ii) a partial ablation table isolating the effects of code augmentation (Subtask A), sandwich packing, class-balanced loss, and test-time augmentation (Subtask B), and (iii) a statistical comparison (e.g., McNemar test on cross-validation folds) between our best model and the TF-IDF + Logistic Regression baseline. We will also explicitly note that official rankings are provided by the task organizers and that we lack access to other teams' raw predictions or per-run variances, precluding direct significance tests against competing submissions. revision: partial
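
The McNemar comparison mentioned in the response can be sketched as follows. The test looks only at examples where the two models disagree: if neither model is better, each such disagreement is equally likely to favor either one. The exact binomial variant used here is one common choice and is an assumption; the rebuttal does not say which variant the authors intend, and the function name and toy predictions are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (only_a_correct, only_b_correct, p_value).
    """
    triples = list(zip(y_true, pred_a, pred_b))
    only_a = sum(1 for t, a, b in triples if a == t and b != t)
    only_b = sum(1 for t, a, b in triples if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagree
    k = min(only_a, only_b)
    # Two-sided p-value under Binomial(n, 0.5), capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy case: model A is right on six examples where model B is wrong.
truth = [1] * 10
model_a = [1] * 10
model_b = [1] * 4 + [0] * 6
print(mcnemar_exact(truth, model_a, model_b))
```

Because the test needs both models' predictions on the same examples, it can compare the fine-tuned models with the TF-IDF baseline but not with other teams' systems whose outputs are unavailable, which matches the limitation the authors note.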

standing simulated objections not resolved
  • Statistical significance testing against other submissions' results, as we do not have access to their individual model outputs or variance estimates.

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely empirical report on fine-tuning and ensembling strategies for a shared-task benchmark. It starts from a TF-IDF baseline, applies standard fine-tuning, augmentation, and aggregation techniques, and reports macro-F1 scores on held-out test data. No equations, first-principles derivations, or self-referential definitions appear; the central claims are measured performance numbers that do not reduce to fitted inputs or prior self-citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical entities; the work rests entirely on empirical training of existing pre-trained models.

