UCSC-NLP at SemEval-2026 Task 13: Multi-View Generalization and Diagnostic Analysis of Machine-Generated Code Detection

Kargi Chauhan; Sadiba Nusrat Nur

arxiv: 2604.26990 · v1 · submitted 2026-04-28 · 💻 cs.SE


Pith reviewed 2026-05-07 15:45 UTC · model grok-4.3

classification 💻 cs.SE

keywords machine generated code detection · multi-view training · UniXcoder · class imbalance · generator invariant representations · SemEval · LLM attribution · delexicalization


The pith

A multi-view training framework on UniXcoder detects machine-generated code across unseen languages and domains while class-weighted training mitigates imbalance in multi-class attribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for SemEval-2026 Task 13 on multilingual machine-generated code detection. For binary classification between human and AI code, the authors fine-tune UniXcoder-base using a multi-view framework that includes domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation to promote generator-invariant representations. This approach achieves strong performance on validation data and generalizes to a test set with unseen languages and domains. For the multi-class subtask of attributing code to specific LLM families, they demonstrate that severe class imbalance causes standard fine-tuning to fail on minority classes, but a class-weighted extension substantially improves results. Distinguishing human-written from AI-generated code supports academic integrity, hiring decisions, and software security.

Core claim

We fine-tune UniXcoder-base with a multi-view training framework that promotes generator-invariant representations. The framework combines domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation. Our system achieves 0.993 macro F1 on validation and 0.845 macro F1 on the test set, which spans unseen languages and domains. For Subtask B, we show that severe class imbalance causes catastrophic minority-class failure under standard fine-tuning, with macro F1 collapsing to 0.086 despite 88.4% accuracy. A class-weighted extension trained for 3 epochs recovers macro F1 to 0.345.
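Two of the listed components, delexicalization and token dropout, can be illustrated with a minimal sketch. The paper's exact procedure is not given here, so everything below is an assumption: the regex-based identifier matcher, the tiny `KEYWORDS` set, the `ID0`, `ID1`, ... placeholder scheme, and the choice to replace dropped tokens with a `<mask>` string rather than delete them.

```python
import random
import re

# Hypothetical keyword list; a real system would need one per language.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "in", "import", "class"}
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def delexicalize(code: str) -> str:
    """Replace each distinct non-keyword identifier with a stable placeholder.

    Crude by design: identifiers inside strings and comments are rewritten too.
    """
    mapping = {}

    def repl(match):
        name = match.group(0)
        if name in KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"ID{len(mapping)}"
        return mapping[name]

    return IDENT.sub(repl, code)


def token_dropout(tokens, p, mask="<mask>", rng=None):
    """Independently replace each token with `mask` with probability p."""
    rng = rng if rng is not None else random.Random(0)
    return [mask if rng.random() < p else tok for tok in tokens]


print(delexicalize("def add(a, b): return a + b"))
# -> def ID0(ID1, ID2): return ID1 + ID2
```

The original and delexicalized strings would then serve as two views of the same sample, so that the classifier cannot rely on naming habits alone.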

What carries the argument

multi-view training framework combining domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation to produce generator-invariant representations
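A symmetric KL consistency term penalizes disagreement between the class distributions predicted for two views of the same input, for example the original and the delexicalized code. The sketch below is a plain-Python illustration under stated assumptions: the 0.5 averaging factor, the `eps` smoothing constant, and the function names are choices made for this example, not details confirmed by the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(logits_a, logits_b):
    """Average of KL(p || q) and KL(q || p) over the two views' predictions."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


# Identical predictions give zero penalty; diverging predictions give a positive one.
same = symmetric_kl([2.0, -1.0], [2.0, -1.0])
diff = symmetric_kl([2.0, -1.0], [-1.0, 2.0])
```

In training, a term like this would typically be added to the classification loss with a weighting coefficient, whose value is not reported in the material above.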

If this is right

The multi-view framework generalizes binary detection to unseen languages and domains.
Under standard fine-tuning, severe class imbalance collapses macro F1 to 0.086 in multi-class attribution despite 88.4% accuracy, as minority classes fail almost entirely.
Class-weighted training raises macro F1 from 0.086 to 0.345, a relative improvement of about 301%.
Imbalance-aware training strategies are required for effective multi-class LLM family attribution.
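One way to see why accuracy and macro F1 diverge so sharply is to score a classifier that always predicts the majority class. The sketch below assumes 11 classes (human plus the 10 LLM families) and an 88.4% human share; the sample size, the even split among minority classes, and the helper `macro_f1` are illustrative assumptions, not details from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Assumed setup: label 0 is human (88.4%), labels 1..10 are LLM families.
n = 1000
y_true = [0] * 884 + [1 + (i % 10) for i in range(116)]
y_pred = [0] * n  # degenerate classifier: always predicts "human"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
score = macro_f1(y_true, y_pred, labels=range(11))

# Relative gain from the two reported macro F1 values.
relative_gain = (0.345 - 0.086) / 0.086
```

Under these assumptions the always-human predictor reaches 88.4% accuracy and a macro F1 of about 0.085, close to the reported 0.086. That is consistent with, though not proof of, the baseline rarely predicting any minority class. The relative gain evaluates to roughly 3.01, matching the stated +301%.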

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to detecting code from newer LLMs by maintaining invariance to generator specifics.
Similar imbalance issues could affect other tasks involving attribution to multiple sources in code or text analysis.
Testing on additional programming languages beyond the SemEval set would further validate the generalization claims.
Integrating the framework with other code models might yield even stronger invariant features.

Load-bearing premise

That the multi-view techniques truly create generator-invariant representations, and that the performance gains do not instead arise from the base UniXcoder model or the particular distribution of the SemEval dataset.

What would settle it

Running the trained model on code samples generated by an LLM family or in a programming language not present in the training or test data of the SemEval task.

Figures

Figures reproduced from arXiv: 2604.26990 by Kargi Chauhan, Sadiba Nusrat Nur.

**Figure 1.** Subtask A architecture. Three views of each input are processed through a shared UniXcoder encoder;

**Figure 2.** Subtask B architecture. A standard CodeBERT encoder with softmax classification head serves as a

**Figure 3.** Subtask A training dynamics. Validation loss

**Figure 4.** Empirical t-SNE projections (perplexity=30,

**Figure 5.** Representative misclassifications. Both errors

Original abstract

With the rapid growth of large language models for code generation, distinguishing between human-written and AI-generated code has become increasingly critical for academic integrity, hiring evaluations, and software security. We present our system for SemEval-2026 Task 13: Multilingual Machine-Generated Code Detection, participating in Subtask A (binary detection) and Subtask B (multi-class attribution across 10 LLM families). For Subtask A, we fine-tune UniXcoder-base with a multi-view training framework that promotes generator-invariant representations. The framework combines domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation. Our system achieves 0.993 macro F1 on validation and 0.845 macro F1 on the test set, which spans unseen languages and domains. For Subtask B, we show that severe class imbalance (88.4% human code, 221:1 majority-to-minority ratio) causes catastrophic minority-class failure under standard fine-tuning, with macro F1 collapsing to 0.086 despite 88.4% accuracy. A class-weighted extension trained for 3 epochs recovers macro F1 to 0.345 (+301% relative), confirming that multi-class attribution requires imbalance-aware training strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Shared-task report that usefully shows class imbalance wrecking multi-class code attribution but provides no ablations to back its multi-view generalization claims.


This paper is a participation report for SemEval-2026 Task 13 on detecting machine-generated code. For binary detection, the authors fine-tune UniXcoder-base with structural prefixes, delexicalization plus symmetric KL loss, token dropout, and mixed-content augmentation, reporting 0.993 macro F1 on validation and 0.845 on a test set of unseen languages and domains. For multi-class attribution across ten LLM families, they document how severe imbalance (221:1) makes standard fine-tuning collapse to 0.086 macro F1 despite 88.4% accuracy, then show that class weighting lifts it to 0.345 after three epochs.

Referee Report

1 major / 2 minor

Summary. The paper presents the UCSC-NLP system for SemEval-2026 Task 13 on multilingual machine-generated code detection. For Subtask A (binary detection), it fine-tunes UniXcoder-base with a multi-view framework using domain-specific structural prefixes, delexicalization plus symmetric KL consistency loss, token dropout, and mixed-content augmentation, reporting 0.993 macro F1 on validation and 0.845 macro F1 on a test set covering unseen languages and domains. For Subtask B (multi-class attribution to 10 LLM families), it documents catastrophic minority-class failure under standard fine-tuning due to 221:1 class imbalance (macro F1 0.086 despite 88.4% accuracy) and shows that a class-weighted variant trained for 3 epochs raises macro F1 to 0.345.

Significance. If the multi-view components can be shown to produce generator-invariant representations, the work would offer a concrete, reproducible recipe for improving cross-domain generalization in code provenance detection, with direct relevance to academic integrity and software security applications. The explicit before-after comparison for class weighting in Subtask B is a clear strength, as is the reporting of concrete macro F1 numbers on both validation and a challenging test split.

major comments (1)

Abstract and system description: the central claim that the combination of structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation produces generator-invariant representations (and thereby drives the 0.845 test macro F1 on unseen languages/domains) is load-bearing but unsupported by any baseline comparison to plain fine-tuning of the same UniXcoder-base model or by leave-one-component-out ablations. Without these controls it is impossible to determine whether the reported gains arise from the proposed multi-view framework or from the base model and the particular SemEval data distribution.

minor comments (2)

Abstract: the description of 'domain-specific structural prefixes' is too terse; a brief example or definition of the prefix construction would improve reproducibility.
Subtask B section: while the 221:1 imbalance ratio and the +301% relative macro-F1 improvement are stated, the exact method used to compute class weights and the modified loss function are not specified, which limits the diagnostic value of the analysis.
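Since the weighting scheme is not specified, the following sketch shows one common choice only: "balanced" inverse-frequency weights, w_c = N / (K * n_c), the same formula scikit-learn uses for `class_weight="balanced"`, applied to a per-example cross-entropy. It should not be read as the authors' method; the function names and the two-class toy data are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, target, weights):
    """Loss for one example: -w[target] * log(p[target])."""
    return -weights[target] * math.log(probs[target])


# Toy data mirroring the reported 221:1 majority-to-minority ratio.
labels = ["human"] * 221 + ["rare_llm"]
weights = balanced_class_weights(labels)

# The same predicted probability costs far more when the true class is rare.
loss_major = weighted_cross_entropy({"human": 0.5, "rare_llm": 0.5}, "human", weights)
loss_minor = weighted_cross_entropy({"human": 0.5, "rare_llm": 0.5}, "rare_llm", weights)
```

With these weights, an error on the rare class contributes 221 times as much loss as the same error on the majority class, which is what counteracts the pull toward always predicting "human".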

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the value of our explicit before-after analysis of class weighting in Subtask B. We address the single major comment below and will revise the manuscript accordingly.


Referee: Abstract and system description: the central claim that the combination of structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation produces generator-invariant representations (and thereby drives the 0.845 test macro F1 on unseen languages/domains) is load-bearing but unsupported by any baseline comparison to plain fine-tuning of the same UniXcoder-base model or by leave-one-component-out ablations. Without these controls it is impossible to determine whether the reported gains arise from the proposed multi-view framework or from the base model and the particular SemEval data distribution.

Authors: We agree that the manuscript would be strengthened by explicit baseline comparisons and ablations. The current version presents the full multi-view system and its results on the challenging test set of unseen languages and domains but does not include a plain fine-tuning baseline of UniXcoder-base or leave-one-component-out experiments. In the revised manuscript we will add both: (1) performance of a standard fine-tuned UniXcoder-base model (no structural prefixes, no delexicalization+symmetric KL consistency loss, no token dropout, no mixed-content augmentation) on the identical validation and test splits, and (2) leave-one-component-out ablations that remove each element in turn while keeping the others. These additions will allow readers to isolate the contribution of the multi-view framework to the reported 0.845 macro F1.

revision: yes
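The promised experiment grid can be enumerated mechanically. The sketch below is only an illustration of that design: the component keys and run names are hypothetical labels for the four elements named in the rebuttal, and no training code is implied.

```python
# Hypothetical flag names for the four components named in the rebuttal.
COMPONENTS = (
    "structural_prefixes",
    "delex_symmetric_kl",
    "token_dropout",
    "mixed_content_augmentation",
)


def ablation_configs(components=COMPONENTS):
    """Return the full system, a plain fine-tuning baseline, and one run per removed component."""
    configs = {
        "full": {c: True for c in components},
        "plain_finetune": {c: False for c in components},
    }
    for dropped in components:
        configs[f"no_{dropped}"] = {c: c != dropped for c in components}
    return configs


configs = ablation_configs()
for name, flags in configs.items():
    enabled = [c for c, on in flags.items() if on]
    print(f"{name}: {len(enabled)} of {len(flags)} components enabled")
```

This yields six runs in total, each of which would need to be evaluated on the identical validation and test splits for the comparison to isolate the framework's contribution.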

Circularity Check

0 steps flagged

No circularity: purely empirical shared-task system report


The paper is a participation report for SemEval-2026 Task 13 describing a fine-tuned UniXcoder model with added training components (structural prefixes, delexicalization + symmetric KL loss, token dropout, mixed-content augmentation) and reporting macro F1 scores on validation (0.993) and test (0.845) sets. Subtask B discusses observed class imbalance effects and a weighted-training fix. No equations, derivations, or first-principles claims exist; results are presented as direct experimental outcomes on the task data. No self-citations, fitted parameters renamed as predictions, or self-definitional reductions are present. The generalization narrative rests on empirical test-set performance rather than any closed mathematical loop.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard supervised-learning assumptions about data representativeness and the effectiveness of the listed training tricks; no new entities are postulated and the only free parameters are typical training hyperparameters whose exact values are not reported.

free parameters (2)

class weights
Chosen to counter the 221:1 imbalance; specific values and selection procedure not stated in the abstract.
training hyperparameters
Learning rate, batch size, and exact loss coefficients for the 3-epoch class-weighted run.

axioms (1)

domain assumption: The SemEval training distribution is representative of real-world human and LLM-generated code across languages and domains.
Invoked to support generalization claims to the unseen test set.

