UCSC-NLP at SemEval-2026 Task 13: Multi-View Generalization and Diagnostic Analysis of Machine-Generated Code Detection
Pith reviewed 2026-05-07 15:45 UTC · model grok-4.3
The pith
A multi-view training framework on UniXcoder detects machine-generated code across unseen languages and domains while class-weighted training mitigates imbalance in multi-class attribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We fine-tune UniXcoder-base with a multi-view training framework that promotes generator-invariant representations. The framework combines domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation. Our system achieves 0.993 macro F1 on validation and 0.845 macro F1 on the test set, which spans unseen languages and domains. For Subtask B, we show that severe class imbalance causes catastrophic minority-class failure under standard fine-tuning, with macro F1 collapsing to 0.086 despite 88.4% accuracy. A class-weighted extension trained for 3 epochs recovers macro F1 to 0.345.
What carries the argument
multi-view training framework combining domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation to produce generator-invariant representations
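As a rough illustration of the consistency term named above, the sketch below computes a symmetric KL divergence between the class distributions predicted for two views of the same snippet (for example, the original and its delexicalized form). The 0.5 averaging factor, the function names, and the use of raw logits as inputs are assumptions made for illustration; the source does not give the exact formulation or its weight in the total loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(logits_a, logits_b):
    """Average of KL(p || q) and KL(q || p); the 0.5 factor is an assumption."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


# Hypothetical logits for the original view and the delexicalized view.
original_view = [2.0, -1.0]
delexicalized_view = [0.5, 0.3]
consistency_penalty = symmetric_kl(original_view, delexicalized_view)
```

The penalty is zero when both views yield the same distribution and grows as they disagree, which is the property a consistency loss relies on.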
If this is right
- The multi-view framework generalizes binary detection to unseen languages and domains.
- Under standard fine-tuning, severe class imbalance drives minority-class F1 toward zero in multi-class attribution, so macro F1 collapses to 0.086 despite 88.4% accuracy.
- Class-weighted training raises macro F1 from 0.086 to 0.345, a relative improvement of about 301%.
- Imbalance-aware training strategies are required for effective multi-class LLM family attribution.
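The gap between accuracy and macro F1 in these points can be checked with a few lines of arithmetic. The sketch below assumes 11 classes (human plus the 10 LLM families) and a degenerate classifier that labels every sample as human. Both are illustrative assumptions and not the paper's reported confusion matrix. It also recomputes the relative gain from the two macro F1 figures quoted in the source.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumption: 11 classes = human + 10 LLM families.
NUM_CLASSES = 11
# From the source: 88.4% of the Subtask B data is human-written code.
HUMAN_SHARE = 0.884

# A classifier that always predicts "human" is right on every human sample.
accuracy = HUMAN_SHARE
# For the human class: precision equals the human share, recall is 1.
human_f1 = f1(precision=HUMAN_SHARE, recall=1.0)
# Every other class is never predicted, so its F1 is 0.
macro_f1 = (human_f1 + 0.0 * (NUM_CLASSES - 1)) / NUM_CLASSES


def relative_gain(before, after):
    """Relative improvement in percent."""
    return (after - before) / before * 100


print(round(accuracy, 3), round(macro_f1, 3))  # 0.884 0.085
print(round(relative_gain(0.086, 0.345)))      # 301
```

Under these assumptions the all-human predictor lands close to the reported 0.086 macro F1, which is consistent with the reading that standard fine-tuning largely ignored the minority classes.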
Where Pith is reading between the lines
- The approach may extend to detecting code from newer LLMs by maintaining invariance to generator specifics.
- Similar imbalance issues could affect other tasks involving attribution to multiple sources in code or text analysis.
- Testing on additional programming languages beyond the SemEval set would further validate the generalization claims.
- Integrating the framework with other code models might yield even stronger invariant features.
Load-bearing premise
That the multi-view techniques genuinely produce generator-invariant representations, and that the performance gains do not simply come from the base UniXcoder model or from the particular distribution of the SemEval dataset.
What would settle it
Running the trained model on code samples generated by an LLM family or in a programming language not present in the training or test data of the SemEval task.
Original abstract
With the rapid growth of large language models for code generation, distinguishing between human-written and AI-generated code has become increasingly critical for academic integrity, hiring evaluations, and software security. We present our system for SemEval-2026 Task 13: Multilingual Machine-Generated Code Detection, participating in Subtask A (binary detection) and Subtask B (multi-class attribution across 10 LLM families). For Subtask A, we fine-tune UniXcoder-base with a multi-view training framework that promotes generator-invariant representations. The framework combines domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation. Our system achieves 0.993 macro F1 on validation and 0.845 macro F1 on the test set, which spans unseen languages and domains. For Subtask B, we show that severe class imbalance (88.4% human code, 221:1 majority-to-minority ratio) causes catastrophic minority-class failure under standard fine-tuning, with macro F1 collapsing to 0.086 despite 88.4% accuracy. A class-weighted extension trained for 3 epochs recovers macro F1 to 0.345 (+301% relative), confirming that multi-class attribution requires imbalance-aware training strategies.
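Two of the views named in the abstract, delexicalization and token dropout, can be sketched in a few lines. The tokenizer regex, the `ID_n` placeholder scheme, the use of Python's keyword list, and the choice to mask dropped tokens with `<mask>` instead of deleting them are all assumptions made for illustration. The abstract does not describe how the authors implement either step.

```python
import keyword
import random
import re

# Hypothetical tokenizer: identifiers, integer runs, or any single non-space char.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def delexicalize(code):
    """Replace each distinct non-keyword identifier with a stable placeholder."""
    mapping = {}
    out = []
    for tok in TOKEN_RE.findall(code):
        if IDENT_RE.fullmatch(tok) and not keyword.iskeyword(tok):
            if tok not in mapping:
                mapping[tok] = f"ID_{len(mapping)}"
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out


def token_dropout(tokens, p, rng, mask="<mask>"):
    """Independently replace each token with a mask symbol with probability p."""
    return [mask if rng.random() < p else tok for tok in tokens]


tokens = delexicalize("def add(a, b): return a + b")
noisy = token_dropout(tokens, p=0.15, rng=random.Random(0))
```

Delexicalization removes naming style while keeping structure, and dropout perturbs surface tokens; both give the model a second view of the same snippet.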
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the UCSC-NLP system for SemEval-2026 Task 13 on multilingual machine-generated code detection. For Subtask A (binary detection), it fine-tunes UniXcoder-base with a multi-view framework using domain-specific structural prefixes, delexicalization plus symmetric KL consistency loss, token dropout, and mixed-content augmentation, reporting 0.993 macro F1 on validation and 0.845 macro F1 on a test set covering unseen languages and domains. For Subtask B (multi-class attribution to 10 LLM families), it documents catastrophic minority-class failure under standard fine-tuning due to 221:1 class imbalance (macro F1 0.086 despite 88.4% accuracy) and shows that a class-weighted variant trained for 3 epochs raises macro F1 to 0.345.
Significance. If the multi-view components can be shown to produce generator-invariant representations, the work would offer a concrete, reproducible recipe for improving cross-domain generalization in code provenance detection, with direct relevance to academic integrity and software security applications. The explicit before-after comparison for class weighting in Subtask B is a clear strength, as is the reporting of concrete macro F1 numbers on both validation and a challenging test split.
major comments (1)
- Abstract and system description: the central claim that the combination of structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation produces generator-invariant representations (and thereby drives the 0.845 test macro F1 on unseen languages/domains) is load-bearing but unsupported by any baseline comparison to plain fine-tuning of the same UniXcoder-base model or by leave-one-component-out ablations. Without these controls it is impossible to determine whether the reported gains arise from the proposed multi-view framework or from the base model and the particular SemEval data distribution.
minor comments (2)
- Abstract: the description of 'domain-specific structural prefixes' is too terse; a brief example or definition of the prefix construction would improve reproducibility.
- Subtask B section: while the 221:1 imbalance ratio and the +301% relative macro-F1 improvement are stated, the exact method used to compute class weights and the modified loss function are not specified, which limits the diagnostic value of the analysis.
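To make the second minor comment concrete, the sketch below shows one common way to derive class weights and apply them in a cross-entropy loss. The inverse-frequency heuristic `n / (k * n_c)` and both function names are assumptions for illustration; the paper, as the comment notes, does not specify which scheme was used.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_z)


# Toy two-class data mirroring the 221:1 majority-to-minority ratio.
labels = [0] * 221 + [1]
weights = balanced_class_weights(labels)
minority_loss = weighted_cross_entropy([0.0, 0.0], target=1, weights=weights)
majority_loss = weighted_cross_entropy([0.0, 0.0], target=0, weights=weights)
```

With this scheme a mistake on the minority class costs 221 times as much as the same mistake on the majority class, which counteracts the incentive to predict the majority label everywhere.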
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the value of our explicit before-after analysis of class weighting in Subtask B. We address the single major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: Abstract and system description: the central claim that the combination of structural prefixes, delexicalization with symmetric KL consistency loss, token dropout, and mixed-content augmentation produces generator-invariant representations (and thereby drives the 0.845 test macro F1 on unseen languages/domains) is load-bearing but unsupported by any baseline comparison to plain fine-tuning of the same UniXcoder-base model or by leave-one-component-out ablations. Without these controls it is impossible to determine whether the reported gains arise from the proposed multi-view framework or from the base model and the particular SemEval data distribution.
Authors: We agree that the manuscript would be strengthened by explicit baseline comparisons and ablations. The current version presents the full multi-view system and its results on the challenging test set of unseen languages and domains but does not include a plain fine-tuning baseline of UniXcoder-base or leave-one-component-out experiments. In the revised manuscript we will add both: (1) performance of a standard fine-tuned UniXcoder-base model (no structural prefixes, no delexicalization+symmetric KL consistency loss, no token dropout, no mixed-content augmentation) on the identical validation and test splits, and (2) leave-one-component-out ablations that remove each element in turn while keeping the others. These additions will allow readers to isolate the contribution of the multi-view framework to the reported 0.845 macro F1.
Revision: yes
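The experiment grid promised in this response can be enumerated mechanically. The sketch below builds the full system, the plain fine-tuning baseline, and one leave-one-component-out run per component. The component keys and run names are hypothetical labels, not identifiers from the authors' code.

```python
# Hypothetical flag names for the four multi-view components.
COMPONENTS = (
    "structural_prefixes",
    "delexicalization_kl",
    "token_dropout",
    "mixed_content_augmentation",
)


def ablation_configs(components=COMPONENTS):
    """Return run name -> {component: enabled} for all planned runs."""
    configs = {
        "full": {c: True for c in components},
        "plain_finetune": {c: False for c in components},
    }
    # Leave-one-component-out: disable exactly one component per run.
    for removed in components:
        configs[f"no_{removed}"] = {c: c != removed for c in components}
    return configs


runs = ablation_configs()
for name, flags in runs.items():
    enabled = [c for c, on in flags.items() if on]
    print(name, enabled)
```

This yields six runs in total, each of which would be evaluated on the identical validation and test splits so that differences in macro F1 can be attributed to a single component.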
Circularity Check
No circularity: purely empirical shared-task system report
The paper is a participation report for SemEval-2026 Task 13 describing a fine-tuned UniXcoder model with added training components (structural prefixes, delexicalization + symmetric KL loss, token dropout, mixed-content augmentation) and reporting macro F1 scores on validation (0.993) and test (0.845) sets. Subtask B discusses observed class imbalance effects and a weighted-training fix. No equations, derivations, or first-principles claims exist; results are presented as direct experimental outcomes on the task data. No self-citations, fitted parameters renamed as predictions, or self-definitional reductions are present. The generalization narrative rests on empirical test-set performance rather than any closed mathematical loop.
Axiom & Free-Parameter Ledger
free parameters (2)
- class weights
- training hyperparameters
axioms (1)
- domain assumption: The SemEval training distribution is representative of real-world human and LLM-generated code across languages and domains.