KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

Archie Sage; Salvatore Greco

arxiv: 2603.06552 · v2 · submitted 2026-03-06 · 💻 cs.CL

Pith reviewed 2026-05-15 14:48 UTC · model grok-4.3

keywords political discourse · evasion detection · ambiguity classification · encoder models · zero-shot learning · SemEval shared task · RoBERTa · GPT

The pith

RoBERTa-large leads on public tests for political evasion detection while zero-shot GPT-5.2 generalizes better to hidden sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates two ways to classify clarity and evasion in political discourse for a SemEval shared task. One approach predicts the clarity label directly. The other predicts an evasion label first and then derives the clarity label using the task's taxonomy hierarchy. Encoder models were fine-tuned on the data while decoder models were tested zero-shot. The two formulations produced similar overall scores. RoBERTa-large gave the best public test results among encoders, but zero-shot GPT-5.2 performed better on the hidden evaluation set.

Core claim

In the CLARITY shared task on classifying ambiguity and evasion techniques in political discourse, two modelling formulations were compared: direct prediction of clarity labels and prediction of evasion labels followed by derivation of clarity labels through the task taxonomy hierarchy. Encoder-based models were trained on the provided data while decoder-only models were evaluated zero-shot under the evasion-first formulation. The formulations yield comparable performance overall. Among encoder models RoBERTa-large achieves the strongest results on the public test set while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
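The zero-shot, evasion-first setup could look roughly like the sketch below. The prompt wording, the label names, and the helpers `build_prompt` and `parse_evasion_label` are all hypothetical: the material here gives neither the actual prompt nor the label inventory, and no model API is called.

```python
# Hypothetical evasion labels, used only for illustration; the real inventory
# comes from the task organizers' taxonomy.
EVASION_LABELS = ["Explicit", "Dodging", "Deflection", "Declining to answer"]


def build_prompt(question: str, answer: str, labels=EVASION_LABELS) -> str:
    """Assemble a zero-shot classification prompt (wording is an assumption)."""
    options = "\n".join(f"- {label}" for label in labels)
    return (
        "Classify how the answer responds to the question.\n"
        f"Choose exactly one label from:\n{options}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Label:"
    )


def parse_evasion_label(response: str, labels=EVASION_LABELS):
    """Return the single label mentioned in the response, or None if ambiguous."""
    text = response.lower()
    found = [label for label in labels if label.lower() in text]
    return found[0] if len(found) == 1 else None
```

Returning `None` for responses that name zero or several labels keeps unparseable outputs visible instead of silently assigning them a class; how such cases were handled in the actual system is not stated.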

What carries the argument

Two modelling formulations (direct clarity prediction versus evasion-first prediction with taxonomy-derived clarity) applied to fine-tuned encoder models and zero-shot decoder models.
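A minimal sketch of the evasion-first formulation follows: a model predicts a fine-grained evasion label, and a fixed lookup derives the coarse clarity label. The label names and the mapping are illustrative placeholders, not taken from the text above; the authoritative hierarchy is the one published by the task organizers.

```python
# Illustrative mapping from evasion labels to clarity labels.
# ASSUMPTION: these names stand in for the official task taxonomy.
EVASION_TO_CLARITY = {
    "Explicit": "Clear Reply",
    "Implicit": "Ambivalent",
    "Dodging": "Ambivalent",
    "Deflection": "Ambivalent",
    "Declining to answer": "Clear Non-Reply",
    "Claims ignorance": "Clear Non-Reply",
}


def derive_clarity(evasion_label: str) -> str:
    """Derive the clarity label implied by a predicted evasion label."""
    try:
        return EVASION_TO_CLARITY[evasion_label]
    except KeyError:
        raise ValueError(f"unknown evasion label: {evasion_label!r}") from None


# Usage: derive clarity for a batch of evasion predictions.
predicted_evasion = ["Explicit", "Dodging", "Declining to answer"]
derived_clarity = [derive_clarity(label) for label in predicted_evasion]
```

Because the lookup is many-to-one, some evasion mistakes leave the derived clarity label unchanged, which is one reason the two formulations can end up with similar clarity scores.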

If this is right

Encoder models reach competitive scores on public test sets for this evasion detection task.
Zero-shot decoder models can outperform fine-tuned encoders on the hidden evaluation set, suggesting better robustness when evaluation data differs from the training distribution.
Direct and derived clarity formulations produce roughly equivalent end-to-end results.
Auxiliary training variants were tested but did not change the main ranking between RoBERTa-large and GPT-5.2.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems intended for live political discourse monitoring may prefer zero-shot models to handle topic shifts that public test sets do not capture.
The taxonomy derivation step could be replaced by joint multi-task training if the hierarchy proves brittle on new domains.
Combining the two formulations into an ensemble might improve robustness across both public and hidden evaluation regimes.

Load-bearing premise

The task taxonomy hierarchy reliably converts evasion predictions into correct clarity labels, and zero-shot models generalize to hidden data without any exposure to public test signals.

What would settle it

A side-by-side check on a new annotated set where clarity labels derived from evasion predictions are compared directly to human clarity annotations, or a measurable drop in zero-shot accuracy on the hidden set when the public test distribution is deliberately altered.
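The first check could be run along these lines: compare clarity labels derived from evasion predictions against human clarity annotations and tally where they disagree. The function name `derivation_agreement` and the example labels are hypothetical.

```python
from collections import Counter


def derivation_agreement(derived, gold):
    """Return (agreement rate, Counter of (gold, derived) disagreements)."""
    if len(derived) != len(gold):
        raise ValueError("derived and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    matches = sum(d == g for d, g in zip(derived, gold))
    # Count each disagreement as a (human label, derived label) pair.
    confusions = Counter((g, d) for d, g in zip(derived, gold) if d != g)
    return matches / len(gold), confusions


# Usage with made-up labels.
rate, confusions = derivation_agreement(
    derived=["Clear Reply", "Ambivalent", "Ambivalent", "Clear Non-Reply"],
    gold=["Clear Reply", "Ambivalent", "Clear Reply", "Clear Non-Reply"],
)
```

The disagreement counts show which clarity classes the derivation step confuses, which is more informative than the agreement rate alone.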

Original abstract

This paper describes the KCLarity team's participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Straightforward SemEval report using RoBERTa and zero-shot GPT on evasion detection, with comparable results from two formulations but no training details or taxonomy validation.

This is a standard shared-task writeup from the KCLarity team on SemEval-2026 Task 6. They compare direct clarity prediction against predicting evasion first and deriving clarity labels through the task taxonomy, plus some encoder runs and zero-shot GPT-5.2 prompting. RoBERTa-large leads on the public test set among encoders, while the zero-shot model holds up better on the hidden set, and the two formulations come out roughly even overall.

Referee Report

2 major / 1 minor

Summary. The paper describes the KCLarity team's participation in SemEval-2026 Task 6 (CLARITY) on classifying ambiguity and evasion techniques in political discourse. It compares two formulations: (i) direct prediction of clarity labels and (ii) prediction of evasion labels followed by derivation of clarity labels via the task taxonomy hierarchy. Encoder models (including RoBERTa-large) and zero-shot decoder-only models (GPT-5.2) are evaluated, with the claim that the formulations yield comparable performance, RoBERTa-large is strongest on the public test set among encoders, and zero-shot GPT-5.2 generalizes better on the hidden evaluation set.

Significance. If the empirical comparisons hold after verification, the work provides useful evidence on the relative strengths of fine-tuned encoder models versus zero-shot LLMs for political evasion detection, and on the viability of deriving clarity labels from evasion predictions. This could inform task design in future shared tasks on discourse analysis by showing flexibility between direct and hierarchical formulations.

major comments (2)

[Methods and Experimental Setup] No details are provided on training procedures, data splits, hyperparameter choices, or statistical significance testing for the reported performance numbers. This directly limits verification of the central claims that RoBERTa-large achieves the strongest results among encoders on the public test set and that the two formulations are comparable.
[Task Formulations] The comparison between direct clarity prediction and derivation from evasion predictions rests on the assumption that the task taxonomy hierarchy reliably maps evasion labels to clarity labels. No inter-annotator agreement metrics, validation on held-out data, or error-propagation analysis for this derivation are reported, even though that assumption is load-bearing for interpreting whether the derived results measure the same construct.
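One way to supply the significance testing requested in the first comment is a paired bootstrap over test examples, sketched below. Macro F1 is used as the metric purely as an assumption, since the metric is not stated here; `macro_f1` and `paired_bootstrap` are hypothetical helper names, and nothing below reflects the authors' actual procedure.

```python
import random


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    total = 0.0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)


def paired_bootstrap(gold, pred_a, pred_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A strictly beats system B."""
    rng = random.Random(seed)
    n = len(gold)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if macro_f1(g, a) > macro_f1(g, b):
            wins += 1
    return wins / n_resamples
```

A win fraction near 0.5 would be consistent with the two systems being comparable, while a fraction near 1.0 or 0.0 would indicate a stable ranking on that test set.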

minor comments (1)

[Abstract and Results] The abstract and results tables could explicitly state the exact metrics (e.g., F1, accuracy) used for the public vs. hidden sets to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our participation in SemEval-2026 Task 6. The comments help strengthen the paper's methodological transparency and the interpretation of our results. We provide point-by-point responses below and will make revisions as indicated.

Referee: [Methods and Experimental Setup] No details are provided on training procedures, data splits, hyperparameter choices, or statistical significance testing for the reported performance numbers. This directly limits verification of the central claims that RoBERTa-large achieves the strongest results among encoders on the public test set and that the two formulations are comparable.

Authors: We fully agree with this assessment. The original manuscript omitted these details due to space constraints typical of shared task papers, but we recognize their importance. In the revised manuscript, we will add a detailed description of the training procedures for all models, including the specific hyperparameters used for RoBERTa-large and other encoders (e.g., learning rate of 2e-5, batch size 16, 3 epochs), the data split ratios, and results of statistical significance tests comparing the models. This will enable verification of the claims regarding RoBERTa-large's performance and the comparability of formulations.

revision: yes

Referee: [Task Formulations] The comparison between direct clarity prediction and derivation from evasion predictions rests on the assumption that the task taxonomy hierarchy reliably maps evasion labels to clarity labels. No inter-annotator agreement metrics, validation on held-out data, or error-propagation analysis for this derivation are reported, even though that assumption is load-bearing for interpreting whether the derived results measure the same construct.

Authors: The mapping is based directly on the official task taxonomy provided by the organizers, which defines how clarity labels are derived from evasion labels. We will revise the Task Formulations section to explicitly state this and include a brief validation by checking consistency on a sample of the public data. We will also add an error-propagation analysis showing the impact of mispredicted evasion labels on derived clarity scores. Inter-annotator agreement is not reported in our paper as it pertains to the dataset creation by the task organizers; we will cite the task description paper for this information.

revision: yes
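The error-propagation analysis promised in the second response could take roughly the following form: count evasion mispredictions and separate those that change the derived clarity label from those the many-to-one mapping absorbs. The mapping, label names, and `propagation_summary` are hypothetical stand-ins for the official taxonomy and the authors' eventual analysis.

```python
# ASSUMPTION: placeholder taxonomy; the official mapping comes from the task.
TOY_MAPPING = {
    "Explicit": "Clear Reply",
    "Dodging": "Ambivalent",
    "Deflection": "Ambivalent",
    "Declining to answer": "Clear Non-Reply",
}


def propagation_summary(gold_evasion, pred_evasion, mapping=TOY_MAPPING):
    """Split evasion errors into those that do or do not alter clarity."""
    evasion_errors = 0
    propagated = 0
    for gold, pred in zip(gold_evasion, pred_evasion):
        if gold == pred:
            continue
        evasion_errors += 1
        # The error propagates only if the two labels map to different clarity.
        if mapping[gold] != mapping[pred]:
            propagated += 1
    return {
        "evasion_errors": evasion_errors,
        "propagated": propagated,
        "absorbed": evasion_errors - propagated,
    }


# Usage with made-up predictions.
summary = propagation_summary(
    ["Explicit", "Dodging", "Dodging", "Declining to answer"],
    ["Explicit", "Deflection", "Explicit", "Dodging"],
)
```

A high share of absorbed errors would mean the clarity score is fairly insensitive to evasion mistakes, whereas a high share of propagated errors would mean the derivation step amplifies them.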

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on shared-task benchmark

The paper reports direct performance numbers from training encoder models (RoBERTa-large etc.) and running zero-shot decoder models on the SemEval-2026 CLARITY task data. The two formulations (direct clarity prediction vs. evasion prediction followed by taxonomy-derived clarity) are evaluated side-by-side on public and hidden test sets; no equations, fitted parameters, or self-citations are used to derive the reported scores. The taxonomy mapping is an input from the shared task definition, not constructed inside the paper. All claims rest on external benchmark evaluation rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical system-description paper with no mathematical derivations, free parameters, axioms, or invented entities beyond routine machine-learning training practices.

pith-pipeline@v0.9.0 · 5410 in / 949 out tokens · 33068 ms · 2026-05-15T14:48:57.140005+00:00 · methodology
