pith. machine review for the scientific record.

arxiv: 2604.20168 · v1 · submitted 2026-04-22 · 💻 cs.CL


Duluth at SemEval-2026 Task 6: DeBERTa with LLM-Augmented Data for Unmasking Political Question Evasions


Pith reviewed 2026-05-10 00:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords DeBERTa · LLM data augmentation · political question evasion · SemEval-2026 Task 6 · text classification · class imbalance · discourse analysis · focal loss

The pith

DeBERTa-V3-base with LLM-generated examples for minority classes reaches 0.76 Macro F1 on political question evasion classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt a transformer model for two tasks that label U.S. presidential interview answers according to a two-level clarity taxonomy. It combines DeBERTa-V3-base with focal loss, layer-wise learning-rate decay, and simple discourse indicators, then uses synthetic examples from Gemini 3 and Claude Sonnet 4.5 to enlarge the smaller response categories. The resulting system records 0.76 Macro F1 on the first task, eighth among forty entries, and the authors note that the added data lifts recall on the underrepresented labels. Error patterns remain concentrated on the boundary between ambivalent and fully clear replies, the same boundary that human annotators also find difficult. The central demonstration is therefore that targeted LLM augmentation can measurably help a standard classifier handle the skewed distribution typical of nuanced political discourse.

Core claim

A DeBERTa-V3-base model extended with focal loss, layer-wise learning-rate decay, and boolean discourse features, when further trained on minority-class examples synthesized by Gemini 3 and Claude Sonnet 4.5, attains a Macro F1 of 0.76 on the clarity-level classification task and improves recall on the rarer evasion categories; the dominant remaining errors are confusions between Ambivalent and Clear Reply responses that mirror human annotator disagreement.

What carries the argument

DeBERTa-V3-base classifier augmented by focal loss, layer-wise learning-rate decay, boolean discourse features, and synthetic minority-class examples produced by Gemini 3 and Claude Sonnet 4.5.
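Of these components, focal loss is the most self-contained to illustrate. The paper's abstract does not report its γ or α settings, so the defaults below come from the original focal-loss formulation and are illustrative only; this is a minimal sketch of the scalar loss, not the system's training code.

```python
import math

def focal_loss(p_t: float, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for a single example, where p_t is the model's predicted
    probability for the true class. The (1 - p_t) ** gamma factor shrinks the
    loss on easy, well-classified examples, so gradient signal concentrates
    on hard cases -- typically the minority classes in a skewed label set."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 this reduces to alpha-scaled cross-entropy; raising gamma suppresses the contribution of confident predictions far faster than that of uncertain ones, which is why it pairs naturally with the class-imbalance problem the paper targets.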

If this is right

  • LLM augmentation raises minority-class recall on the clarity-level task without changing the overall error distribution.
  • The same confusion between Ambivalent and Clear Reply responses dominates both model mistakes and human disagreements.
  • The approach yields an 8th-place result out of 40 teams on the official evaluation set, where the mean participant score is 0.70.
  • Discourse-level boolean features and focal loss are retained as stable components alongside the data-augmentation step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation recipe could be tested on other imbalanced discourse-classification problems such as detecting hedging in scientific abstracts or distinguishing evasion in legislative hearings.
  • If the boundary between ambivalent and clear replies remains the chief source of error, future work might replace the current binary discourse features with richer context windows or chain-of-thought prompting.
  • The performance gap to the top system (0.89) suggests that further gains may come from larger base models or ensemble methods rather than from additional synthetic data alone.

Load-bearing premise

The synthetic examples produced by the two large language models faithfully capture the intended minority evasion types and do not add noise or systematic bias that would hurt generalization.

What would settle it

Retraining the identical DeBERTa configuration on the original unaugmented data and measuring whether minority-class recall rises, stays flat, or falls; or inspecting whether the added examples contain detectable artifacts that correlate with the model’s remaining errors.
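The first proposed check reduces to comparing per-class recall between the augmented and unaugmented runs. A minimal sketch of that comparison, on toy label lists rather than the paper's data (the class names and predictions here are illustrative):

```python
from collections import Counter

def per_class_recall(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Recall for each label: correct predictions / gold occurrences."""
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    totals = Counter(gold)
    return {label: hits[label] / totals[label] for label in totals}

# Toy illustration: "evasive" is the minority class.
gold           = ["clear", "clear", "clear", "evasive", "evasive"]
pred_baseline  = ["clear", "clear", "clear", "clear",   "evasive"]
pred_augmented = ["clear", "clear", "clear", "evasive", "evasive"]

base = per_class_recall(gold, pred_baseline)
aug  = per_class_recall(gold, pred_augmented)
# If the augmentation genuinely helps, minority recall should rise
# while majority recall holds roughly steady.
```

The same scaffold extends directly to the artifact check: partition the remaining errors by whether their nearest training neighbors are synthetic or original examples.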

read the original abstract

This paper presents the Duluth approach to SemEval-2026 Task 6 on CLARITY: Unmasking Political Question Evasions. We address Task 1 (clarity-level classification) and Task 2 (evasion-level classification), both of which involve classifying question-answer pairs from U.S. presidential interviews using a two-level taxonomy of response clarity. Our system is based on DeBERTa-V3-base, extended with focal loss, layer-wise learning rate decay, and boolean discourse features. To address class imbalance in the training data, we augment minority classes using synthetic examples generated by Gemini 3 and Claude Sonnet 4.5. Our best configuration achieved a Macro F1 of 0.76 on the Task 1 evaluation set, placing 8th out of 40 teams. The top-ranked system (TeleAI) achieved 0.89, while the mean score across participants was 0.70. Error analysis reveals that the dominant source of misclassification is confusion between Ambivalent and Clear Reply responses, a pattern that mirrors disagreements among human annotators. Our findings demonstrate that LLM-based data augmentation can meaningfully improve minority-class recall on nuanced political discourse tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript describes the Duluth system for SemEval-2026 Task 6 on classifying clarity levels (Task 1) and evasion types (Task 2) in U.S. presidential interview question-answer pairs. It employs DeBERTa-V3-base fine-tuned with focal loss, layer-wise learning rate decay, and boolean discourse features, augmented by synthetic minority-class examples generated via Gemini 3 and Claude Sonnet 4.5 to address imbalance. The best configuration reports a Macro F1 of 0.76 on the Task 1 evaluation set (8th of 40 teams), with error analysis noting dominant confusion between Ambivalent and Clear categories that aligns with human annotator disagreements. The authors conclude that LLM augmentation meaningfully improves minority-class recall on nuanced political discourse tasks.

Significance. If the augmentation's isolated contribution can be established, the work offers a practical demonstration of LLM-based data augmentation for handling class imbalance in fine-grained, pragmatically nuanced NLP classification. The reported ranking provides a useful reference point for the shared task, and the observation that model errors mirror human disagreements strengthens the ecological validity of the evaluation. The approach relies on established techniques (DeBERTa fine-tuning, focal loss) rather than introducing novel methods, so its primary value lies in the empirical application and the shared-task benchmark results.

major comments (3)
  1. [Abstract / system description] The central claim that 'LLM-based data augmentation can meaningfully improve minority-class recall' lacks supporting ablation studies that isolate the contribution of the Gemini 3 / Claude Sonnet 4.5 synthetic examples from the other modeling choices (focal loss, discourse features, layer-wise decay). Without a with/without-augmentation comparison on the same base model, the Macro F1 of 0.76 cannot be confidently attributed to the augmentation rather than to regularization effects or other hyperparameters.
  2. [Abstract / data augmentation section] No validation of the synthetic data is reported (e.g., human evaluation of fidelity to real minority-class linguistic/pragmatic features, inter-annotator agreement on generated examples, or distributional similarity metrics against the original training data). Given the taxonomy's nuance and the noted human disagreements on Ambivalent vs. Clear, unvalidated synthetics risk introducing label noise or stylistic artifacts that could inflate recall without improving generalization.
  3. [System description] The exact prompts used to generate the synthetic examples, the number of examples per minority class, and the specific focal-loss parameters and layer-wise learning-rate decay schedule are not provided. These omissions prevent assessment of reproducibility and make it impossible to determine whether the reported gains depend on carefully tuned but unreported choices.
minor comments (2)
  1. [Abstract] The abstract refers to the 'Task 1 evaluation set' without clarifying whether this is the official SemEval test set or an internal held-out split; this should be stated explicitly for clarity.
  2. [Data section] No mention of the total size of the original training set or the proportion of synthetic data added; these statistics would help contextualize the augmentation scale.
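The referee's reproducibility comment concerns, among other things, the layer-wise learning-rate decay schedule. Whatever the paper's actual values, the scheme itself is simple: each encoder layer receives the base rate multiplied by a decay factor raised to its depth from the top. A hypothetical sketch (the base rate and decay factor below are illustrative, not the paper's):

```python
def layerwise_lrs(base_lr: float, decay: float, num_layers: int) -> list[float]:
    """Per-layer learning rates, highest at the top (task-adjacent) layer.
    Layer 0 is the bottom, embedding-adjacent layer, which is updated least
    to preserve pretrained low-level features."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# e.g. DeBERTa-V3-base has 12 encoder layers
lrs = layerwise_lrs(base_lr=2e-5, decay=0.9, num_layers=12)
```

In a real fine-tuning run, these rates would be assigned via per-layer parameter groups in the optimizer; reporting `base_lr`, `decay`, and the classifier-head rate is exactly the disclosure the referee asks for.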

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the manuscript would benefit from additional analyses and details to strengthen the claims regarding LLM-based augmentation. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / system description] The central claim that 'LLM-based data augmentation can meaningfully improve minority-class recall' lacks supporting ablation studies that isolate the contribution of the Gemini 3 / Claude Sonnet 4.5 synthetic examples from the other modeling choices (focal loss, discourse features, layer-wise decay). Without a with/without-augmentation comparison on the same base model, the Macro F1 of 0.76 cannot be confidently attributed to the augmentation rather than to regularization effects or other hyperparameters.

    Authors: We acknowledge that explicit ablation studies isolating the augmentation effect would provide stronger support for the central claim. The current results demonstrate competitive performance (0.76 Macro F1, 8th of 40 teams) with error patterns mirroring human annotator disagreements, but we agree this does not fully isolate the augmentation contribution. In the revised manuscript we will add ablation experiments comparing the base DeBERTa-V3 model (with focal loss, discourse features, and layer-wise decay) against the same configuration with LLM-augmented data. revision: yes

  2. Referee: [Abstract / data augmentation section] No validation of the synthetic data is reported (e.g., human evaluation of fidelity to real minority-class linguistic/pragmatic features, inter-annotator agreement on generated examples, or distributional similarity metrics against the original training data). Given the taxonomy's nuance and the noted human disagreements on Ambivalent vs. Clear, unvalidated synthetics risk introducing label noise or stylistic artifacts that could inflate recall without improving generalization.

    Authors: We agree that validation of the synthetic examples is essential given the pragmatic nuance of the taxonomy. The revised manuscript will include a new subsection on synthetic data quality, reporting human evaluation of a random sample of generated examples (assessing label accuracy and fidelity to real minority-class features) together with distributional similarity metrics (e.g., n-gram overlap and embedding cosine similarity) between synthetic and original training instances. revision: yes

  3. Referee: [System description] The exact prompts used to generate the synthetic examples, the number of examples per minority class, and the specific focal-loss parameters and layer-wise learning-rate decay schedule are not provided. These omissions prevent assessment of reproducibility and make it impossible to determine whether the reported gains depend on carefully tuned but unreported choices.

    Authors: We will expand the system description section to include the exact prompts used with Gemini 3 and Claude Sonnet 4.5, the precise number of synthetic examples added per minority class, the focal-loss gamma value, and the full layer-wise learning-rate decay schedule (including base learning rate and decay factors per layer). These details will be presented in a new reproducibility subsection. revision: yes
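One of the distributional checks the authors promise, n-gram overlap between synthetic and original examples, can be sketched as a Jaccard measure over word bigram sets. This is a generic heuristic standing in for whatever metric the revised paper reports, not the authors' actual procedure:

```python
def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Set of word n-grams in a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard_ngram_overlap(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two texts' n-gram sets:
    0.0 = fully disjoint surface forms, 1.0 = identical."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Applied pairwise between each synthetic example and its nearest original neighbors, very high scores would flag near-duplicates (memorization risk) while very low scores would flag off-distribution synthetics; either pattern would undercut the augmentation claim.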

Circularity Check

0 steps flagged

No circularity: empirical system description with held-out evaluation

full rationale

The paper reports an empirical ML pipeline (DeBERTa-V3-base + focal loss + discourse features + LLM data augmentation) trained on provided data and scored on a held-out SemEval evaluation set. No equations, derivations, or parameter-fitting steps are described that could reduce to self-definition or fitted-input-as-prediction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The augmentation step is a preprocessing choice whose impact is measured by standard metrics on unseen data rather than assumed by construction. This is a standard shared-task system paper whose central claims rest on external test-set performance.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The performance claim depends on the assumption that LLM-generated synthetic data faithfully augments the minority classes and that the added discourse features and focal loss provide genuine gains. The abstract offers domain assumptions, not independent evidence, for either.

free parameters (2)
  • focal loss parameters
    Gamma and alpha values chosen to address class imbalance; specific numbers not stated.
  • layer-wise learning rate decay schedule
    Decay factors across transformer layers selected during tuning; values not reported.
axioms (1)
  • domain assumption: Synthetic data from Gemini 3 and Claude Sonnet 4.5 accurately represents the target minority classes without bias or noise.
    Invoked to justify data augmentation for improving recall.

pith-pipeline@v0.9.0 · 5524 in / 1468 out tokens · 39015 ms · 2026-05-10T00:38:36.357972+00:00 · methodology

discussion (0)

