pith. machine review for the scientific record.

arxiv: 2604.26375 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords evasion detection · political interviews · RoBERTa · chunking · max-pooling · multi-task learning · long-context classification · SemEval

The pith

Sliding-window chunking and max-pooling let a shared RoBERTa-large model classify long political interview responses for both coarse clarity and fine evasion strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a practical system for SemEval-2026 Task 6 that detects evasions in English political interviews. Long responses are split into overlapping chunks fed through a RoBERTa-large encoder, then aggregated by element-wise max-pooling to produce a fixed-length representation. Two task heads are trained jointly on a 3-way clarity label and a 9-way evasion strategy label. The resulting model reaches 0.80 macro-F1 on the coarse subtask and 0.51 on the fine-grained subtask while ranking 11th in both. A reader would care because real interview answers routinely exceed the 512-token limit of standard transformers, and this method offers one way to handle them without simple truncation.

Core claim

A shared RoBERTa-large encoder combined with an overlapping sliding-window chunking strategy and element-wise max-pooling aggregation supplies representations that support joint multi-task classification of 3-way clarity and 9-way evasion strategy, achieving macro-F1 scores of 0.80 and 0.51 respectively on the evaluation sets.

What carries the argument

Overlapping sliding-window chunking with element-wise max-pooling aggregation of chunk representations from a shared RoBERTa-large encoder, followed by two task-specific heads trained under a multi-task objective and 7-fold ensembled inference.
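The chunk-and-pool step can be sketched as follows. This is a minimal sketch, not the authors' code: the window and stride sizes are assumptions (the summary does not state them), and in the real system each chunk would pass through RoBERTa-large before pooling.

```python
import numpy as np

def chunk_token_ids(ids, window=512, stride=256):
    """Overlapping sliding-window chunking of a long token sequence.
    window/stride are illustrative defaults, not the paper's values."""
    if len(ids) <= window:
        return [ids]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(ids[start:start + window])
        if start + window >= len(ids):
            break  # last window reaches the end of the sequence
        start += stride
    return chunks

def max_pool(chunk_embeddings):
    """Element-wise max over per-chunk vectors -> one fixed-length vector."""
    return np.max(np.stack(chunk_embeddings), axis=0)
```

Because the max is taken element-wise across chunks, the pooled vector has the same dimensionality as a single chunk representation regardless of response length, which is what lets the downstream heads stay fixed-size.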

If this is right

  • The approach avoids truncation loss on responses longer than 512 tokens.
  • Joint training of the two heads allows information sharing between coarse clarity and fine strategy prediction.
  • 7-fold stratified cross-validation plus inference ensembling improves stability on the modest training data typical of SemEval tasks.
  • The same chunking-plus-pooling pattern can be applied to any downstream classifier that uses a fixed-length encoder on variable-length political text.
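The joint training of the two heads mentioned above can be sketched as a weighted sum of two cross-entropies over a shared pooled representation. Everything here is illustrative: the hidden size, the head weights, and the mixing weight `alpha` are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # stand-in for RoBERTa-large's 1024-dim pooled output

# Hypothetical task heads: 3-way clarity and 9-way evasion strategy
W_clarity = rng.normal(size=(HIDDEN, 3))
W_strategy = rng.normal(size=(HIDDEN, 9))

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def joint_loss(pooled, y_clarity, y_strategy, alpha=0.5):
    """Multi-task objective: weighted sum of the two cross-entropies.
    alpha is an illustrative weighting; the paper does not report one here."""
    p1 = softmax(pooled @ W_clarity)
    p2 = softmax(pooled @ W_strategy)
    return -alpha * np.log(p1[y_clarity]) - (1 - alpha) * np.log(p2[y_strategy])
```

Gradients from both terms flow into the shared encoder, which is the mechanism by which coarse clarity supervision can inform fine-grained strategy prediction and vice versa.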

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunking method might be tested on other long-document domains such as legal or medical transcripts where evasion-like phenomena occur.
  • Replacing max-pooling with learned attention over chunks could be compared directly on this dataset to isolate the contribution of the pooling step.
  • If the 9-way labels contain hierarchical structure, a hierarchical head on top of the pooled representation might raise the 0.51 score without changing the chunking stage.
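The attention-over-chunks alternative raised in the second bullet, in the spirit of the attention-based multiple-instance pooling the paper cites (Ilse et al., 2018), might look like the sketch below; the scoring vector `w` is a hypothetical learned parameter.

```python
import numpy as np

def attention_pool(chunk_embs, w):
    """Scalar-score attention over chunk vectors (cf. Ilse et al., 2018).
    w is a hypothetical learned scoring vector of the same dim as a chunk."""
    H = np.stack(chunk_embs)        # (n_chunks, dim)
    scores = H @ w                  # one relevance score per chunk
    a = np.exp(scores - scores.max())
    a = a / a.sum()                 # softmax attention weights
    return a @ H                    # weighted average, same dim as one chunk
```

Unlike element-wise max, this pooling keeps a soft notion of which chunks mattered, so the attention weights themselves could serve as the length-stratified error analysis the referee asks for below.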

Load-bearing premise

Element-wise max-pooling over the chunk representations sufficiently preserves the information needed to distinguish fine-grained evasion strategies in long responses.

What would settle it

Running the identical model on the same long responses, but with simple truncation at 512 tokens in place of chunking, and measuring whether macro-F1 on the 9-way subtask drops substantially, would directly test the claim.
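Such an ablation only needs the same macro-F1 metric computed for both variants. A self-contained version (matching the standard macro-averaged F1 definition, i.e. per-class F1 averaged with equal class weight) is:

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because macro-F1 weights all nine strategy classes equally, a chunking-vs-truncation gap concentrated in rare, position-dependent strategies would show up clearly here even if overall accuracy barely moves.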

Figures

Figures reproduced from arXiv: 2604.26375 by Gabriel Stefan, Sergiu Nisioi.

Figure 1. Distribution of token counts for the concate…
Figure 2. System architecture. The tokenized concate…
Figure 3. Normalized Confusion Matrix for Subtask 1
Figure 4. Normalized Confusion Matrix for Subtask 2
Original abstract

We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes a system for SemEval-2026 Task 6 on classifying English political interview responses for coarse-grained clarity (3-way) and fine-grained evasion strategies (9-way). To address responses exceeding the 512-token limit of standard Transformers, the authors apply overlapping sliding-window chunking with element-wise Max-Pooling aggregation over representations from a shared RoBERTa-large encoder. Two task-specific heads are trained jointly under a multi-task objective, with inference-time ensembling across 7-fold stratified cross-validation. The system reports Macro-F1 scores of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.

Significance. If the reported scores hold, the work offers a practical, reproducible baseline for long-context classification in shared tasks by combining standard RoBERTa with simple chunking and pooling. The multi-task setup and cross-validation ensembling are clear strengths that could be adopted elsewhere. However, the mid-tier ranking and the large performance drop from coarse to fine-grained labels indicate that the contribution is incremental rather than transformative, particularly without supporting analyses to confirm the aggregation method's sufficiency.

major comments (2)
  1. Abstract and method description: the claim that element-wise Max-Pooling over chunk representations sufficiently handles long responses for 9-way evasion strategy classification is not supported by any ablation or alternative aggregation comparison. Max-pooling is order-invariant and position-agnostic, which directly risks discarding localized cues (e.g., initial deflection vs. terminal non-answer) that are plausibly relevant to Subtask 2; the observed gap (0.80 vs. 0.51) is consistent with this potential bottleneck, yet no length-stratified error analysis or pooling ablation is referenced.
  2. Evaluation section: the manuscript reports only aggregate Macro-F1 on the shared-task test set with no error analysis, no statistical significance tests, and no ablation studies on the chunking window, pooling operator, or multi-task objective. These omissions leave the central performance claims only moderately supported and make it impossible to isolate whether the reported ranking reflects the proposed architecture or other factors.
minor comments (1)
  1. The abstract and method description would benefit from explicit statement of chunk size, overlap, and the precise multi-task loss weighting, as these details are necessary for reproducibility even if the overall architecture is standard.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to better support our methodological choices and evaluation claims. We respond to each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: Abstract and method description: the claim that element-wise Max-Pooling over chunk representations sufficiently handles long responses for 9-way evasion strategy classification is not supported by any ablation or alternative aggregation comparison. Max-pooling is order-invariant and position-agnostic, which directly risks discarding localized cues (e.g., initial deflection vs. terminal non-answer) that are plausibly relevant to Subtask 2; the observed gap (0.80 vs. 0.51) is consistent with this potential bottleneck, yet no length-stratified error analysis or pooling ablation is referenced.

    Authors: We appreciate the referee's point regarding the lack of supporting analyses for the Max-Pooling aggregation. Max-Pooling was selected as a computationally efficient, standard approach for aggregating chunk representations in long-context settings, consistent with prior work on document-level classification. We agree that its order-invariance could potentially overlook position-dependent cues relevant to certain evasion strategies. The performance gap between subtasks is also likely driven by the greater complexity and lower separability of the 9-way labels. In the revised manuscript, we will expand the method section to explicitly discuss this design choice and its limitations, and we will add a length-stratified error analysis using the development set to examine whether longer inputs exhibit distinct error patterns. A comprehensive ablation comparing pooling operators is not feasible within the shared-task timeline and page constraints, so this will be noted as a limitation. revision: partial

  2. Referee: Evaluation section: the manuscript reports only aggregate Macro-F1 on the shared-task test set with no error analysis, no statistical significance tests, and no ablation studies on the chunking window, pooling operator, or multi-task objective. These omissions leave the central performance claims only moderately supported and make it impossible to isolate whether the reported ranking reflects the proposed architecture or other factors.

    Authors: We acknowledge that the evaluation section is limited to aggregate results, as is common in shared-task system descriptions. To strengthen the manuscript, the revised version will include a dedicated error analysis subsection that categorizes frequent confusions in the 9-way subtask (e.g., between semantically similar evasion strategies). We will also add statistical significance tests (such as McNemar's test) comparing our system to the task baselines. A concise summary of internal development-set experiments on the multi-task objective versus single-task training will be incorporated to provide additional support for the joint training approach. Full ablations on chunking parameters and pooling will be referenced only where they informed design decisions, given space limitations. revision: yes
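The significance test proposed in this response needs no extra dependencies: McNemar's test compares two classifiers on the same test items using only the discordant counts. A minimal exact (binomial) version, as a sketch of what the revision might run:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test.
    b = items system A got right and system B got wrong; c = the reverse.
    Under the null, discordant items split 50/50 between b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence either way
    k = min(b, c)
    # two-sided binomial tail probability at p = 0.5
    p = 2 * sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(p, 1.0)
```

For shared-task test sets of modest size the exact form is preferable to the chi-squared approximation, which is unreliable when the discordant counts are small.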

Circularity Check

0 steps flagged

No circularity: empirical system description with independent evaluation

full rationale

The paper reports an empirical NLP system for a shared SemEval task, using standard components (RoBERTa-large encoder, overlapping chunking, element-wise max-pooling, multi-task heads, 7-fold CV ensembling) and evaluates performance via Macro-F1 on held-out task data. No mathematical derivations, predictions, or first-principles claims are made that reduce to fitted parameters, self-definitions, or self-citation chains. The results and architecture choices are externally verifiable against the task benchmark without internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering submission with no mathematical derivations, no new axioms, and no postulated entities; it relies on the standard assumptions of pre-trained transformer encoders and cross-validation for performance estimation.

pith-pipeline@v0.9.0 · 5446 in / 1100 out tokens · 57156 ms · 2026-05-07T13:14:50.559363+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 18 canonical work pages · 4 internal anchors

  3. [3]

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. https://doi.org/10.48550/arXiv.2004.05150 Longformer: The long-document transformer

  4. [4]

    Peter Bull. 2003. The Microanalysis of Political Communication: Claptrap and Ambiguity. Routledge, London

  5. [5]

    Peter Bull and Kate Mayer. 1993. https://doi.org/10.2307/3791379 How not to answer questions in political interviews . Political Psychology, 14(4):651--666

  6. [6]

    Rich Caruana. 1997. https://doi.org/10.1023/A:1007379606734 Multitask learning . Machine Learning, 28(1):41--75

  7. [7]

    Steven E. Clayman. 2001. https://doi.org/10.1017/S0047404501003037 Answers and evasions . Language in Society, 30(3):403--442

  8. [8]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171--4186, Minneapolis, Minnesota. Association for Computational Linguistics

  9. [9]

    Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. https://doi.org/10.1016/S0004-3702(96)00034-3 Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1--2):31--71

  10. [10]

    Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. 2012. https://doi.org/10.1109/TSMCC.2011.2161285 A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches . IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463--484

  11. [11]

    Sandra Harris. 1991. Evasive action: How politicians respond to questions in political interviews. In Paddy Scannell, editor, Broadcast Talk, pages 76--99. Sage, London

  12. [12]

    Maximilian Ilse, Jakub Tomczak, and Max Welling. 2018. https://proceedings.mlr.press/v80/ilse18a.html Attention-based deep multiple instance learning . In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 2127--2136. PMLR

  13. [13]

    Omar Khattab and Matei Zaharia. 2020. https://doi.org/10.1145/3397271.3401075 ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39--48. Association for Computing Machinery

  14. [14]

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. https://doi.org/10.1109/ICCV.2017.324 Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980--2988. IEEE

  15. [15]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. http://arxiv.org/abs/1907.11692 RoBERTa: A robustly optimized BERT pretraining approach

  16. [16]

    Ilya Loshchilov and Frank Hutter. 2019. https://doi.org/10.48550/arXiv.1711.05101 Decoupled weight decay regularization . arXiv preprint arXiv:1711.05101

  17. [17]

    Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. https://doi.org/10.1109/ASRU46091.2019.9003958 Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 838--844. IEEE

  18. [18]

    Parameswary Rasiah. 2010. https://doi.org/10.1016/j.pragma.2009.07.010 A framework for the systematic analysis of evasion in parliamentary discourse . Journal of Pragmatics, 42(3):664--680

  19. [19]

    Sebastian Ruder. 2017. https://doi.org/10.48550/arXiv.1706.05098 An overview of multi-task learning in deep neural networks

  20. [20]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. https://doi.org/10.48550/arXiv.1910.01108 DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019

  21. [21]

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929--1958

  22. [22]

    Konstantinos Thomas, George Filandrianos, Maria Lymperaiou, Chrysoula Zerva, and Giorgos Stamou. 2026. SemEval-2026 Task 6: CLARITY - unmasking political question evasions. In Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026). Association for Computational Linguistics

  23. [23]

    Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, and Giorgos Stamou. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.300 "I Never Said That": A dataset, taxonomy and baselines on response clarity classification. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5204--5233, Miami, Florida, USA

  24. [24]

    Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Griffin Thomas Adams, Jeremy Howard, and Iacopo Poli. 2025. https://doi.org/10.18653/v1/2025.acl-long.127 Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference

  25. [25]

    Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2021. https://doi.org/10.48550/arXiv.2007.14062 Big Bird: Transformers for longer sequences