UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text
Pith reviewed 2026-05-09 21:53 UTC · model grok-4.3
The pith
LLMs capture current emotions from text well, but recent numeric trajectories explain short-term changes better than text semantics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. The system that combined LLM prompting with a neural regression model using trajectories and user embeddings ranked first in both Subtask 1 and Subtask 2A under the official metric.
What carries the argument
The lightweight neural regression model that incorporates recent affective trajectories and trainable user embeddings, shown to outperform text-based approaches for modeling short-term changes.
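To make the idea of predicting change from recent numeric states concrete, here is a minimal sketch. It is not the authors' model: it assumes a plain linear regressor trained by SGD, and it stands in for the trainable user embedding with a single scalar offset per user. The names `make_examples`, `fit`, `predict`, and the window size `k` are hypothetical.

```python
import random

def make_examples(series_by_user, k=2):
    """Turn each user's chronological score list into (user, last-k scores, next change) rows."""
    rows = []
    for user, series in series_by_user.items():
        for t in range(k, len(series)):
            history = series[t - k:t]           # the k most recent numeric states
            change = series[t] - series[t - 1]  # short-term change to predict
            rows.append((user, history, change))
    return rows

def fit(rows, k=2, epochs=200, lr=0.05, seed=0):
    """Fit change ~ w . history + b + user_offset[user] by plain SGD on squared error."""
    rng = random.Random(seed)
    w = [0.0] * k
    b = 0.0
    # Assumption: a scalar offset per user, a 1-d stand-in for a trainable user embedding.
    user_offset = {user: 0.0 for user, _, _ in rows}
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for user, history, change in rows:
            pred = sum(wi * hi for wi, hi in zip(w, history)) + b + user_offset[user]
            err = pred - change
            for i in range(k):
                w[i] -= lr * err * history[i]
            b -= lr * err
            user_offset[user] -= lr * err
    return w, b, user_offset

def predict(model, user, history):
    """Predict the next change for a user; unseen users fall back to a zero offset."""
    w, b, user_offset = model
    return sum(wi * hi for wi, hi in zip(w, history)) + b + user_offset.get(user, 0.0)
```

Note that no text enters this predictor at all, which is why the comparison against text-based models is sensitive to how much of the signal is plain temporal structure.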
Load-bearing premise
The SemEval-2026 Task 2 dataset and evaluation metric provide a valid test of real-world affective dynamics modeling, with no major biases in the chronologically ordered texts or labels.
What would settle it
A follow-up experiment on a new chronologically ordered text dataset where adding text features improves short-term change prediction accuracy beyond what numeric trajectories alone achieve.
Original abstract
This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.
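For readers unfamiliar with the term, a pairwise MaxEnt model with Ising-style interactions generally takes the textbook form below. This is the standard formulation, not necessarily the exact parameterization used in the paper. The binary state encoding $s_i \in \{-1, +1\}$, the fields $h_i$, and the couplings $J_{ij}$ are assumptions made here for illustration.

```latex
P(\mathbf{s}) = \frac{1}{Z} \exp\Big( \sum_{i} h_i s_i + \sum_{i<j} J_{ij}\, s_i s_j \Big),
\qquad
Z = \sum_{\mathbf{s}'} \exp\Big( \sum_{i} h_i s'_i + \sum_{i<j} J_{ij}\, s'_i s'_j \Big)
```

Here $h_i$ biases each variable individually, $J_{ij}$ rewards or penalizes joint configurations of pairs, and $Z$ normalizes over all configurations.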
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the UKP_Psycontrol system submitted to SemEval-2026 Task 2, which requires predicting both current valence/arousal levels and short-term affective changes from chronologically ordered user-generated texts. It evaluates three approaches: (1) LLM prompting under user-aware and user-agnostic conditions, (2) a pairwise Maximum Entropy model incorporating Ising-style interactions for transition modeling, and (3) a lightweight neural regression model that uses recent numeric affective trajectories plus trainable user embeddings. The central claim is that LLMs capture static affective signals from text effectively, while short-term dynamics in this dataset are better explained by numeric state trajectories than by textual semantics; the system achieved first place in Subtask 1 and Subtask 2A.
Significance. If the empirical contrast holds after addressing dataset concerns, the work usefully separates static versus dynamic affective modeling and shows that incorporating short-term numeric history can outperform text-only or LLM-based predictors for transitions. The top shared-task ranking and the use of complementary structured and neural methods provide a practical baseline for future user-state tracking systems.
major comments (1)
- [Experimental results / Discussion] The headline finding that numeric trajectories outperform textual semantics for short-term change (abstract and experimental results) is load-bearing for the paper's contrast between approaches. The skeptic note correctly flags that this superiority could arise from dataset artifacts such as temporal autocorrelation, stable per-user baselines, or annotation propagation across sequences rather than from genuine semantic limitations. No autocorrelation plots, user-level variance decomposition, order-permutation controls, or similar diagnostics appear to be reported; without them, the claim that short-term variation is 'more strongly explained' by trajectories than by text remains vulnerable to chronological-ordering bias.
minor comments (2)
- [Approaches and Experiments] Implementation details, hyper-parameters, exact training procedures, and full quantitative tables (including ablations, error bars, and per-subtask scores) are referenced only at a high level; expanding these would strengthen verifiability.
- [Abstract] The abstract states the ranking result but does not include the official metric values or direct comparisons to the other participating systems; adding a concise results table would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the major comment below regarding potential dataset artifacts in our comparison of numeric trajectories versus textual semantics for short-term affective dynamics.
- Referee: [Experimental results / Discussion] The headline finding that numeric trajectories outperform textual semantics for short-term change (abstract and experimental results) is load-bearing for the paper's contrast between approaches. The skeptic note correctly flags that this superiority could arise from dataset artifacts such as temporal autocorrelation, stable per-user baselines, or annotation propagation across sequences rather than from genuine semantic limitations. No autocorrelation plots, user-level variance decomposition, order-permutation controls, or similar diagnostics appear to be reported; without them, the claim that short-term variation is 'more strongly explained' by trajectories than by text remains vulnerable to chronological-ordering bias.
Authors: We agree that the absence of these diagnostics leaves the central claim vulnerable to alternative explanations rooted in dataset structure rather than the semantic limitations of text. Our neural regression model relies on recent numeric trajectories and user embeddings precisely to capture dynamic changes beyond static baselines, while the LLM approaches rely on textual input; however, without explicit controls we cannot fully rule out autocorrelation or ordering effects. In the revised manuscript we will add (1) autocorrelation plots of valence and arousal sequences per user, (2) a variance decomposition separating between-user stable components from within-user temporal variation, and (3) order-permutation controls that randomly shuffle sequence order within users before re-training and evaluating the trajectory-based model. These additions will directly test whether the predictive advantage of numeric trajectories depends on genuine short-term dynamics.
Revision: yes
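The three diagnostics promised in the rebuttal could be computed along the following lines. This is a rough sketch under stated assumptions, not the authors' analysis code: it assumes each user's labels are available as a chronologically ordered list of numbers, and the function names are hypothetical.

```python
import random
from statistics import mean, pvariance

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation of one user's chronological scores (0.0 if the series is constant)."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((series[t] - m) * (series[t + 1] - m) for t in range(len(series) - 1))
    return num / denom

def variance_decomposition(series_by_user):
    """Split total variance into between-user and within-user parts (law of total variance)."""
    all_values = [x for s in series_by_user.values() for x in s]
    n = len(all_values)
    grand = mean(all_values)
    between = sum(len(s) * (mean(s) - grand) ** 2 for s in series_by_user.values()) / n
    within = sum(len(s) * pvariance(s) for s in series_by_user.values()) / n
    return between, within

def shuffled_within_user(series_by_user, seed=0):
    """Order-permutation control: shuffle each user's sequence, keeping user membership intact."""
    rng = random.Random(seed)
    return {u: rng.sample(s, len(s)) for u, s in series_by_user.items()}
```

If a trajectory-based model retrained on the output of `shuffled_within_user` loses most of its advantage, the advantage depends on temporal order; if it keeps it, stable per-user baselines are the more likely explanation. A large between-user share in `variance_decomposition` would point the same way.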
Circularity Check
No circularity: empirical modeling paper with no derivations or self-referential reductions
The paper reports results from three standard modeling approaches (LLM prompting, MaxEnt with Ising interactions, and neural regression on trajectories plus embeddings) trained and evaluated on the SemEval-2026 Task 2 dataset using held-out testing. No equations, derivations, or parameter-fitting steps are described that would reduce a claimed prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central empirical contrast between static LLM performance and trajectory-based dynamics is presented as an observation on this specific dataset rather than a self-contained logical necessity. This is a typical competition-system paper whose claims remain externally falsifiable via the shared task data.