UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

Amaia Zurinaga; Darya Hryhoryeva; Hamidreza Jamalabadi; Iryna Gurevych

arXiv: 2604.21534 · v1 · submitted 2026-04-23 · cs.CL

Pith reviewed 2026-05-09 21:53 UTC · model grok-4.3

keywords: affective computing, valence and arousal, LLM prompting, emotional dynamics, neural regression, user embeddings, SemEval task, short-term change modeling

The pith

LLMs capture current emotions from text well, but recent numeric trajectories explain short-term changes better than text semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three methods for tracking both a person's current emotional state and how it shifts over short sequences of their own texts. Large language models are used to read the words for the current state, a structured transition model handles ordered changes, and a neural model adds the recent history of emotion numbers plus user-specific details. Results show text works for the static part while the recent numbers drive the dynamic part more reliably. This distinction matters for building systems that follow emotional flow over time rather than just labeling single messages.

Core claim

Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. The system that combined LLM prompting with a neural regression model using trajectories and user embeddings ranked first in both Subtask 1 and Subtask 2A under the official metric.

What carries the argument

The lightweight neural regression model that incorporates recent affective trajectories and trainable user embeddings, shown to outperform text-based approaches for modeling short-term changes.
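As a rough illustration of the idea named above, the following Python sketch fits a linear regressor over the last k affect scores plus a trainable per-user embedding, using plain SGD on squared error. It is not the authors' implementation: the class name TrajectoryRegressor, the window size k, the embedding size, the learning rate, and the toy sequences are all assumptions made for illustration, and the paper's actual architecture and features are not described on this page.

```python
import random


class TrajectoryRegressor:
    """Linear model over the last k scores plus a trainable per-user vector."""

    def __init__(self, n_users, k=3, emb_dim=2, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.k, self.lr = k, lr
        self.w_hist = [0.0] * k  # weights on the k most recent scores
        self.w_emb = [rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
        # One small trainable embedding per user (hypothetical size).
        self.user_emb = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
                         for _ in range(n_users)]
        self.bias = 0.0

    def predict(self, user, history):
        emb = self.user_emb[user]
        return (self.bias
                + sum(w * h for w, h in zip(self.w_hist, history))
                + sum(w * e for w, e in zip(self.w_emb, emb)))

    def step(self, user, history, target):
        """One SGD step; returns the squared error before the update."""
        err = self.predict(user, history) - target
        emb = self.user_emb[user]
        # Compute both embedding-related gradients before changing anything.
        grad_w_emb = [err * e for e in emb]
        grad_emb = [err * w for w in self.w_emb]
        self.w_hist = [w - self.lr * err * h
                       for w, h in zip(self.w_hist, history)]
        self.w_emb = [w - self.lr * g for w, g in zip(self.w_emb, grad_w_emb)]
        self.user_emb[user] = [e - self.lr * g for e, g in zip(emb, grad_emb)]
        self.bias -= self.lr * err
        return err * err


def make_examples(sequences, k):
    """Turn {user: [scores]} into (user, last-k history, next score) triples."""
    return [(user, seq[t - k:t], seq[t])
            for user, seq in sequences.items()
            for t in range(k, len(seq))]


def train(model, examples, epochs=200):
    """Run SGD for a fixed number of epochs; returns mean loss per epoch."""
    losses = []
    for _ in range(epochs):
        total = sum(model.step(u, h, y) for u, h, y in examples)
        losses.append(total / len(examples))
    return losses


# Toy valence sequences for two hypothetical users (invented values).
sequences = {
    0: [0.5, 0.6, 0.5, 0.7, 0.6, 0.6],
    1: [-0.4, -0.5, -0.3, -0.4, -0.5, -0.4],
}
examples = make_examples(sequences, k=3)
model = TrajectoryRegressor(n_users=2, k=3)
losses = train(model, examples)
print(f"first-epoch loss {losses[0]:.4f}, last-epoch loss {losses[-1]:.4f}")
```

Note that the model above never sees any text: it predicts the next score from numeric history and user identity alone, which is the property the review's core claim turns on.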

Load-bearing premise

The SemEval-2026 Task 2 dataset and evaluation metric provide a valid test of real-world affective dynamics modeling, with no major biases in the chronologically ordered texts or labels.

What would settle it

A follow-up experiment on a new chronologically ordered text dataset where adding text features improves short-term change prediction accuracy beyond what numeric trajectories alone achieve.

Original abstract

This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.
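For readers unfamiliar with the second approach: a pairwise MaxEnt model with Ising-style interactions assigns each joint configuration s of binary variables a probability proportional to exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j). The Python sketch below enumerates that distribution exactly for a handful of variables. How the paper encodes affective states as binary variables, and how it fits the fields h and couplings J, is not stated on this page; the function name, the two-variable example, and its parameter values are assumptions.

```python
import itertools
import math


def ising_distribution(h, J):
    """Exact pairwise MaxEnt distribution over states in {-1, +1}^n.

    h is a list of n local fields; J is an n x n matrix of which only the
    upper triangle (i < j) is read. Enumeration is exponential in n, so this
    is only meant for very small n.
    """
    n = len(h)
    states = list(itertools.product((-1, 1), repeat=n))

    def log_weight(s):
        field = sum(h[i] * s[i] for i in range(n))
        pair = sum(J[i][j] * s[i] * s[j]
                   for i in range(n) for j in range(i + 1, n))
        return field + pair

    weights = [math.exp(log_weight(s)) for s in states]
    z = sum(weights)  # partition function
    return {s: w / z for s, w in zip(states, weights)}


# Hypothetical example: two variables with no local field and a positive
# coupling, so configurations where both agree are more probable.
dist = ising_distribution([0.0, 0.0], [[0.0, 1.0], [0.0, 0.0]])
for state, prob in sorted(dist.items()):
    print(state, round(prob, 4))
```

A positive coupling J_ij makes the two variables tend to agree and a negative one makes them tend to disagree, which is what lets such a model represent structured co-movement between state variables.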

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes the UKP_Psycontrol system submitted to SemEval-2026 Task 2, which requires predicting both current valence/arousal levels and short-term affective changes from chronologically ordered user-generated texts. It evaluates three approaches: (1) LLM prompting under user-aware and user-agnostic conditions, (2) a pairwise Maximum Entropy model incorporating Ising-style interactions for transition modeling, and (3) a lightweight neural regression model that uses recent numeric affective trajectories plus trainable user embeddings. The central claim is that LLMs capture static affective signals from text effectively, while short-term dynamics in this dataset are better explained by numeric state trajectories than by textual semantics; the system achieved first place in Subtask 1 and Subtask 2A.

Significance. If the empirical contrast holds after addressing dataset concerns, the work usefully separates static versus dynamic affective modeling and shows that incorporating short-term numeric history can outperform text-only or LLM-based predictors for transitions. The top shared-task ranking and the use of complementary structured and neural methods provide a practical baseline for future user-state tracking systems.

major comments (1)

[Experimental results / Discussion] The headline finding that numeric trajectories outperform textual semantics for short-term change (abstract and experimental results) is load-bearing for the paper's contrast between approaches. The skeptic note correctly flags that this superiority could arise from dataset artifacts such as temporal autocorrelation, stable per-user baselines, or annotation propagation across sequences rather than genuine semantic limitations. No autocorrelation plots, user-level variance decomposition, order-permutation controls, or similar diagnostics appear to be reported; without them, the claim that short-term variation is 'more strongly explained' by trajectories than by text remains vulnerable to chronological ordering bias.
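Two of the diagnostics named in this comment are simple to compute. The Python sketch below shows a lag-1 autocorrelation for a single user's sequence and a between-user versus within-user split of the total sum of squares. The function names and the toy sequences are assumptions for illustration; they are not taken from the paper or from the task data.

```python
from statistics import mean


def lag1_autocorrelation(seq):
    """Lag-1 autocorrelation of one sequence (0.0 if the sequence is constant)."""
    m = mean(seq)
    denom = sum((x - m) ** 2 for x in seq)
    if denom == 0:
        return 0.0
    num = sum((seq[t] - m) * (seq[t + 1] - m) for t in range(len(seq) - 1))
    return num / denom


def variance_decomposition(sequences):
    """Split the total sum of squares into between-user and within-user shares."""
    all_values = [x for seq in sequences.values() for x in seq]
    grand = mean(all_values)
    between = sum(len(seq) * (mean(seq) - grand) ** 2
                  for seq in sequences.values())
    within = sum((x - mean(seq)) ** 2
                 for seq in sequences.values() for x in seq)
    total = between + within
    if total == 0:
        return 0.0, 0.0
    return between / total, within / total


# Toy valence sequences for two hypothetical users (invented values).
toy = {
    0: [0.5, 0.6, 0.5, 0.7],
    1: [-0.4, -0.5, -0.3, -0.4],
}
for user, seq in toy.items():
    print(user, round(lag1_autocorrelation(seq), 3))
between_share, within_share = variance_decomposition(toy)
print(f"between-user share {between_share:.3f}, within-user share {within_share:.3f}")
```

A large between-user share would suggest that stable per-user baselines account for much of what a trajectory model can predict, which is one of the alternative explanations the comment raises.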

minor comments (2)

[Approaches and Experiments] Implementation details, hyper-parameters, exact training procedures, and full quantitative tables (including ablations, error bars, and per-subtask scores) are referenced only at a high level; expanding these would strengthen verifiability.
[Abstract] The abstract states the ranking result but does not include the official metric values or direct comparisons to the other participating systems; adding a concise results table would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the major comment below regarding potential dataset artifacts in our comparison of numeric trajectories versus textual semantics for short-term affective dynamics.

Point-by-point responses

Referee: [Experimental results / Discussion] The headline finding that numeric trajectories outperform textual semantics for short-term change (abstract and experimental results) is load-bearing for the paper's contrast between approaches. The skeptic note correctly flags that this superiority could arise from dataset artifacts such as temporal autocorrelation, stable per-user baselines, or annotation propagation across sequences rather than genuine semantic limitations. No autocorrelation plots, user-level variance decomposition, order-permutation controls, or similar diagnostics appear to be reported; without them, the claim that short-term variation is 'more strongly explained' by trajectories than by text remains vulnerable to chronological ordering bias.

Authors: We agree that the absence of these diagnostics leaves the central claim vulnerable to alternative explanations rooted in dataset structure rather than the semantic limitations of text. Our neural regression model relies on recent numeric trajectories and user embeddings precisely to capture dynamic changes beyond static baselines, while the LLM approaches rely on textual input; however, without explicit controls we cannot fully rule out autocorrelation or ordering effects. In the revised manuscript we will add (1) autocorrelation plots of valence and arousal sequences per user, (2) a variance decomposition separating between-user stable components from within-user temporal variation, and (3) order-permutation controls that randomly shuffle sequence order within users before re-training and evaluating the trajectory-based model. These additions will directly test whether the predictive advantage of numeric trajectories depends on genuine short-term dynamics.

Revision: yes
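The order-permutation control in item (3) can be sketched in a few lines of Python. To keep the example self-contained, the sketch below swaps in a last-value (persistence) predictor for the trajectory-based model instead of re-training anything, so it illustrates the logic of the control, not the authors' planned experiment. The function names, the number of permutations, and the toy sequences are assumptions.

```python
import random


def persistence_mse(sequences):
    """MSE of predicting each score as the previous score in the sequence."""
    errors = [(seq[t] - seq[t - 1]) ** 2
              for seq in sequences.values()
              for t in range(1, len(seq))]
    return sum(errors) / len(errors)


def shuffle_within_users(sequences, rng):
    """Return a copy in which each user's sequence order is randomly permuted."""
    shuffled = {}
    for user, seq in sequences.items():
        copy = list(seq)
        rng.shuffle(copy)
        shuffled[user] = copy
    return shuffled


def permutation_control(sequences, n_perm=200, seed=0):
    """Compare the error on the true order with errors on shuffled orders."""
    rng = random.Random(seed)
    observed = persistence_mse(sequences)
    shuffled = [persistence_mse(shuffle_within_users(sequences, rng))
                for _ in range(n_perm)]
    # One-sided estimate: how often a shuffled order does at least as well.
    p_value = (1 + sum(m <= observed for m in shuffled)) / (1 + n_perm)
    return observed, sum(shuffled) / n_perm, p_value


# Toy data: two hypothetical users with smooth, strongly ordered trajectories.
toy = {
    0: [i / 10 for i in range(10)],
    1: [(9 - i) / 10 for i in range(10)],
}
observed, shuffled_mean, p_value = permutation_control(toy)
print(f"ordered MSE {observed:.4f}, shuffled mean MSE {shuffled_mean:.4f}, "
      f"p = {p_value:.4f}")
```

If the error on the true order is clearly lower than on shuffled orders, the predictor is exploiting temporal ordering; if the two are similar, its advantage more likely comes from order-independent structure such as per-user baselines.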

Circularity Check

0 steps flagged

No circularity: empirical modeling paper with no derivations or self-referential reductions

Full rationale

The paper reports results from three standard modeling approaches (LLM prompting, MaxEnt with Ising interactions, and neural regression on trajectories plus embeddings) trained and evaluated on the SemEval-2026 Task 2 dataset using held-out testing. No equations, derivations, or parameter-fitting steps are described that would reduce a claimed prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central empirical contrast between static LLM performance and trajectory-based dynamics is presented as an observation on this specific dataset rather than a self-contained logical necessity. This is a typical competition-system paper whose claims remain externally falsifiable via the shared task data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or novel theoretical constructs; relies on standard supervised learning assumptions for the three models.
