UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text
Pith reviewed 2026-05-09 21:53 UTC · model grok-4.3
The pith
LLMs capture current emotions from text well, but recent numeric trajectories explain short-term changes better than text semantics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. The system that combined LLM prompting with a neural regression model using trajectories and user embeddings ranked first in both Subtask 1 and Subtask 2A under the official metric.
What carries the argument
The lightweight neural regression model that incorporates recent affective trajectories and trainable user embeddings, shown to outperform text-based approaches for modeling short-term changes.
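To make the idea of regressing on recent trajectories plus a per-user term concrete, the following is a minimal sketch rather than the authors' model. It assumes a plain linear predictor over the last `k` affect values and a single trainable scalar offset per user as a stand-in for a user embedding. The function names, toy data, window size, and learning rate are all hypothetical; the paper's actual architecture and hyper-parameters are not given in this summary.

```python
import random

def make_examples(by_user, k=2):
    """Build (user, last-k values, next change) examples from each user's sequence."""
    examples = []
    for user, seq in by_user.items():
        for t in range(k, len(seq)):
            history = seq[t - k:t]
            examples.append((user, history, seq[t] - seq[t - 1]))
    return examples

def train(examples, k=2, epochs=200, lr=0.05, seed=0):
    """Fit a linear map over the history plus one scalar offset per user with SGD.

    Assumption: the scalar offset is a 1-dimensional stand-in for a trainable
    user embedding; a real model would likely use a vector and a nonlinearity.
    """
    rng = random.Random(seed)
    w = [0.0] * k
    user_bias = {user: 0.0 for user, _, _ in examples}
    order = list(examples)  # copy so the caller's list is not reordered
    losses = []
    for _ in range(epochs):
        rng.shuffle(order)
        total = 0.0
        for user, hist, target in order:
            pred = sum(wi * xi for wi, xi in zip(w, hist)) + user_bias[user]
            err = pred - target
            total += err * err
            for i in range(k):
                w[i] -= lr * err * hist[i]
            user_bias[user] -= lr * err
        losses.append(total / len(order))  # mean squared error over the epoch
    return w, user_bias, losses

# Hypothetical toy valence sequences for two users.
by_user = {
    "u1": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "u2": [1.0, 0.5, 1.0, 0.5, 1.0, 0.5],
}
examples = make_examples(by_user, k=2)
w, user_bias, losses = train(examples, k=2)
```

Note that no text enters this predictor at all, which is the point of the contrast drawn above: if such a model matches or beats text-based predictors on change prediction, the signal must come from the numeric history and the user identity.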
Load-bearing premise
The SemEval-2026 Task 2 dataset and evaluation metric provide a valid test of real-world affective dynamics modeling, with no major biases in the chronologically ordered texts or labels.
What would settle it
A follow-up experiment on a new chronologically ordered text dataset where adding text features improves short-term change prediction accuracy beyond what numeric trajectories alone achieve.
Original abstract
This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.
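For readers unfamiliar with approach (2), a pairwise MaxEnt model with Ising-style interactions is conventionally written in the generic form below. The symbols $s_i$ (state variables), $h_i$ (per-variable biases), and $J_{ij}$ (pairwise couplings) are assumptions chosen for illustration; the abstract does not state the paper's exact parameterization or how affective states are encoded.

```latex
P(\mathbf{s}) = \frac{1}{Z} \exp\Big( \sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j \Big),
\qquad
Z = \sum_{\mathbf{s}'} \exp\Big( \sum_i h_i s'_i + \sum_{i<j} J_{ij} s'_i s'_j \Big)
```

In this form the couplings $J_{ij}$ are what would carry structured dependencies between states, for example between consecutive time steps in a transition model.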
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the UKP_Psycontrol system submitted to SemEval-2026 Task 2, which requires predicting both current valence/arousal levels and short-term affective changes from chronologically ordered user-generated texts. It evaluates three approaches: (1) LLM prompting under user-aware and user-agnostic conditions, (2) a pairwise Maximum Entropy model incorporating Ising-style interactions for transition modeling, and (3) a lightweight neural regression model that uses recent numeric affective trajectories plus trainable user embeddings. The central claim is that LLMs capture static affective signals from text effectively, while short-term dynamics in this dataset are better explained by numeric state trajectories than by textual semantics; the system achieved first place in Subtask 1 and Subtask 2A.
Significance. If the empirical contrast holds after addressing dataset concerns, the work usefully separates static versus dynamic affective modeling and shows that incorporating short-term numeric history can outperform text-only or LLM-based predictors for transitions. The top shared-task ranking and the use of complementary structured and neural methods provide a practical baseline for future user-state tracking systems.
major comments (1)
- [Experimental results / Discussion] The headline finding that numeric trajectories outperform textual semantics for short-term change (abstract and experimental results) is load-bearing for the paper's contrast between approaches. The skeptic note correctly flags that this superiority could arise from dataset artifacts such as temporal autocorrelation, stable per-user baselines, or annotation propagation across sequences rather than from genuine semantic limitations. No autocorrelation plots, user-level variance decomposition, order-permutation controls, or similar diagnostics appear to be reported; without them, the claim that short-term variation is 'more strongly explained' by trajectories than by text remains vulnerable to a chronological-ordering bias.
minor comments (2)
- [Approaches and Experiments] Implementation details, hyper-parameters, exact training procedures, and full quantitative tables (including ablations, error bars, and per-subtask scores) are referenced only at a high level; expanding these would strengthen verifiability.
- [Abstract] The abstract states the ranking result but does not include the official metric values or direct comparisons to the other participating systems; adding a concise results table would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the major comment below regarding potential dataset artifacts in our comparison of numeric trajectories versus textual semantics for short-term affective dynamics.
Point-by-point responses
- Referee: [Experimental results / Discussion] The headline finding that numeric trajectories outperform textual semantics for short-term change (abstract and experimental results) is load-bearing for the paper's contrast between approaches. The skeptic note correctly flags that this superiority could arise from dataset artifacts such as temporal autocorrelation, stable per-user baselines, or annotation propagation across sequences rather than from genuine semantic limitations. No autocorrelation plots, user-level variance decomposition, order-permutation controls, or similar diagnostics appear to be reported; without them, the claim that short-term variation is 'more strongly explained' by trajectories than by text remains vulnerable to a chronological-ordering bias.
Authors: We agree that the absence of these diagnostics leaves the central claim vulnerable to alternative explanations rooted in dataset structure rather than the semantic limitations of text. Our neural regression model relies on recent numeric trajectories and user embeddings precisely to capture dynamic changes beyond static baselines, while the LLM approaches rely on textual input; however, without explicit controls we cannot fully rule out autocorrelation or ordering effects. In the revised manuscript we will add (1) autocorrelation plots of valence and arousal sequences per user, (2) a variance decomposition separating between-user stable components from within-user temporal variation, and (3) order-permutation controls that randomly shuffle sequence order within users before re-training and evaluating the trajectory-based model. These additions will directly test whether the predictive advantage of numeric trajectories depends on genuine short-term dynamics.
Revision: yes
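The three diagnostics named in the response can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the authors' analysis code: the function names are hypothetical, the data layout (a dict mapping each user to a chronologically ordered list of scores) is assumed, and lag-1 autocorrelation and a sum-of-squares split are used as simple stand-ins for the promised plots and variance decomposition.

```python
import random
from statistics import mean

def lag1_autocorr(seq):
    """Lag-1 autocorrelation of one user's sequence (0.0 if undefined)."""
    if len(seq) < 2:
        return 0.0
    m = mean(seq)
    den = sum((x - m) ** 2 for x in seq)
    if den == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(seq, seq[1:]))
    return num / den

def variance_decomposition(by_user):
    """Return (between-user share, within-user share) of the total sum of squares."""
    all_vals = [x for seq in by_user.values() for x in seq]
    grand = mean(all_vals)
    between = sum(len(s) * (mean(s) - grand) ** 2 for s in by_user.values())
    within = sum((x - mean(s)) ** 2 for s in by_user.values() for x in s)
    total = between + within
    if total == 0:
        return 0.0, 0.0
    return between / total, within / total

def shuffled_within_users(by_user, seed=0):
    """Order-permutation control: shuffle each user's sequence independently."""
    rng = random.Random(seed)
    return {user: rng.sample(seq, len(seq)) for user, seq in by_user.items()}

# Hypothetical toy data: user -> chronologically ordered valence scores.
by_user = {
    "u1": [1, 1, 2, 2, 1, 1],
    "u2": [-2, -1, -1, -2, -2, -1],
}
autocorr = {user: lag1_autocorr(seq) for user, seq in by_user.items()}
between_share, within_share = variance_decomposition(by_user)
control = shuffled_within_users(by_user)
```

If a trajectory-based model retrained on `control` loses most of its advantage, the original gain depended on temporal order; if it keeps the advantage, stable per-user baselines (a large between-user share) are the more likely explanation.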
Circularity Check
No circularity: empirical modeling paper with no derivations or self-referential reductions
Full rationale
The paper reports results from three standard modeling approaches (LLM prompting, MaxEnt with Ising interactions, and neural regression on trajectories plus embeddings) trained and evaluated on the SemEval-2026 Task 2 dataset using held-out testing. No equations, derivations, or parameter-fitting steps are described that would reduce a claimed prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central empirical contrast between static LLM performance and trajectory-based dynamics is presented as an observation on this specific dataset rather than a self-contained logical necessity. This is a typical competition-system paper whose claims remain externally falsifiable via the shared task data.