LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 18:05 UTC · model grok-4.3
The pith
Language models turn handcrafted facial and acoustic features into semantic descriptions that serve as priors for predicting changes in valence and arousal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting structured facial-geometry and acoustic features into symbolic natural-language descriptions and passing those descriptions through a pretrained language model, the resulting semantic context embeddings act as high-level priors that improve the prediction of valence and arousal changes; the same pipeline remains fully interpretable and computationally lighter than fully end-to-end architectures.
What carries the argument
Semantic context embeddings produced by a pretrained language model that processes natural-language encodings of handcrafted affect descriptors.
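The paper's exact templates, thresholds, and LM backbone are not reproduced here, so the sketch below is only a minimal illustration of the conditioning step it describes: placeholder descriptor names and toy rules turn numeric features into short sentences, which a pretrained sentence encoder maps to semantic context embeddings.

```python
# Minimal sketch of the feature -> description -> embedding step.
# All descriptor names, thresholds, and the encoder choice are assumptions,
# not the authors' configuration.
from sentence_transformers import SentenceTransformer

features = {"brow_raise": 0.71, "mouth_open": 0.12, "f0_slope": -0.40}

def describe(name: str, value: float) -> str:
    """Map a numeric descriptor to a short qualitative sentence."""
    level = "strong" if abs(value) > 0.5 else "mild"
    direction = "increase" if value > 0 else "decrease"
    return f"{level} {direction} in {name.replace('_', ' ')}"

descriptions = [describe(k, v) for k, v in features.items()]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder backbone
context = encoder.encode(descriptions)             # (n_features, embed_dim)
# `context` would then condition a downstream valence/arousal change predictor.
```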
If this is right
- Affective models stay transparent enough for domain experts to inspect or edit the conditioning descriptions.
- Prediction accuracy for both valence and arousal rises above handcrafted-only and deep-embedding baselines on Aff-Wild2 and SEWA.
- The framework remains computationally lighter than full end-to-end deep networks while scaling to unconstrained video.
- Expert knowledge encoded in features can be injected at the language level without retraining the entire pipeline.
Where Pith is reading between the lines
- The same pattern of turning domain features into language priors could be tested on other time-series affect tasks such as pain intensity or engagement prediction.
- If the language-model step generalizes across datasets, hybrid symbolic-neural pipelines may reduce the data hunger typical of pure deep models in affective computing.
- Psychologists could directly edit the natural-language descriptions to steer model behavior without touching neural weights.
Load-bearing premise
Converting numerical handcrafted features into natural-language statements preserves their affective meaning without adding noise or domain mismatch that would degrade the language model's priors.
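One cheap way to probe this premise, sketched below with the same placeholder encoder as above: a faithful description and a paraphrase of it should embed close together, while a semantically mismatched sentence should not.

```python
# Sanity check of the premise: paraphrases should stay close in embedding
# space, mismatched descriptions should not. Encoder choice is an assumption.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
faithful = "strong increase in brow raise"
paraphrase = "brows lifted markedly"
mismatched = "flat monotone speech with low energy"

emb = encoder.encode([faithful, paraphrase, mismatched])
print(util.cos_sim(emb[0], emb[1]))  # expected: relatively high
print(util.cos_sim(emb[0], emb[2]))  # expected: noticeably lower
```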
What would settle it
Replace the faithful natural-language descriptions with randomly generated or semantically mismatched sentences and measure valence and arousal prediction accuracy. If accuracy collapses to, or below, the handcrafted-only baseline, the semantic content of the descriptions was doing real work; if it holds up, the conditioning step is not supplying useful semantic information.
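A harness for that test might look like the sketch below; `embed` and `train_and_score` stand in for whatever encoder and evaluation protocol the paper actually uses and are injected here as hypothetical callables.

```python
# Hypothetical ablation harness: shuffle descriptions across samples so each
# clip is conditioned on mismatched text, then compare prediction accuracy.
import random
from typing import Callable, Sequence

def mismatch_ablation(
    descriptions: Sequence[str],
    labels: Sequence[float],
    embed: Callable,            # text batch -> embedding matrix (assumed)
    train_and_score: Callable,  # (embeddings, labels) -> accuracy (assumed)
    seed: int = 0,
) -> tuple[float, float]:
    shuffled = list(descriptions)
    random.Random(seed).shuffle(shuffled)
    acc_true = train_and_score(embed(list(descriptions)), labels)
    acc_mismatched = train_and_score(embed(shuffled), labels)
    # If acc_mismatched drops to the handcrafted-only baseline, the gain was
    # carried by the descriptions' semantics, not the conditioning machinery.
    return acc_true, acc_mismatched
```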
Original abstract
Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LaScA, a framework for affective dynamics modeling that converts handcrafted facial and acoustic features into natural language descriptions, encodes them using a pretrained language model to produce semantic embeddings as priors, and uses these to predict changes in Valence and Arousal. It reports consistent accuracy improvements on the Aff-Wild2 and SEWA datasets compared to handcrafted-only and deep-embedding baselines, emphasizing interpretability and efficiency over end-to-end approaches.
Significance. Should the central claims be substantiated with rigorous controls and quantitative evidence, this approach could offer a valuable middle ground in affective computing between fully interpretable but limited handcrafted methods and high-performing but opaque deep models. The integration of domain knowledge via language descriptions with LM capabilities is promising for scalable, expert-refinable systems.
major comments (2)
- [Abstract and Experimental Results] The abstract claims 'consistent improvements in accuracy for both Valence and Arousal' without providing any numerical values, details on statistical tests, error bars, or cross-validation procedures. This absence makes it impossible to gauge the practical significance or reliability of the reported gains, which is critical for validating the 'without sacrificing predictive performance' assertion.
- [Methodology and Ablation Studies] The pipeline relies on transforming features to text and then LM encoding, but no experiments are described that control for non-semantic effects, such as using fixed language descriptions with random or non-pretrained embeddings. Without such isolation, it is unclear if performance benefits stem from semantic priors or from the transformation process itself, threatening the attribution to 'semantic conditioning' as a high-level prior.
minor comments (1)
- [Abstract] The abstract would benefit from including specific quantitative results and more details on the generation and validation of language descriptions to enhance clarity and reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of clarity and experimental rigor. We address each major comment below and outline planned revisions to the manuscript.
Point-by-point responses
-
Referee: [Abstract and Experimental Results] The abstract claims 'consistent improvements in accuracy for both Valence and Arousal' without providing any numerical values, details on statistical tests, error bars, or cross-validation procedures. This absence makes it impossible to gauge the practical significance or reliability of the reported gains, which is critical for validating the 'without sacrificing predictive performance' assertion.
Authors: We agree that the abstract would be strengthened by including specific quantitative results. The body of the manuscript reports accuracy metrics, statistical significance, error bars, and cross-validation details for the Aff-Wild2 and SEWA datasets. In the revised version, we will update the abstract to concisely include key numerical improvements and reference the evaluation protocol, ensuring the claims are better substantiated without exceeding length constraints.
Revision: yes
-
Referee: [Methodology and Ablation Studies] The pipeline relies on transforming features to text and then LM encoding, but no experiments are described that control for non-semantic effects, such as using fixed language descriptions with random or non-pretrained embeddings. Without such isolation, it is unclear if performance benefits stem from semantic priors or from the transformation process itself, threatening the attribution to 'semantic conditioning' as a high-level prior.
Authors: We acknowledge the importance of isolating semantic effects for stronger attribution. Our current comparisons to handcrafted-only and deep-embedding baselines already provide evidence that gains arise from the language-conditioned approach rather than feature transformation alone. To further address this concern, we will add an ablation using fixed descriptions with random embeddings in the revised manuscript, allowing direct quantification of the pretrained LM's semantic contribution.
Revision: yes
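One possible shape for that promised control, assuming NumPy and keeping everything downstream fixed: replace the LM embeddings with random vectors matched in shape and per-row norm, so that only the semantic structure is destroyed. This is our reconstruction of the idea, not the authors' code.

```python
# Sketch of a random-embedding control: same shape and per-row norm as the
# LM embeddings, but no semantic content.
import numpy as np

def random_control(lm_embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    rand = rng.standard_normal(lm_embeddings.shape)
    # Match each row's norm so scale effects cannot explain any accuracy gap.
    target = np.linalg.norm(lm_embeddings, axis=1, keepdims=True)
    rand *= target / np.linalg.norm(rand, axis=1, keepdims=True)
    return rand
```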
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an empirical pipeline: handcrafted facial/acoustic features are converted to natural-language descriptions, encoded by a pretrained LM into semantic embeddings, and used as conditioning priors for valence/arousal change prediction. Performance is assessed via accuracy improvements on the external Aff-Wild2 and SEWA datasets against handcrafted-only and deep-embedding baselines. No equations, fitting procedures, or self-citations are presented that would reduce any claimed prediction or first-principles result to the inputs by construction. The central claim rests on experimental outcomes rather than self-definitional steps, fitted-input renamings, or load-bearing self-citations.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "semantic context embeddings that act as high-level priors over affective dynamics... fusion... preference learner"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "handcrafted facial geometry and acoustic features... symbolic natural-language descriptions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Vrushank Ahire, Kunal Shah, Mudasir Khan, Nikhil Pakhale, Lownish Sookha, Mudasir Ganaie, and Abhinav Dhall. MAVEN: Multi-modal attention for valence-arousal emotion network. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5789–5799, 2025.
- [2] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- [3] Matthew Barthet, Chintan Trivedi, Kosmas Pinitas, Emmanouil Xylakis, Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis. Knowing your annotator: Rapidly testing the reliability of affect annotation. arXiv preprint arXiv:2308.16029, 2023.
- [4] Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018.
- [5] Ju Eun Cho, Se Yeon Kang, Yi Yeon Hong, and Han Jong Jun. Predicting architectural space preferences using EEG-based emotion analysis: A CNN-LSTM approach. Applied Sciences, 15(8):4217, 2025.
- [6] Kateryna Chumachenko, Alexandros Iosifidis, and Moncef Gabbouj. MMA-DFER: Multimodal adaptation of unimodal models for dynamic facial expression recognition in-the-wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- [7] Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. EmotiW 2018: Audio-video, student engagement and group-level affect prediction. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 653–656, 2018.
- [8] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. In NeurIPS, 2022.
- [9] Dimitrios Kollias. ABAW: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2328–2336, 2022.
- [10] Dimitrios Kollias and Stefanos Zafeiriou. Aff-Wild2: Extending the Aff-Wild database for affect recognition. arXiv preprint arXiv:1811.07770, 2018.
- [11] Dimitrios Kollias and Stefanos Zafeiriou. Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. arXiv preprint arXiv:1910.04855, 2019.
- [12] Dimitrios Kollias and Stefanos Zafeiriou. Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792, 2021.
- [13] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A. Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, 127(6):907–929, 2019.
- [14] Jean Kossaifi, Robert Walecki, Yannis Panagakis, Jie Shen, Maximilian Schmitt, Fabien Ringeval, Jing Han, Vedhas Pandit, Antoine Toisoul, Björn Schuller, et al. SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3):1022–1040, 2019.
- [15] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [16] Reza Lotfian and Carlos Busso. Practical considerations on the use of preference learning for ranking emotional speech. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5205–5209. IEEE, 2016.
- [17] Bowen Ma, Rudong An, Wei Zhang, Yu Ding, Zeng Zhao, Rongsheng Zhang, Tangjie Lv, Changjie Fan, and Zhipeng Hu. Facial action unit detection and intensity estimation from self-supervised representation. arXiv preprint arXiv:2210.15878, 2022.
- [18] Konstantinos Makantasis, Kosmas Pinitas, Antonios Liapis, and Georgios N. Yannakakis. The invariant ground truth of affect. In 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pages 1–8. IEEE, 2022.
- [19] Konstantinos Makantasis, Kosmas Pinitas, Antonios Liapis, and Georgios N. Yannakakis. From the lab to the wild: Affect modeling via privileged information. IEEE Transactions on Affective Computing, 15(2):380–392, 2023.
- [20] Bishwas Mishra, Steven L. Fernandes, K. Abhishek, Aishwarya Alva, Chaithra Shetty, Chandan V. Ajila, Dhanush Shetty, Harshitha Rao, and Priyanka Shetty. Facial expression recognition using feature based techniques and model based techniques: A survey. In 2015 2nd International Conference on Electronics and Communication Systems (ICECS), pages 589–594. IEEE, 2015.
- [21] Abinay Reddy Naini, Fernando Diaz, and Carlos Busso. RankList: A listwise preference learning framework for predicting subjective preferences. arXiv preprint arXiv:2508.09826, 2025.
- [22] Kosmas Pinitas, Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis. RankNEAT: Outperforming stochastic gradient search in preference learning tasks. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1084–1092, 2022.
- [23] Kosmas Pinitas, Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis. Supervised contrastive learning for affect modelling. In Proceedings of the 2022 International Conference on Multimodal Interaction, pages 531–539, 2022.
- [24] Kosmas Pinitas, Nemanja Rasajski, Matthew Barthet, Maria Kaselimi, Konstantinos Makantasis, Antonios Liapis, and Georgios N. Yannakakis. Varying the context to advance affect modelling: A study on game engagement prediction. In 2024 12th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 194–202. IEEE, 2024.
- [25] Kosmas Pinitas, Konstantinos Makantasis, and Georgios Yannakakis. Privileged contrastive pretraining for multimodal affect modelling. In Proceedings of the 27th International Conference on Multimodal Interaction, pages 317–325, 2025.
- [26] Lixiong Qin, Mei Wang, Chao Deng, Ke Wang, Xi Chen, Jiani Hu, and Weihong Deng. SwinFace: A multi-task transformer for face recognition, expression recognition, age estimation and attribute estimation. IEEE Transactions on Circuits and Systems for Video Technology, 34(4):2223–2234, 2024.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [28] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
- [29] Philipp V. Rouast, Marc T. P. Adam, and Raymond Chiong. Deep learning for human affect recognition: Insights and new developments. IEEE Transactions on Affective Computing, 12(2):524–543, 2019.
- [30] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
- [31] Garima Sharma and Abhinav Dhall. A survey on automatic multimodal emotion recognition in the wild. In Advances in Data Science: Methodologies and Applications, pages 35–64. Springer, 2020.
- [32] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867, 2020.
- [33] Licai Sun, Zheng Lian, Bin Liu, and Jianhua Tao. HiCMAE: Hierarchical contrastive masked autoencoder for self-supervised audio-visual emotion recognition. Information Fusion, 108:102382, 2024.
- [34] Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 3–10, 2016.
- [35] Qiyang Wan, Chengzhi Gao, Ruiping Wang, and Xilin Chen. A survey on interpretability in visual recognition. arXiv preprint arXiv:2507.11099, 2025.
- [36] Wenli Yang, Yuchen Wei, Hanyu Wei, Yanyu Chen, Guan Huang, Xiang Li, Renjie Li, Naimeng Yao, Xinyi Wang, Xiaotong Gu, et al. Survey on explainable AI: From approaches, limitations and applications aspects. Human-Centric Intelligent Systems, 3(3):161–188, 2023.
- [37] Zahra Yousefi, Fiona C. Bakar, Massimiliano de Zambotti, and Mohamad Forouzanfar. DeepArousal-Net: A multi-block recurrent deep learning model for proactive forecasting of non-apneic arousals from multichannel PSG. IEEE Transactions on Biomedical Engineering, 2025.
- [38] Rajamanickam Yuvaraj, Rakshit Mittal, A. Amalin Prince, and Jun Song Huang. Affective computing for learning in education: A systematic review and bibliometric analysis. Education Sciences, 15(1):65, 2025.
- [39] Stefanos Zafeiriou, Athanasios Papaioannou, Irene Kotsia, Mihalis Nicolaou, and Guoying Zhao. Facial affect "in-the-wild": A survey and a new database. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2016.
- [40] Qiuhai Zeng, Claire Jin, Xinyue Wang, Yuhan Zheng, and Qunhua Li. An analyst-inspector framework for evaluating reproducibility of LLMs in data science. arXiv e-prints, arXiv–2502, 2025.
- [41] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, 2023.
- [42] Xinjie Zhang, Tenggan Zhang, Lei Sun, Jinming Zhao, and Qin Jin. Exploring interpretability in deep learning for affective computing: A comprehensive review. ACM Transactions on Multimedia Computing, Communications and Applications, 21(7):1–28, 2025.
Supplementary material (excerpt)
Description-generation prompt. Each feature description pairs a brief physical cue with a brief affect cue. STRICT RULES:
- Output must be a valid Python dictionary
- No explanations outside the dictionary
- Each value 3 to 6 words
- No punctuation inside values
- Format: "<expression cue> <affect cue>"
- Video: describe visible movement
- Audio: describe acoustic change
- Avoid deterministic language
- Keep vocabulary consistent
- Optimise for SentenceTransformer embeddings
- Prioritise reproducibility
Example format: 'feature_name': 'feature meaning and affect indication'. Return only the dictionary.
B. Description Generation
B.1. Audio Descriptors
Table 8. Affect-aware semantic labels for MFCC acoustic features generated offline.
Feature Name | Semantic Label
mfcc_0 | high arousal energy
mfcc_1 | dominant low tone
mfcc_2 | neutral s...
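Read literally, those rules would admit an output like the sketch below; the feature keys and wordings are invented for illustration and do not come from the paper's tables.

```python
# Hypothetical output satisfying the listed rules: a plain Python dictionary,
# values of 3 to 6 words, no punctuation, "<expression cue> <affect cue>".
{
    "brow_raise": "brows lift upward possible surprise",
    "mouth_open": "jaw drops slightly mild arousal",
    "mfcc_0": "energy rises sharply high arousal",
}
```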