Emotion Recognition in Sign Language Conversation

Keyu Mao; Kotaro Funakoshi; Minghao Shao; Takao Obi; Yusong Wang

arxiv: 2605.23328 · v1 · pith:QPMJQCYCnew · submitted 2026-05-22 · 💻 cs.CL

Pith reviewed 2026-05-25 04:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords: sign language · emotion recognition in conversation · domain gap · multimodal models · visual extractors · conversational datasets · affective computing · eJSL Dialog dataset

The pith

Generic multimodal emotion recognition models exhibit a domain gap on sign language conversations, requiring specialized context-aware visual extractors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends emotion recognition in conversation (ERC) to sign language video by releasing the eJSL Dialog dataset, built from STUDIES corpus scripts. The dataset supplies 1,920 video samples across 480 dialogues, addressing the limitation that prior sign language emotion data focuses primarily on isolated sentences. Benchmarking experiments cover models ranging from isolated visual networks to full multimodal conversational architectures and report a domain gap when generic multimodal conversational models are applied to signing. The authors conclude that sign language needs visual extractors that explicitly incorporate dialogue context rather than generic multimodal pipelines, and that larger-scale conversational sign language corpora are a necessary next step for effective pre-training.

Core claim

The central claim is that generic multimodal conversational emotion recognition models display a domain gap when transferred to sign language, as shown by systematic benchmarking on the new eJSL Dialog dataset of 1920 videos in 480 dialogues; this gap demonstrates the explicit need for context-aware visual extractors tailored to sign language and for expanded conversational datasets that enable large-scale pre-training.

What carries the argument

The eJSL Dialog dataset of 1920 sign language video samples in 480 unique dialogues, used to benchmark models ranging from isolated visual networks to multimodal conversational architectures.
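The two counts above imply an average of four video samples per dialogue (1920 / 480). A minimal Python sketch of how such a corpus might be grouped into dialogues is given here. The field names (`dialogue_id`, `turn`, `label`), the placeholder label string, and the assumption that every dialogue has exactly four turns are illustrative and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    dialogue_id: int  # hypothetical identifier, not the dataset's real schema
    turn: int         # position of the video sample within its dialogue
    label: str        # hypothetical emotion label


def group_by_dialogue(samples):
    """Return {dialogue_id: [samples sorted by turn]}."""
    dialogues = defaultdict(list)
    for s in samples:
        dialogues[s.dialogue_id].append(s)
    for turns in dialogues.values():
        turns.sort(key=lambda s: s.turn)
    return dict(dialogues)


# Synthetic stand-in with the reported totals: 480 dialogues, 1920 samples.
# A uniform four turns per dialogue is an assumption made for this sketch.
samples = [Sample(d, t, "neutral") for d in range(480) for t in range(4)]
dialogues = group_by_dialogue(samples)
mean_turns = len(samples) / len(dialogues)
```

Grouping by dialogue first also makes it straightforward to keep all turns of one dialogue in the same data split, which matters once models consume dialogue history.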

If this is right

Models trained only on isolated sign language sentences underperform in conversational settings because they cannot use dialogue history.
Multimodal architectures trained on spoken-language data transfer poorly to signing without visual extractor changes.
Context must be modeled jointly with sign-specific visual features rather than added as an afterthought.
Future progress in sign language emotion recognition depends on scaling conversational video corpora for pre-training.
Real-world deployment of affective sign language systems will require extractors that process manual and non-manual signals across turns.
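The difference between isolated and context-aware input that runs through these points can be sketched as follows. This is an illustration only: the function name `with_context`, the `window` parameter, and the string placeholders standing in for per-utterance visual features are assumptions, not the paper's pipeline.

```python
def with_context(utterances, window=2):
    """Pair each utterance with up to `window` preceding utterances.

    With window=0 every utterance is seen in isolation; with window>0 the
    model input also carries the preceding turns of the same dialogue.
    """
    pairs = []
    for i, current in enumerate(utterances):
        history = utterances[max(0, i - window):i]
        pairs.append((history, current))
    return pairs


# Placeholders for the per-utterance features of one four-turn dialogue.
dialogue = ["u0", "u1", "u2", "u3"]
isolated = with_context(dialogue, window=0)
contextual = with_context(dialogue, window=2)
```

In the isolated setting the history is always empty, while in the contextual setting the last turn is paired with the two turns before it.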

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed gap may arise because emotional cues in signing rely more heavily on simultaneous non-manual signals and spatial grammar than on sequential prosody.
Architectures originally developed for spoken ERC could be adapted by replacing their visual backbone with one pre-trained on large sign language corpora.
The same benchmarking approach could be applied to other under-resourced visual communication systems such as gesture-based or tactile languages.
If larger datasets close the gap, existing multimodal ERC codebases might be reused with only modest visual front-end changes.

Load-bearing premise

The eJSL Dialog dataset constructed from STUDIES corpus scripts is representative enough of real-world sign language conversational dynamics to support conclusions about domain gaps.

What would settle it

A controlled experiment in which a generic multimodal conversational model achieves parity with sign-language-adapted models on an independently collected, larger sign language dialogue corpus without any domain-specific modifications.
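One way such a parity check could be quantified is a paired bootstrap over the accuracy difference between two models evaluated on the same test samples. The sketch below is a generic statistical illustration, not a procedure described in the paper; the function name, the toy labels, and the toy predictions are all hypothetical.

```python
import random


def paired_bootstrap_diff(gold, pred_a, pred_b, n_resamples=2000, seed=0):
    """Return (observed, low, high) for accuracy(pred_a) - accuracy(pred_b).

    low and high bound a 95% percentile bootstrap interval.
    """
    rng = random.Random(seed)
    n = len(gold)
    # Per-sample score difference: +1, 0 or -1.
    deltas = [(a == g) - (b == g) for g, a, b in zip(gold, pred_a, pred_b)]
    means = []
    for _ in range(n_resamples):
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(deltas) / n, low, high


# Toy data: the "generic" model is wrong on a few samples, the "adapted"
# model is right everywhere. Real predictions would replace these lists.
gold = ["joy", "sad", "neutral", "joy"] * 25
pred_adapted = list(gold)
pred_generic = list(gold)
pred_generic[:10] = ["neutral"] * 10
diff, low, high = paired_bootstrap_diff(gold, pred_generic, pred_adapted)
```

An interval that sits inside a pre-chosen equivalence margin around zero would be consistent with parity; the margin itself is a design choice. Because samples from the same dialogue are correlated, resampling whole dialogues rather than individual samples would likely be the more appropriate unit; this sketch resamples individual samples for brevity.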

Figures

Figures reproduced from arXiv: 2605.23328 by Keyu Mao, Kotaro Funakoshi, Minghao Shao, Takao Obi, Yusong Wang.

**Figure 2.** Emotion transition structure on eJSL Dialogue. Panel (A) reports …

**Figure 3.** Case studies illustrating complementary roles of conversational …
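Emotion transition structure of the kind the Figure 2 caption refers to is often summarised by counting how frequently one emotion label follows another within a dialogue. The sketch below shows that general technique under stated assumptions; the label names and toy sequences are invented, and it is not claimed to reproduce what the figure's panels actually report.

```python
from collections import Counter


def transition_counts(label_sequences):
    """Count consecutive (previous, next) label pairs within each dialogue."""
    counts = Counter()
    for labels in label_sequences:
        counts.update(zip(labels, labels[1:]))
    return counts


def row_normalise(counts):
    """Convert pair counts into P(next | previous)."""
    totals = Counter()
    for (prev, _), c in counts.items():
        totals[prev] += c
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}


# Two toy four-turn dialogues with hypothetical emotion labels.
sequences = [
    ["neutral", "joy", "joy", "neutral"],
    ["sad", "sad", "neutral", "joy"],
]
counts = transition_counts(sequences)
probs = row_normalise(counts)
```

Transitions are counted only inside a dialogue, so the last turn of one dialogue is never paired with the first turn of the next.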

Original abstract

Emotion Recognition in Conversation is a core component of affective computing, while current resources of sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances demonstrate degraded performance in real world scenarios because they cannot utilize historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre-training is a necessary next step for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper adds a conversational sign language ERC dataset from scripted material and shows generic models lag, but the scripting undercuts how much the domain gap tells us about real sign language.

The main takeaway is straightforward: they built eJSL Dialog with 1920 video samples across 480 dialogues pulled from the STUDIES corpus scripts, then ran ERC benchmarks on it. Existing multimodal conversational models drop in performance compared to what they do on spoken-language data, which points to a need for sign-language-specific visual context handling and bigger pre-training sets.

That dataset construction and the systematic model sweep are the concrete new pieces here. Prior sign language emotion work stayed with isolated sentences, so moving to dialogue flow is a clear step forward on the resource side. The benchmarking itself is useful as a first map of where off-the-shelf ERC approaches break.

The soft spot sits right at the data source. Scripted dialogues tend to lack the spontaneous turn-taking, prosody shifts, and integrated gestures that show up in actual sign language conversations. Any performance gap could trace to that artificial structure rather than something inherent to signing. The abstract gives no sign they validated the enacted scripts against spontaneous recordings or measured how close the dynamics match real use. Without that check, the call for specialized context-aware extractors rests on evidence that may not travel outside the scripted setting.

This work is aimed at people building affective tools for deaf and hard-of-hearing users or extending ERC beyond speech. A reader already working on sign language datasets or accessibility models could pull the resource and the baseline numbers for follow-up experiments. It is worth sending to peer review so referees can examine the exact model setups, any error analysis, and whether the authors have plans to test against unscripted data. The core idea of conversational context matters, but the current evidence needs that extra grounding before the stronger claims land cleanly.

Referee Report

1 major / 1 minor

Summary. The paper introduces the ERC task to sign language by releasing the eJSL Dialog dataset (1,920 video samples in 480 dialogues) constructed from scripts of the STUDIES corpus. It benchmarks isolated visual networks through multimodal conversational models and reports a domain gap for generic ERC architectures on sign language data, concluding that context-aware visual extractors specific to sign language are required and that larger-scale conversational pre-training data are needed.

Significance. The dataset introduction fills a clear resource gap, as prior sign-language emotion datasets are limited to isolated utterances. If the domain-gap result is robust, the work usefully flags that off-the-shelf multimodal conversational models do not transfer directly and motivates sign-language-specific modeling. The empirical benchmarking itself is a concrete starting point, though its interpretive weight depends on dataset representativeness.

major comments (1)

[Dataset Construction] Dataset Construction (abstract and §3): the dataset is assembled from scripted STUDIES corpus material. Scripted dialogues typically omit spontaneous turn-taking, prosodic variation, and co-speech gesture integration that characterize natural sign-language conversation. Any measured performance drop versus generic ERC models could therefore arise from artificiality rather than sign-language-specific visual features, weakening the claim that specialized context-aware extractors are demonstrably required.

minor comments (1)

[Abstract] Abstract: the phrasing 'eJSL Dialog dataset' should be accompanied by a brief expansion or consistent acronym on first use to aid readers unfamiliar with the STUDIES corpus.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below.

Referee: [Dataset Construction] Dataset Construction (abstract and §3): the dataset is assembled from scripted STUDIES corpus material. Scripted dialogues typically omit spontaneous turn-taking, prosodic variation, and co-speech gesture integration that characterize natural sign-language conversation. Any measured performance drop versus generic ERC models could therefore arise from artificiality rather than sign-language-specific visual features, weakening the claim that specialized context-aware extractors are demonstrably required.

Authors: We agree that the scripted nature of the STUDIES corpus dialogues represents a limitation, as it may not fully reflect spontaneous turn-taking, prosody, or gesture integration present in natural sign-language conversation. This could indeed contribute to the observed performance differences and tempers the strength of the claim that the domain gap is attributable solely to sign-language-specific visual features. We will revise Section 3 to expand the discussion of dataset construction limitations and will moderate the abstract and conclusion statements to note that the results motivate sign-language-specific modeling while acknowledging the need for validation on spontaneous data.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset introduction and benchmarking

full rationale

The paper introduces the eJSL Dialog dataset from the existing STUDIES corpus scripts and reports benchmarking results across visual and multimodal models. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims about domain gap rest on direct experimental comparisons rather than self-referential definitions or self-citation chains. The work is self-contained against external benchmarks and falsifiable via new data collection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the STUDIES corpus provides suitable scripts for constructing representative conversational sign language data and that the selected benchmarking models adequately represent generic multimodal approaches. No free parameters or invented entities are evident from the abstract.

axioms (1)

domain assumption: The STUDIES corpus scripts accurately represent natural sign language conversational dynamics for emotion labeling.
Dataset is constructed using scripts from the STUDIES corpus.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

