Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis
Pith reviewed 2026-05-08 12:15 UTC · model grok-4.3
The pith
A new 20TB multimodal corpus records 45 dyads with synchronized video, thermal imaging, voice, physiology, and stance labels to model interpersonal behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology (PPG, EDA, heart rate, blood pressure, and respiration)) and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. The dataset includes 20TB of multimodal data to share with the research community.
What carries the argument
The Inter-Stance dyadic corpus, a synchronized multimodal recording and annotation resource that pairs visual, thermal, audio, and physiological channels with stance labels for both familiar and unfamiliar pairs.
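To illustrate what "synchronized" buys a downstream user, here is a minimal sketch, not taken from the paper: the signal names, sampling rates, and traces are invented placeholders for resampling two modality streams onto a common analysis clock.

```python
import numpy as np

def align_to_timeline(timestamps, values, timeline):
    """Linearly interpolate a signal onto a common timeline (seconds)."""
    return np.interp(timeline, timestamps, values)

# Hypothetical rates: EDA at 4 Hz, a face-video feature at 30 Hz, 10 s clip.
t_eda = np.arange(0, 10, 1 / 4)
eda = np.sin(t_eda)                      # placeholder EDA trace
t_video = np.arange(0, 10, 1 / 30)
smile = np.cos(t_video)                  # placeholder smile-intensity trace

timeline = np.arange(0, 10, 1 / 25)      # shared 25 Hz analysis clock
eda_25 = align_to_timeline(t_eda, eda, timeline)
smile_25 = align_to_timeline(t_video, smile, timeline)
frames = np.stack([eda_25, smile_25], axis=1)  # (250, 2) aligned frame matrix
```

With every modality on one clock, per-frame feature matrices like `frames` can feed any sequence model of stance or affect.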
If this is right
- Models can now be trained on synchronized multimodal signals to predict stance in conversations.
- Comparisons between familiar and stranger dyads can reveal how interpersonal history shapes multimodal expression of agreement or disagreement.
- Physiological signals can be integrated with visual and audio cues for joint analysis of affect and stance.
- Community-wide development of social signal processing techniques becomes possible with this shared resource.
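The first bullet can be made concrete with a toy sketch. Nothing here comes from the paper: the features, dimensions, and labels are synthetic, and the nearest-centroid classifier merely stands in for whatever multimodal model a user might actually train.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-speaker features; dimensions are invented.
n = 90
face = rng.normal(size=(n, 8))          # e.g. facial action intensities
voice = rng.normal(size=(n, 4))         # e.g. prosodic statistics
physio = rng.normal(size=(n, 3))        # e.g. EDA / heart-rate summaries
labels = rng.integers(0, 3, size=n)     # 0=agree, 1=disagree, 2=neutral

X = np.hstack([face, voice, physio])    # early fusion by concatenation

# Nearest-centroid classifier: one centroid per stance class.
centroids = np.stack([X[labels == c].mean(axis=0) for c in range(3)])

def predict(x):
    """Return the stance class whose centroid is closest to x."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
```

Dropping `voice` or `physio` from the `hstack` gives the single-modality baselines that any fusion gain would be measured against.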
Where Pith is reading between the lines
- The thermal and physiological channels may allow future studies to link subconscious bodily responses to expressed stance in ways single-modality data cannot.
- Real-time stance detection systems for video calls or virtual agents could be trained directly on the synchronized recordings.
- Psychological research on social appraisal could combine this dyadic corpus with existing individual datasets to test how context modifies behavior.
Load-bearing premise
The collected recordings, stance annotations, and self-reports are reliable, and the full 20TB dataset will be released publicly with enough documentation for independent use by other researchers.
What would settle it
Failure to release the complete 20TB dataset with documentation, or inability of independent teams to train multimodal models that improve stance classification over single-modality baselines, would show the enabling claim does not hold.
Original abstract
Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other's postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly-available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, physiology (PPG, EDA, heart-rate, blood pressure, and respiration), and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will make multimodal modeling of social interaction never possible before. The dataset includes 20TB of multimodal data to share with the research community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Inter-Stance, a new dyadic multimodal corpus of 45 dyads (90 participants) featuring synchronized recordings of 2D/3D face video, thermal spectrum, audio/speech, and physiology (PPG, EDA, heart rate, blood pressure, respiration) plus self-reported affect. It distinguishes shared-history from stranger pairs, provides annotations for social signals and stance (agreement/disagreement/neutral), and states that extensive experiments were performed on multimodal dyadic communication and affect. The authors claim the 20 TB dataset will enable multimodal modeling of social interaction that was never possible before and will be shared publicly.
Significance. If the full dataset is released with usable documentation and the stance/affect annotations are validated, the corpus would fill a documented gap in public dyadic multimodal resources that combine multiple synchronized modalities with stance labels. This could support new work on interpersonal dynamics, multimodal fusion, and conversational analysis that existing single-person or unimodal datasets cannot address.
Major comments (3)
- [Abstract] The manuscript asserts that 'extensive experiments' evaluate multimodal dyadic communication of dyads with and without history, yet supplies no quantitative results, baselines, error analysis, or specific findings, leaving the utility of the corpus for the claimed novel modeling unverified.
- [Annotation Procedures] No inter-annotator agreement statistics or other reliability measures are reported for the stance labels (agreement/disagreement/neutral) or social-signal annotations, which are load-bearing for any downstream multimodal modeling claims.
- [Dataset Release] Although the paper states that the 20 TB corpus 'will be shared with the research community,' it provides no repository link, access protocol, metadata documentation, or licensing information required for independent use and verification.
Minor comments (1)
- [Abstract] The title emphasizes 'Conversational Stance Analysis' while the abstract and experiments also highlight affect and emotion induction; a brief clarification of scope would reduce potential reader confusion.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript introducing the Inter-Stance corpus. We appreciate the referee's identification of areas where clarity and completeness can be improved. We address each major comment below and commit to revisions that strengthen the paper without altering its core contributions.
Point-by-point responses
Referee: [Abstract] The manuscript asserts that 'extensive experiments' evaluate multimodal dyadic communication of dyads with and without history, yet supplies no quantitative results, baselines, error analysis, or specific findings, leaving the utility of the corpus for the claimed novel modeling unverified.
Authors: We agree that the abstract would benefit from explicitly summarizing the experimental outcomes to better substantiate the corpus's value for novel multimodal modeling. The full manuscript contains an experiments section that performs multimodal fusion analyses for stance detection and affect recognition, including comparisons between shared-history and stranger dyads. In the revised version, we will update the abstract to concisely report key quantitative results (e.g., fusion performance gains and dyad-type differences), reference the baselines employed, and note the error analysis approach. revision: yes
Referee: [Annotation Procedures] No inter-annotator agreement statistics or other reliability measures are reported for the stance labels (agreement/disagreement/neutral) or social-signal annotations, which are load-bearing for any downstream multimodal modeling claims.
Authors: We concur that inter-annotator agreement metrics are necessary to validate the stance and social-signal annotations. We will expand the annotation section to include these statistics (Cohen's kappa for stance labels and suitable multi-rater measures for social signals), along with details on the number of annotators, annotation guidelines, and any training or adjudication procedures used. revision: yes
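For readers unfamiliar with the metric the rebuttal commits to, here is a self-contained sketch of Cohen's kappa for two raters; the labels below are invented examples, not corpus data.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(r1) == len(r2)
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n       # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[l] * c2[l] for l in set(r1) | set(r2)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Two hypothetical raters labeling ten conversational turns.
a = ["agree", "agree", "neutral", "disagree", "agree",
     "neutral", "neutral", "disagree", "agree", "neutral"]
b = ["agree", "neutral", "neutral", "disagree", "agree",
     "neutral", "agree", "disagree", "agree", "neutral"]
kappa = cohens_kappa(a, b)  # 0.6875: substantial but imperfect agreement
```

Reporting kappa per label category, alongside the multi-rater measures mentioned above, would let readers judge how trustworthy the stance annotations are.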
Referee: [Dataset Release] Although the paper states that the 20 TB corpus 'will be shared with the research community,' it provides no repository link, access protocol, metadata documentation, or licensing information required for independent use and verification.
Authors: We acknowledge that a concrete data-release plan is required for reproducibility. We will revise the availability statement to specify the repository (with DOI), access protocol for the full 20 TB dataset, metadata documentation, and licensing (e.g., CC-BY 4.0). The release will occur upon paper acceptance to allow time for complete documentation and verification. revision: yes
Circularity Check
No circularity: empirical data collection with no derivation chain or fitted predictions.
full rationale
The manuscript presents a new multimodal dyadic interaction corpus (45 dyads, synchronized 2D/3D face, thermal, audio, physiology, and stance/affect annotations) and describes collection/annotation procedures. No equations, models, or first-principles derivations appear; the central claim is that the released dataset will enable future modeling 'never possible before.' This is an empirical contribution whose validity hinges on public release and annotation quality, not on any self-referential reduction of predictions to inputs. No self-citations, ansatzes, or uniqueness theorems are invoked to support a closed derivation. The paper is self-contained as a data resource description and receives the default non-circularity finding.