The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge
Pith reviewed 2026-05-08 19:07 UTC · model grok-4.3
The pith
A new benchmark with three sub-challenges on 945 conversations pushes models to predict how partners influence each other, time their turns, and develop rapport.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The challenge establishes three tasks on one dataset: directional interpersonal influence prediction, turn-taking prediction (next speaker and time to next speech), and rapport trajectory prediction across full interactions. The best baseline test results reach 0.40 CCC for influence, 0.66 Macro-F1 for turn-taking, and 0.68 CCC for rapport, yet the baselines still fall short on directional dependence and long-horizon dynamics.
What carries the argument
The three coordinated sub-challenges built on a shared dataset of 945 dyadic conversations, which supplies multimodal signals, fixed train-validation-test splits, and standardized metrics to support temporal reasoning and cross-context evaluation.
Load-bearing premise
The collected conversations supply representative coverage of directional influence, timing coordination, and rapport development so that the chosen metrics and splits will drive progress beyond speaker-centric methods.
What would settle it
A demonstration that models achieving strong scores on the three sub-challenges show no improvement over existing speaker-centric baselines when tested on separate, independent sets of dyadic interactions.
Original abstract
The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50 s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction.
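The abstract reports both CCC and Pearson for the influence and rapport tasks, and the two can diverge: Pearson rewards any linear relationship, while CCC also penalizes offsets in mean and scale. The following is a minimal sketch of the standard textbook definitions, not the challenge's official scoring code. The function names `pearson` and `ccc` are illustrative, and the use of population (divide-by-n) variance is an assumption.

```python
from math import sqrt
from statistics import fmean


def pearson(y_true, y_pred):
    """Pearson correlation; assumes neither sequence is constant."""
    mt, mp = fmean(y_true), fmean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient with population variance.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred)^2)
    Assumption: the official scorer may use a different variance convention.
    """
    n = len(y_true)
    mt, mp = fmean(y_true), fmean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred)) / n
    var_t = sum((t - mt) ** 2 for t in y_true) / n
    var_p = sum((p - mp) ** 2 for p in y_pred) / n
    return 2 * cov / (var_t + var_p + (mt - mp) ** 2)


# Toy data (made up): predictions track the labels perfectly but sit 0.2 too high.
labels = [0.1, 0.4, 0.5, 0.9]
shifted = [v + 0.2 for v in labels]

r = pearson(labels, shifted)   # the shift does not affect Pearson
c = ccc(labels, shifted)       # the shift lowers CCC below 1
```

On this toy input Pearson stays at 1 while CCC drops, which is why a gap between the two scores can hint at a calibration offset rather than a failure to track the signal.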
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript announces the 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge, introducing a benchmark for interpersonal affect and social dynamics in dyadic conversations. It defines three sub-challenges on the shared Hume-DaiKon dataset (945 naturalistic conversations across five languages, 743.4 hours): (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction. The benchmark supplies fixed train/validation/test splits, metrics (CCC, Pearson correlation, Macro-F1, MAE), released baseline systems, and initial test scores (0.40 CCC / 0.50 Pearson for influence; 0.66 Macro-F1 / 1.50 s MAE for turn-taking; 0.68 CCC / 0.70 Pearson for rapport) that are presented as evidence that current methods capture only coarse patterns.
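For the turn-taking scores quoted in the summary, the two metrics measure different things: Macro-F1 scores the next-speaker classification, and MAE scores the time-to-next-speech regression in seconds. The sketch below uses the standard definitions under stated assumptions. The function names, the speaker labels "A" and "B", and the toy values are hypothetical, and the official evaluation may define the label set differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Score a class as 0 when it has no true or predicted instances.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the inputs (seconds here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy next-speaker labels (hypothetical speakers "A" and "B").
true_speaker = ["A", "A", "B", "B", "A", "B"]
pred_speaker = ["A", "B", "B", "B", "A", "A"]
f1 = macro_f1(true_speaker, pred_speaker)

# Toy time-to-next-speech values in seconds (made up).
true_gap = [0.5, 1.0, 2.0]
pred_gap = [1.0, 1.0, 3.5]
gap_error = mae(true_gap, pred_gap)
```

Because Macro-F1 averages classes without weighting by frequency, a model that always predicts the more frequent speaker is penalized on the other class.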
Significance. If the benchmark is widely adopted, it could meaningfully shift affective computing research toward coupled, time-evolving dyadic processes rather than speaker-centric models, providing a reproducible platform for multimodal, temporal, and cross-lingual work. The release of a large naturalistic multilingual corpus together with baseline code and fixed evaluation protocols is a concrete strength that supports community-driven progress.
Simulated Author's Rebuttal
We thank the referee for the positive review, the recognition of the benchmark's potential impact on shifting affective computing toward dyadic and temporal modeling, and the recommendation to accept. We appreciate the acknowledgment of the dataset release, fixed splits, baselines, and evaluation protocols as concrete strengths.
Circularity Check
No circularity; benchmark definition is independent of any model or derivation
This is a workshop/challenge announcement paper that defines a new benchmark (three sub-challenges on the Hume-DaiKon dataset with fixed splits, metrics, and released baselines) without any claimed derivation, first-principles prediction, or mathematical chain. The dataset, splits, evaluation protocols (CCC, Pearson, Macro-F1, MAE), and baseline reference scores are presented as design choices and empirical starting points, not as outputs derived from or fitted to the challenge itself. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to justify load-bearing premises; the central claim is simply that the proposed benchmark will enable future progress. The announcement is therefore self-contained and contains no steps that reduce to their own inputs by construction.