
arXiv: 2605.02672 · v1 · submitted 2026-05-04 · cs.AI · cs.CL · cs.HC

The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge

Pith reviewed 2026-05-08 19:07 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.HC
keywords dyadic conversations · interpersonal influence · turn-taking prediction · rapport trajectory · multimodal modeling · social dynamics · benchmark challenge

The pith

A new benchmark with three sub-challenges on 945 conversations pushes models to predict how partners influence each other, time their turns, and develop rapport.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a workshop and challenge to address the limitation that most conversational affect modeling remains focused on individual speakers. It proposes coordinated tasks that target directional influence between partners, the prediction of who speaks next and when, and the trajectory of rapport across entire interactions. These tasks rest on a shared collection of naturalistic conversations recorded in multiple languages. The work supplies fixed data splits, evaluation metrics, and initial baseline results to enable direct comparison of approaches. If the benchmark succeeds, progress would shift from isolated speaker analysis toward coupled, time-evolving interpersonal processes.

Core claim

The challenge establishes three tasks on one dataset: directional interpersonal influence prediction, next-speaker and time-to-next-speech turn-taking prediction, and rapport trajectory prediction across full interactions, with baseline results showing that current methods reach 0.40 CCC for influence, 0.66 Macro-F1 for turn-taking, and 0.68 CCC for rapport yet still fall short on directional dependence and long-horizon dynamics.
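The turn-taking figures quoted above pair a classification metric (Macro-F1 for next-speaker) with a regression metric (MAE, in seconds, for time-to-next-speech). As a rough illustration of what those two numbers measure, here is a minimal plain-Python sketch. The function names, the label strings, and the toy values are assumptions made for this example, and the averaging convention (unweighted mean of per-class F1 over every label seen in either sequence) is assumed rather than taken from the challenge's released evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard kept for safety.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same unit as the inputs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical next-speaker labels and time-to-next-speech values (seconds).
speakers_true = ["A", "A", "B", "B"]
speakers_pred = ["A", "B", "B", "B"]
gaps_true = [0.5, 2.0, 1.0]
gaps_pred = [1.0, 1.0, 1.0]

print(macro_f1(speakers_true, speakers_pred))
print(mean_absolute_error(gaps_true, gaps_pred))
```

Because Macro-F1 weights each class equally, a model that always predicts the more frequent next speaker is penalized more heavily than it would be under plain accuracy.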

What carries the argument

The three coordinated sub-challenges built on a shared dataset of 945 dyadic conversations, which supplies multimodal signals, fixed train-validation-test splits, and standardized metrics to support temporal reasoning and cross-context evaluation.

Load-bearing premise

The collected conversations supply representative coverage of directional influence, timing coordination, and rapport development so that the chosen metrics and splits will drive progress beyond speaker-centric methods.

What would settle it

A demonstration that models achieving strong scores on the three sub-challenges show no improvement over existing speaker-centric baselines when tested on separate, independent sets of dyadic interactions.

Figures

Figures reproduced from arXiv: 2605.02672 by Alice Baird, Emilia Parada-Cabaleiro, Jakub Piotr Clapa, Jeffrey Brooks, Jens Madsen, Lukas Stappen, Panagiotis Tzirakis, Sharath Rao, Theo Lebryk.

Figure 1: Example frames from the Hume DaiKon challenge dataset page
Figure 2: Arousal–valence visualization from the public Hume DaiKon chal…
Original abstract

The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50 s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction.
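As a point of reference for the two agreement metrics the abstract reports side by side, the following is a minimal sketch of Lin's Concordance Correlation Coefficient next to Pearson correlation, in plain Python. The function names and toy values are illustrative assumptions, not the challenge's released evaluation code, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
from math import sqrt
from statistics import fmean


def _moments(y_true, y_pred):
    """Means, population variances, and population covariance of two series."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    return mean_t, mean_p, var_t, var_p, cov


def concordance_cc(y_true, y_pred):
    """Lin's CCC: 2*cov / (var_t + var_p + (mean_t - mean_p)^2)."""
    mean_t, mean_p, var_t, var_p, cov = _moments(y_true, y_pred)
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    # Assumed convention: undefined case (both series constant and equal) -> 0.0
    return 2 * cov / denom if denom else 0.0


def pearson_r(y_true, y_pred):
    """Pearson correlation: cov / (std_t * std_p)."""
    _, _, var_t, var_p, cov = _moments(y_true, y_pred)
    denom = sqrt(var_t * var_p)
    # Assumed convention: a constant series has no defined correlation -> 0.0
    return cov / denom if denom else 0.0


# Toy annotation trace and a prediction that tracks it with a constant offset.
target = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]

print(pearson_r(target, shifted))
print(concordance_cc(target, shifted))
```

On the toy series the prediction is perfectly correlated with the target, so Pearson is 1, while CCC drops to 4/7 (roughly 0.57) because the constant offset enters the denominator. This is one reason a system can score a higher Pearson than CCC on the same predictions.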

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript announces the 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge, introducing a benchmark for interpersonal affect and social dynamics in dyadic conversations. It defines three sub-challenges on the shared Hume-DaiKon dataset (945 naturalistic conversations across five languages, 743.4 hours): (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction. The benchmark supplies fixed train/validation/test splits, metrics (CCC, Pearson correlation, Macro-F1, MAE), released baseline systems, and initial test scores (0.40 CCC / 0.50 Pearson for influence; 0.66 Macro-F1 / 1.50 s MAE for turn-taking; 0.68 CCC / 0.70 Pearson for rapport) that are presented as evidence that current methods capture only coarse patterns.

Significance. If the benchmark is widely adopted, it could meaningfully shift affective computing research toward coupled, time-evolving dyadic processes rather than speaker-centric models, providing a reproducible platform for multimodal, temporal, and cross-lingual work. The release of a large naturalistic multilingual corpus together with baseline code and fixed evaluation protocols is a concrete strength that supports community-driven progress.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of the benchmark's potential impact on shifting affective computing toward dyadic and temporal modeling, and the recommendation to accept. We appreciate the acknowledgment of the dataset release, fixed splits, baselines, and evaluation protocols as concrete strengths.

Circularity Check

0 steps flagged

No circularity; benchmark definition is independent of any model or derivation


This is a workshop/challenge announcement paper that defines a new benchmark (three sub-challenges on the Hume-DaiKon dataset with fixed splits, metrics, and released baselines) without any claimed derivation, first-principles prediction, or mathematical chain. The dataset, splits, evaluation protocols (CCC, Pearson, Macro-F1, MAE), and baseline reference scores are presented as design choices and empirical starting points, not as outputs derived from or fitted to the challenge itself. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to justify load-bearing premises; the central claim is simply that the proposed benchmark will enable future progress. The announcement is therefore self-contained and contains no steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark and workshop announcement paper; it introduces no mathematical derivations, fitted parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5651 in / 1142 out tokens · 22116 ms · 2026-05-08T19:07:07.343388+00:00 · methodology

