The 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge

Alice Baird; Emilia Parada-Cabaleiro; Jakub Piotr Clapa; Jeffrey Brooks; Jens Madsen; Lukas Stappen; Panagiotis Tzirakis; Sharath Rao; Theo Lebryk

arxiv: 2605.02672 · v1 · submitted 2026-05-04 · cs.AI · cs.CL · cs.HC

Pith reviewed 2026-05-08 19:07 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.HC

keywords dyadic conversations · interpersonal influence · turn-taking prediction · rapport trajectory · multimodal modeling · social dynamics · benchmark challenge

The pith

A new benchmark with three sub-challenges on 945 conversations pushes models to predict how partners influence each other, time their turns, and develop rapport.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a workshop and challenge to address the limitation that most conversational affect modeling remains focused on individual speakers. It proposes three coordinated tasks that target directional influence between partners, the prediction of who speaks next and when, and the trajectory of rapport across entire interactions. These tasks rest on a shared collection of 945 naturalistic conversations recorded in five languages. The work supplies fixed data splits, evaluation metrics, and initial baseline results to enable direct comparison of approaches. If the benchmark succeeds, progress would shift from isolated speaker analysis toward coupled, time-evolving interpersonal processes.

Core claim

The challenge establishes three tasks on one dataset: directional interpersonal influence prediction, next-speaker and time-to-next-speech turn-taking prediction, and rapport trajectory prediction across full interactions, with baseline results showing that current methods reach 0.40 CCC for influence, 0.66 Macro-F1 for turn-taking, and 0.68 CCC for rapport yet still fall short on directional dependence and long-horizon dynamics.

What carries the argument

The three coordinated sub-challenges built on a shared dataset of 945 dyadic conversations, which supplies multimodal signals, fixed train-validation-test splits, and standardized metrics to support temporal reasoning and cross-context evaluation.

Load-bearing premise

The collected conversations supply representative coverage of directional influence, timing coordination, and rapport development so that the chosen metrics and splits will drive progress beyond speaker-centric methods.

What would settle it

A demonstration that models achieving strong scores on the three sub-challenges show no improvement over existing speaker-centric baselines when tested on separate, independent sets of dyadic interactions.

Figures

Figures reproduced from arXiv: 2605.02672 by Alice Baird, Emilia Parada-Cabaleiro, Jakub Piotr Clapa, Jeffrey Brooks, Jens Madsen, Lukas Stappen, Panagiotis Tzirakis, Sharath Rao, Theo Lebryk.

**Figure 1.** Example frames from the Hume DaiKon challenge dataset page

**Figure 2.** Arousal–valence visualization from the public Hume DaiKon chal

Original abstract

The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction across full interactions. The challenge is built on the Hume-DaiKon dataset, comprising 945 dyadic conversations (743.4 hours of audiovisual data) collected under naturalistic conditions across five languages. The benchmark supports multimodal modeling, temporal reasoning, and cross-context generalization through fixed train/validation/test splits, standardized metrics, and released baseline systems. Evaluation uses Concordance Correlation Coefficient (CCC), Pearson correlation, Macro-F1, and Mean Absolute Error (MAE) depending on the sub-challenge. Baseline experiments establish initial reference performance, with best test results of 0.40 CCC and 0.50 Pearson for influence prediction, 0.66 Macro-F1 and 1.50 s MAE for turn-taking, and 0.68 CCC and 0.70 Pearson for rapport trajectory modeling. These results indicate that while current methods capture coarse dyadic patterns, robust modeling of directional dependence and long-horizon interpersonal dynamics remains challenging. The workshop provides a shared platform for rigorous comparison and cross-disciplinary discussion on data validity, evaluation protocols, and culturally aware modeling for dyadic interaction.
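The abstract reports both CCC and Pearson for the influence and rapport tasks, and the two can diverge. The sketch below uses the textbook definitions to show why: Pearson ignores a constant offset between prediction and target, while CCC penalizes it. The function names and toy values are illustrative assumptions; the challenge's official scoring scripts may aggregate differently (for example per conversation or per dimension).

```python
import math


def _moments(x, y):
    """Return means, population variances and covariance of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    _, _, vx, vy, cov = _moments(x, y)
    return cov / math.sqrt(vx * vy)


def ccc(x, y):
    """Concordance Correlation Coefficient (Lin's definition).

    Unlike Pearson, the denominator includes the squared mean difference,
    so a systematic bias in the predictions lowers the score.
    """
    mx, my, vx, vy, cov = _moments(x, y)
    return 2 * cov / (vx + vy + (mx - my) ** 2)


# Toy values (hypothetical, not taken from the dataset).
gold = [0.1, 0.4, 0.5, 0.9]
pred_shifted = [g + 0.5 for g in gold]  # perfect shape, constant offset

print("Pearson:", round(pearson(gold, pred_shifted), 3))
print("CCC:    ", round(ccc(gold, pred_shifted), 3))
```

On the toy data the shifted prediction keeps a Pearson correlation of essentially 1 while its CCC drops well below 1, which is consistent with the baseline CCC values in the abstract being lower than the corresponding Pearson values.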

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a workshop announcement that introduces a new dyadic benchmark with three sub-challenges on a large multilingual dataset, but it functions more as a call for participation than a research result.

The core contribution is the coordinated setup of three sub-challenges on the Hume-DaiKon dataset: directional interpersonal influence prediction, turn-taking (next speaker and timing), and rapport trajectory prediction. The dataset itself (945 naturalistic conversations totaling 743 hours across five languages), plus fixed splits and released baselines, gives the field a concrete shared resource that prior speaker-centric benchmarks did not provide in this form. Reporting initial scores (0.40 CCC for influence, 0.66 Macro-F1 for turn-taking, 0.68 CCC for rapport) shows the tasks are non-trivial and highlights where current methods fall short on directional and long-horizon aspects. That is useful for standardizing evaluation in affective computing and social robotics.

The main limitation is that the document is an announcement rather than a full empirical paper. There is little detail on how the metrics were chosen, no error analysis, and no deeper validation that the splits or collection protocol actually isolate the claimed dyadic phenomena. The representativeness claim for multilingual naturalistic data is stated but not tested against alternatives.

This is the kind of resource paper that researchers working on multimodal social signal processing or virtual agents would find practical for testing models. It is not something I would cite for a new method, but the benchmark definition itself could be referenced once the challenge runs. I would send it to peer review in a workshop track so the design gets feedback on metrics and data validity before the 2026 event.

Referee Report

0 major / 0 minor

Summary. The manuscript announces the 2026 ACII Dyadic Conversations (DaiKon) Workshop & Challenge, introducing a benchmark for interpersonal affect and social dynamics in dyadic conversations. It defines three sub-challenges on the shared Hume-DaiKon dataset (945 naturalistic conversations across five languages, 743.4 hours): (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction. The benchmark supplies fixed train/validation/test splits, metrics (CCC, Pearson correlation, Macro-F1, MAE), released baseline systems, and initial test scores (0.40 CCC / 0.50 Pearson for influence; 0.66 Macro-F1 / 1.50 s MAE for turn-taking; 0.68 CCC / 0.70 Pearson for rapport) that are presented as evidence that current methods capture only coarse patterns.
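For the turn-taking sub-challenge the summary pairs Macro-F1 (next-speaker classification) with MAE in seconds (time-to-next-speech). A minimal sketch of both metrics, using standard definitions, is given below. The label names `"A"`/`"B"` and the toy values are assumptions for illustration only; the challenge's actual label set and scoring code are not described in the text above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error, here in seconds for time-to-next-speech."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical next-speaker labels: the model always predicts speaker "A".
next_true = ["A", "A", "A", "B"]
next_pred = ["A", "A", "A", "A"]
accuracy = sum(t == p for t, p in zip(next_true, next_pred)) / len(next_true)

# Hypothetical time-to-next-speech targets and predictions, in seconds.
gap_true = [1.0, 2.0]
gap_pred = [1.5, 4.0]

print("accuracy:", accuracy)
print("macro-F1:", round(macro_f1(next_true, next_pred), 3))
print("MAE (s): ", mae(gap_true, gap_pred))
```

Because Macro-F1 averages classes with equal weight, the always-predict-"A" model scores much lower on Macro-F1 than on accuracy. That makes it a reasonable choice when next-speaker labels are imbalanced, though whether they are imbalanced in this dataset is not stated here.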

Significance. If the benchmark is widely adopted, it could meaningfully shift affective computing research toward coupled, time-evolving dyadic processes rather than speaker-centric models, providing a reproducible platform for multimodal, temporal, and cross-lingual work. The release of a large naturalistic multilingual corpus together with baseline code and fixed evaluation protocols is a concrete strength that supports community-driven progress.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of the benchmark's potential impact on shifting affective computing toward dyadic and temporal modeling, and the recommendation to accept. We appreciate the acknowledgment of the dataset release, fixed splits, baselines, and evaluation protocols as concrete strengths.

Circularity Check

0 steps flagged

No circularity; benchmark definition is independent of any model or derivation

This is a workshop/challenge announcement paper that defines a new benchmark (three sub-challenges on the Hume-DaiKon dataset with fixed splits, metrics, and released baselines) without any claimed derivation, first-principles prediction, or mathematical chain. The dataset, splits, evaluation protocols (CCC, Pearson, Macro-F1, MAE), and baseline reference scores are presented as design choices and empirical starting points, not as outputs derived from or fitted to the challenge itself. No equations, ansatzes, uniqueness theorems, or self-citations are invoked to justify load-bearing premises; the central claim is simply that the proposed benchmark will enable future progress. The announcement is therefore self-contained and contains no steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark and workshop announcement paper; it introduces no mathematical derivations, fitted parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5651 in / 1142 out tokens · 22116 ms · 2026-05-08T19:07:07.343388+00:00 · methodology
