pith. machine review for the scientific record.

arxiv: 2604.11594 · v2 · submitted 2026-04-13 · 📡 eess.AS · cs.SD

Recognition: unknown

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models


Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

keywords: emotional intelligence benchmark · audio language models · multi-turn dialogue · empathy evaluation · acoustic semantic conflict · human recorded data · multiple choice questions

The pith

Audio language models struggle with tracking emotions over multiple conversation turns and favor text over audio cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HumDial-EIBench, a benchmark built from real human-recorded multi-turn dialogues to test how well audio language models handle emotional intelligence. It converts emotional tracking and causal reasoning into multiple-choice questions with tricky distractors and adds tasks for generating empathetic replies plus spotting when audio and text contradict each other. Evaluations show current models have difficulty following emotional states across turns and reasoning about unspoken causes, while also treating text-based and sound-based empathy as separate skills. A reader would care because voice-based AI needs to understand feelings naturally to be useful in conversations, and poor performance here highlights gaps in current technology. The multiple-choice format reduces the subjectivity that comes with open-ended scoring in older tests.

Core claim

HumDial-EIBench reformulates emotional tracking and implicit causal reasoning from human dialogues as multiple-choice questions, retains empathetic response generation, and introduces an acoustic-semantic conflict task. Across the eight evaluated ALMs, most struggle with multi-turn emotional tracking and implicit causal reasoning, and all exhibit decoupled textual and acoustic empathy together with a severe text-dominance bias under cross-modal conflict.

What carries the argument

HumDial-EIBench itself: a benchmark that draws on real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, builds multiple-choice questions with adversarial distractors for the cognitive EI tasks, and adds empathetic response generation and acoustic-semantic conflict assessment.
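
As a rough illustration of how the multiple-choice tasks might be scored, the sketch below computes per-task accuracy alongside the chance level implied by the number of options. The `MCQItem` fields, task names, and example options are hypothetical and are not taken from the paper; its actual scoring protocol may differ.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    """One multiple-choice item. All field names are hypothetical."""
    task: str            # e.g. "emotional_tracking" or "causal_reasoning"
    options: list[str]   # the correct answer plus adversarial distractors
    answer: int          # index of the correct option
    prediction: int      # index chosen by the model


def accuracy_by_task(items):
    """Return {task: (accuracy, chance_level)} for the given items."""
    totals, correct, chance = {}, {}, {}
    for it in items:
        totals[it.task] = totals.get(it.task, 0) + 1
        correct[it.task] = correct.get(it.task, 0) + (it.prediction == it.answer)
        # Chance level of one item is 1 / number of options.
        chance[it.task] = chance.get(it.task, 0.0) + 1.0 / len(it.options)
    return {t: (correct[t] / totals[t], chance[t] / totals[t]) for t in totals}


# Toy items, invented purely for illustration.
items = [
    MCQItem("emotional_tracking", ["joy", "anger", "relief", "fear"], 2, 2),
    MCQItem("emotional_tracking", ["joy", "anger", "relief", "fear"], 0, 1),
    MCQItem("causal_reasoning", ["missed call", "lost job", "bad grade", "argument"], 1, 1),
]
scores = accuracy_by_task(items)
print(scores)
```

Reporting the chance level next to accuracy makes it easier to judge whether a model that "struggles" is merely below human level or is close to guessing.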

If this is right

  • ALMs require improved architectures for maintaining emotional context across conversation turns rather than processing each turn independently.
  • Training must address the separation between textual empathy and acoustic empathy to achieve integrated multimodal understanding.
  • Models need specific handling for cases where text and audio signals conflict to reduce the observed text-dominance bias.
  • Future development of ALMs should incorporate implicit reasoning capabilities instead of relying on explicit emotional cues.
  • Benchmarks for emotional intelligence should prioritize real human data over synthesized speech to better reflect practical performance.
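
The text-dominance bias named in the list above can be made concrete with a simple rate: among items where the textual and acoustic emotion labels disagree, how often does the model's answer follow the text? The sketch below is one plausible way to compute this. The tuple layout and label names are assumptions for illustration, and the paper may define its conflict metric differently.

```python
def text_dominance_rate(items):
    """Summarise which modality a model follows on conflict items.

    items: iterable of (text_label, audio_label, predicted_label) tuples.
    Items whose text and audio labels agree are skipped, since they are
    not conflict cases. Returns the fraction of conflict items where the
    prediction matched the text label, the audio label, or neither.
    """
    follows_text = follows_audio = other = 0
    for text_label, audio_label, pred in items:
        if text_label == audio_label:
            continue  # not a conflict item
        if pred == text_label:
            follows_text += 1
        elif pred == audio_label:
            follows_audio += 1
        else:
            other += 1
    n = follows_text + follows_audio + other
    if n == 0:
        raise ValueError("no conflict items supplied")
    return {"text": follows_text / n, "audio": follows_audio / n, "other": other / n}


# Toy predictions, invented purely for illustration.
conflict_items = [
    ("happy", "sad", "happy"),
    ("happy", "angry", "happy"),
    ("neutral", "sad", "sad"),
    ("calm", "angry", "calm"),
]
rates = text_dominance_rate(conflict_items)
print(rates)
```

A `text` rate far above the `audio` rate would correspond to the bias the paper reports; equal rates would suggest neither modality dominates.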

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving on this benchmark might require new training objectives focused on dialogue history and causal inference rather than next-token prediction alone.
  • The text-dominance bias suggests that audio features are underutilized even in models designed for speech, pointing to a need for better audio encoders or fusion methods.
  • Real-world applications like customer service bots could benefit from passing this benchmark to ensure they respond appropriately to emotional shifts.
  • This work highlights that current scaling of language models to audio does not automatically solve emotional intelligence challenges.

Load-bearing premise

Reformulating emotional tracking and causal reasoning from human dialogues into multiple-choice questions with adversarial distractors measures true emotional intelligence without missing important aspects or adding unintended biases.

What would settle it

A model that scores highly on HumDial-EIBench but fails to maintain emotional coherence or generate appropriate empathy in actual live multi-turn conversations with people would indicate the benchmark does not fully capture real EI.

Figures

Figures reproduced from arXiv: 2604.11594 by Chengyou Wang, Hongfei Xue, Hui Bu, Lei Xie, Shuai Wang, Shuiyuan Wang, Xin Xu, Zhixian Zhao.

Figure 1. Data construction and task overview of HumDial-EIBench. Left: Three-stage pipeline (Stage 1: Dialogue Script Design; Stage 2: Authentic Human Enactment and Quality Control; Stage 3: Multiple-Choice Reformulation and Distractor Construction). Right: Representative examples of the four tasks.
Original abstract

Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs' EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces HumDial-EIBench, a benchmark for assessing emotional intelligence in audio language models using real human-recorded multi-turn dialogues from the ICASSP 2026 HumDial Challenge. It reformulates tasks like emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors to reduce subjective scoring, includes empathetic response generation, and introduces an acoustic-semantic conflict task. Evaluations of eight ALMs show that most struggle with multi-turn emotional tracking and implicit causal reasoning, exhibit decoupled textual and acoustic empathy, and display text-dominance bias in cross-modal conflicts.

Significance. If the benchmark is validated to measure the intended EI components without introducing artifacts, this would represent a meaningful advance by shifting from synthetic/single-turn setups to ecologically valid human dialogues and multimodal conflict testing. The reliance on real-recorded data from an existing challenge is a clear strength that could support more reliable ALM evaluation.

major comments (3)
  1. [Abstract] The headline claims (that the eight ALMs struggle with multi-turn emotional tracking and implicit causal reasoning, and exhibit decoupled empathy and text-dominance bias) are presented without any reported sample sizes, statistical tests, data splits, or question counts, preventing assessment of whether the observed failures are reliable or replicable.
  2. [Benchmark construction] The reformulation of emotional tracking and causal reasoning into MCQs with adversarial distractors derived from real dialogues lacks any validation, such as human performance baselines, inter-annotator agreement on question quality, or correlation with external EI instruments. Without these, it is impossible to rule out that model failures arise from test artifacts rather than genuine EI deficits.
  3. [Evaluation] No details are provided on the number of instances per task, how train/test splits were constructed to prevent leakage from the source dialogues, or any significance testing for performance differences across models or modalities.
minor comments (1)
  1. [Abstract] The abstract could briefly note the total number of questions or models evaluated to give readers an immediate sense of scale.
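
The significance testing the referee asks for could take several forms. One common choice when two models answer the same multiple-choice items is an exact McNemar test on the paired per-item outcomes, sketched below using only the standard library. This is an illustration of that general technique, not a description of what the paper does; the function name and toy data are assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness.

    correct_a, correct_b: equal-length sequences of booleans, one entry
    per benchmark item, for model A and model B respectively.
    Returns (only_a, only_b, p_value), where only_a counts items that
    A answered correctly and B did not, and only_b the reverse.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagree
    # Under the null hypothesis, discordant pairs split 50/50 (binomial).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy outcomes on ten items, invented purely for illustration.
model_a = [True] * 8 + [False] * 2
model_b = [True] * 2 + [False] * 6 + [True] * 2
only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
print(only_a, only_b, p_value)
```

Only items on which the two models disagree carry information here, which is why reporting item counts per task matters: with few discordant items, even a large accuracy gap may not be statistically distinguishable from chance.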

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each major comment point by point below, indicating the changes made to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The headline claims (that the eight ALMs struggle with multi-turn emotional tracking and implicit causal reasoning, and exhibit decoupled empathy and text-dominance bias) are presented without any reported sample sizes, statistical tests, data splits, or question counts, preventing assessment of whether the observed failures are reliable or replicable.

    Authors: We agree that the abstract would benefit from explicit quantitative context to support the summary claims. In the revised manuscript, we have updated the abstract to include the overall scale of the evaluation (number of questions and models) and a reference to the statistical analyses. We have also expanded the Evaluation section to report the exact number of instances per task, the train/test split construction (ensuring no dialogue leakage), and the statistical tests used for comparing performances across models and modalities. revision: yes

  2. Referee: [Benchmark construction] The reformulation of emotional tracking and causal reasoning into MCQs with adversarial distractors derived from real dialogues lacks any validation, such as human performance baselines, inter-annotator agreement on question quality, or correlation with external EI instruments. Without these, it is impossible to rule out that model failures arise from test artifacts rather than genuine EI deficits.

    Authors: We acknowledge that explicit validation strengthens claims about the benchmark measuring intended EI components. In the revised manuscript, we have added a validation subsection to the Benchmark Construction section. This includes human performance baselines on sampled questions, inter-annotator agreement for question and distractor quality, and discussion of alignment with established EI frameworks. These additions help demonstrate that model shortcomings reflect EI limitations rather than construction artifacts. revision: yes

  3. Referee: [Evaluation] No details are provided on the number of instances per task, how train/test splits were constructed to prevent leakage from the source dialogues, or any significance testing for performance differences across models or modalities.

    Authors: We agree that these details are necessary for replicability and assessment of results. In the revised manuscript, we have substantially expanded the Evaluation section to specify the number of instances per task, describe the train/test split procedure (turn-level partitioning with no overlapping source dialogues to prevent leakage), and report significance testing for performance differences across models and input modalities. revision: yes
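
The leakage concern in the third exchange comes down to keeping every item derived from one source dialogue on the same side of a split. A minimal sketch of such a group-wise split follows. The `dialogue_id` key, the 20% test fraction, and the seed are hypothetical choices for illustration, not details reported by the authors.

```python
import random
from collections import defaultdict


def split_by_dialogue(items, test_fraction=0.2, seed=0):
    """Split items so that no source dialogue appears in both partitions.

    items: list of dicts that each carry a (hypothetical) 'dialogue_id' key.
    Whole dialogues are assigned to train or test, never individual turns.
    """
    groups = defaultdict(list)
    for it in items:
        groups[it["dialogue_id"]].append(it)
    ids = sorted(groups)                  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test = [it for d in ids[:n_test] for it in groups[d]]
    train = [it for d in ids[n_test:] for it in groups[d]]
    return train, test


# Toy data: ten dialogues with three turn-level items each.
items = [{"dialogue_id": f"d{i // 3}", "turn": i % 3} for i in range(30)]
train, test = split_by_dialogue(items)
train_ids = {it["dialogue_id"] for it in train}
test_ids = {it["dialogue_id"] for it in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting at the turn level without this grouping would let a model see earlier turns of a dialogue during training and later turns at test time, which is the leakage the referee describes.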

Circularity Check

0 steps flagged

No circularity: benchmark construction and direct model evaluations are self-contained

Full rationale

The paper introduces HumDial-EIBench by reformulating existing human dialogues into MCQs and reports empirical results from evaluating eight ALMs. No derivations, equations, fitted parameters, or predictions are present that could reduce to self-referential inputs. Claims about model struggles are direct observations from the benchmark runs, not constructed equivalences. Self-citation is absent from load-bearing positions, and the work is independent of any prior author results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on assumptions about how emotional intelligence should be measured and the representativeness of the source data.

axioms (2)
  • domain assumption: Emotional intelligence in multi-turn dialogues can be validly assessed via multiple-choice questions on tracking and causal reasoning plus open-ended empathy generation.
    Paper reformulates tasks this way to reduce subjective scoring bias.
  • domain assumption: Real-recorded dialogues from the ICASSP 2026 HumDial Challenge sufficiently represent natural emotional interactions for benchmarking.
    Used as the sole data source for the benchmark.



Reference graph

Works this paper leans on

36 extracted references · 24 canonical work pages · 9 internal anchors

  1. [1]

    Introduction. Traditional spoken dialogue systems rely on cascaded architectures (automatic speech recognition → large language model → text-to-speech), where the intermediate text transcription inevitably discards critical paralinguistic cues, such as intonation and emotion. The recent paradigm shift toward end-to-end audio language models (ALMs) [1, 2...

  2. [2]

    Related Work. Audio Language Models. Traditional spoken dialogue systems adopt a cascaded ASR → LLM → TTS architecture, where the intermediate text transcription inevitably discards paralinguistic cues such as intonation and emotion. End-to-end ALMs, including open-source models like Moshi [21], Qwen2.5-Omni [22], and Qwen3-Omni [23], alongside closed-source...

  3. [3]

    I won a prize! Two movie tickets!

    HumDial-EIBench. HumDial-EIBench is directly built upon the test set of the ICASSP 2026 HumDial Challenge. To ensure both controllability and realism, the foundational data were created by designing specific dialogue scenarios and speaker turns, which were then naturalistically enacted. This paper extends this foundation by reformulating the open-end...

  4. [4]

    Conflict

    Experiments. 4.1. Experimental setup. We evaluate eight ALMs in two categories. The open-source group includes Freeze-Omni [27], GLM-4-Voice [28], Kimi-Audio [29], Step-Audio-2-mini [30], and Qwen2.5-Omni [22]. The closed-source group consists of Doubao-realtime, GPT-4o-audio [1], and Gemini-2.5-flash [2]. These models represent the current state of the ...

  5. [5]

    text-dominance bias

    Discussion and Conclusion. This paper introduces HumDial-EIBench, an objective evaluation framework utilizing authentic human-recorded dialogues to assess the emotional intelligence of ALMs. By reformulating open-ended scenarios into multiple-choice tasks, the benchmark successfully isolates multi-turn emotional memory and reasoning capacities from s...

  6. [6]

    The authors are fully responsible and accountable for the final content of this paper

    Generative AI Use Disclosure Generative AI models, including Qwen3-TTS, Qwen2.5-Omni and Gemini 2.5 Pro, were used for data generation, and response evaluation. The authors are fully responsible and accountable for the final content of this paper. All authors agree with the submission of this paper

  7. [7]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  9. [9]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

  10. [10]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo et al., “Qwen2-Audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  11. [11]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

    D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” in EMNLP, ser. Findings of ACL, vol. EMNLP, 2023, pp. 15757–15773

  12. [12]

    E-chat: Emotion-sensitive spoken dialogue system with large language models

    H. Xue, Y. Liang, B. Mu, S. Zhang, M. Chen, Q. Chen, and L. Xie, “E-chat: Emotion-sensitive spoken dialogue system with large language models,” in Proc. ISCSLP. IEEE, 2024, pp. 586–590

  13. [13]

    OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

    X. Geng, K. Wei, Q. Shao, S. Liu, Z. Lin, Z. Zhao, G. Li, W. Tian, P. Chen, Y. Li, P. Guo, M. Shao, S. Wang, Y. Cao, C. Wang, T. Xu, Y. Dai, X. Zhu, Y. Li, L. Zhang, and L. Xie, “OSUM: Advancing open speech understanding models with limited resources in academia,” arXiv preprint arXiv:2501.13306, 2025

  14. [14]

    OSUM-EChat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue

    X. Geng, Q. Shao, H. Xue, S. Wang, H. Xie, Z. Guo, Y. Zhao, G. Li, W. Tian, C. Wang, Z. Zhao et al., “OSUM-EChat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue,” arXiv preprint arXiv:2508.09600, 2025

  15. [15]

    EmoOmni: Bridging emotional understanding and expression in omni-modal LLMs

    W. Tian, Z. Zhao, J. Hu, H. Chen, H. Liu, B. Mu, and L. Xie, “EmoOmni: Bridging emotional understanding and expression in omni-modal LLMs,” arXiv preprint arXiv:2602.21900, 2026

  16. [16]

    When tone and words disagree: Towards robust speech emotion recognition under acoustic-semantic conflict

    D. Huang, Y. Lv, R. Xiong, C. Jin, and X. Peng, “When tone and words disagree: Towards robust speech emotion recognition under acoustic-semantic conflict,” arXiv preprint arXiv:2601.04564, 2026

  17. [17]

    The ICASSP 2026 HumDial challenge: Benchmarking human-like spoken dialogue systems in the LLM era

    Z. Zhao, S. Wang, G. Li, H. Xue, C. Wang, S. Wang, L. Xiao, Z. Zhang, H. Bu, X. Xu, X. Wang, H. Liu, E. S. Chng, H. Lee, H. Li, and L. Xie, “The ICASSP 2026 HumDial challenge: Benchmarking human-like spoken dialogue systems in the LLM era,” arXiv preprint arXiv:2601.05564, 2026

  18. [18]

    Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech

    C. Huang, K. Lu, S. Wang, C. Hsiao, C. Kuan, H. Wu, S. Arora, K. Chang, J. Shi, Y. Peng, R. S. Sharma, S. Watanabe, B. Ramakrishnan, S. Shehata, and H. Lee, “Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in Proc. ICASSP. IEEE, 2024, pp. 12136–12140

  19. [19]

    AIR-Bench: Benchmarking large audio-language models via generative comprehension

    Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang, X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou, and J. Zhou, “AIR-Bench: Benchmarking large audio-language models via generative comprehension,” in Proc. ACL, L. Ku, A. Martins, and V. Srikumar, Eds. Association for Computational Linguistics, 2024, pp. 1979–1998

  20. [20]

    HPSU: A benchmark for human-level perception in real-world spoken speech understanding

    C. Li, P. Yang, Y. Zhong, J. Yu, Z. Wang, Z. Gou, W. Chen, and J. Yin, “HPSU: A benchmark for human-level perception in real-world spoken speech understanding,” arXiv preprint arXiv:2511.23178, 2025

  21. [21]

    VoiceBench: Benchmarking LLM-based voice assistants

    Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, “VoiceBench: Benchmarking LLM-based voice assistants,” arXiv preprint arXiv:2410.17196, 2024

  22. [22]

    URO-Bench: Towards comprehensive evaluation for end-to-end spoken dialogue models

    R. Yan, X. Li, W. Chen, Z. Niu, C. Yang, Z. Ma, K. Yu, and X. Chen, “URO-Bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,” in Proc. EMNLP, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Association for Computational Linguistics, 2025, pp. 17211–17242

  23. [23]

    ParaS2S: Benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction

    S.-w. Yang, M. Tu, A. T. Liu, X. Qu, H.-y. Lee, L. Lu, Y. Wang, and Y. Wu, “ParaS2S: Benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction,” arXiv preprint arXiv:2511.08723, 2025

  24. [24]

    VoxDialogue: Can spoken dialogue systems understand information beyond words?

    X. Cheng, R. Hu, X. Yang, J. Lu, D. Fu, Z. Wang, S. Ji, R. Huang, B. Zhang, T. Jin, and Z. Zhao, “VoxDialogue: Can spoken dialogue systems understand information beyond words?” in Proc. ICLR. OpenReview.net, 2025

  25. [25]

    MTalk-Bench: Evaluating speech-to-speech models in multi-turn dialogues via arena-style and rubrics protocols

    Y. Du, Q. Huang, G. Zhu, Z. Dai, S. Chen, Q. Zhu, Y. Zhang, L. Zhou, and B. Wang, “MTalk-Bench: Evaluating speech-to-speech models in multi-turn dialogues via arena-style and rubrics protocols,” arXiv preprint arXiv:2508.18240, 2025

  26. [26]

    Multi-Bench: A multi-turn interactive benchmark for assessing emotional intelligence ability of spoken dialogue models

    Y. Deng, G. Hu, H. Sun, X. Zhang, H. Zhang, F. Tian, X. Yang, G. Yu, and E. S. Chng, “Multi-Bench: A multi-turn interactive benchmark for assessing emotional intelligence ability of spoken dialogue models,” arXiv preprint arXiv:2511.00850, 2025

  27. [27]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

  28. [28]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  29. [29]

    Qwen3-Omni Technical Report

    J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025

  30. [30]

    ISA-Bench: Benchmarking instruction sensitivity for large audio language models

    B. Li, W. Huang, Y. Qiu, Y. Guo, H. Wang, Z. Li, J. Peng, Z. Ma, X. Chen, and K. Yu, “ISA-Bench: Benchmarking instruction sensitivity for large audio language models,” arXiv preprint arXiv:2510.23558, 2025

  31. [31]

    Steering language model to stable speech emotion recognition via contextual perception and chain of thought

    Z. Zhao, X. Zhu, X. Wang, S. Wang, X. Geng, W. Tian, and L. Xie, “Steering language model to stable speech emotion recognition via contextual perception and chain of thought,” IEEE Transactions on Audio, Speech and Language Processing, vol. 34, pp. 415–426, 2025

  32. [32]

    Qwen3-TTS technical report

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-TTS technical report,” arXiv preprint arXiv:2601.15621, 2026

  33. [33]

    Freeze-Omni: A smart and low latency speech-to-speech dialogue model with frozen LLM

    X. Wang, Y. Li, C. Fu, Y. Zhang, Y. Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-Omni: A smart and low latency speech-to-speech dialogue model with frozen LLM,” in Proc. ICML, vol. 267, 2025

  34. [34]

    GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot

    A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y. Dong, and J. Tang, “GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot,” arXiv preprint arXiv:2412.02612, 2024

  35. [35]

    Kimi-Audio Technical Report

    D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song et al., “Kimi-Audio technical report,” arXiv preprint arXiv:2504.18425, 2025

  36. [36]

    Step-Audio 2 technical report

    B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li et al., “Step-Audio 2 technical report,” arXiv preprint arXiv:2507.16632, 2025