pith. machine review for the scientific record.

arxiv: 2604.12700 · v1 · submitted 2026-04-14 · 💻 cs.AI

Recognition: unknown

MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords intent recognition · multimodal dataset · multi-turn dialogue · strategic deception · MLLM evaluation · evidence chains · hidden intent detection

The pith

MISID dataset reveals MLLM failures in multi-turn deception and FRACTAM improves intent inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the MISID dataset, drawn from high-stakes social strategy games, as a multimodal and multi-turn benchmark with two-tier annotations designed for tracking complex deceptive intents across long contexts. It tests current multimodal large language models and identifies specific shortcomings including text-prior visual hallucination, weak cross-modal integration, and difficulty chaining causal cues. The authors then present FRACTAM, which applies a Decouple-Anchor-Reason approach to extract unbiased unimodal facts, anchor them through staged retrieval, and link them into explicit evidence chains. A reader would care because reliable detection of hidden intent in extended interactions could support more accurate behavioral analysis in human-AI systems. If the reported gains hold, models would handle strategic deception tasks with higher inference accuracy without sacrificing basic perceptual reliability.

Core claim

We introduce MISID, a multimodal multi-turn multi-participant dataset for complex intent recognition sourced from strategic deception games and equipped with a fine-grained two-tier multi-dimensional annotation scheme for long-context discourse and causal tracking. Systematic evaluation of state-of-the-art MLLMs on MISID exposes critical deficiencies such as text-prior visual hallucination, impaired cross-modal synergy, and limited capacity for chaining causal cues. We therefore propose FRACTAM, a baseline framework that follows a Decouple-Anchor-Reason paradigm to extract pure unimodal factual representations, apply two-stage retrieval for long-range factual anchoring, and construct explicit cross-modal evidence chains.

What carries the argument

The Decouple-Anchor-Reason paradigm that first decouples modalities to obtain unbiased factual representations, then anchors long-range facts via two-stage retrieval, and finally constructs explicit cross-modal evidence chains for intent inference.
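The paper's released implementation is not reproduced on this page. Below is a minimal, self-contained sketch of how the three stages could compose; the Fact record, the token-overlap scorer, and the prompt shape are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    turn: int       # dialogue turn the fact came from
    modality: str   # "text", "vision", or "audio"
    statement: str  # unimodal, caption-style description

def decouple(turns: list[dict]) -> list[Fact]:
    # Stage 1 (Decouple): read each modality in isolation, so text priors
    # cannot overwrite what the frame or the audio actually contains.
    return [Fact(i, m, turn[m]) for i, turn in enumerate(turns) for m in turn]

def score(fact: Fact, query: str) -> float:
    # Stand-in relevance scorer (token overlap). A real system would use
    # embeddings for the coarse pass and a reranker for the fine pass.
    q = set(query.lower().split())
    return len(q & set(fact.statement.lower().split())) / max(len(q), 1)

def anchor(facts: list[Fact], query: str, k_coarse: int = 20, k_fine: int = 5) -> list[Fact]:
    # Stage 2 (Anchor): two-stage retrieval -- a cheap wide pass over the
    # whole long context, then a tighter rerank of the survivors.
    coarse = sorted(facts, key=lambda f: score(f, query), reverse=True)[:k_coarse]
    return sorted(coarse, key=lambda f: score(f, query), reverse=True)[:k_fine]

def reason(anchored: list[Fact], query: str) -> str:
    # Stage 3 (Reason): lay the anchored facts out as an explicit evidence
    # chain, then pose the intent question against that chain.
    chain = "\n".join(f"[turn {f.turn} | {f.modality}] {f.statement}" for f in anchored)
    return f"Evidence chain:\n{chain}\n\nQuestion: {query}\nAnswer citing the evidence."

turns = [
    {"text": "I was scouting the east side all night.", "vision": "speaker avoids eye contact"},
    {"text": "Trust me, vote the doctor out.", "audio": "raised pitch, hurried pacing"},
]
query = "is the speaker concealing their role?"
print(reason(anchor(decouple(turns), query), query))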

If this is right

  • Current MLLMs exhibit text-prior visual hallucination, impaired cross-modal synergy, and weak causal chaining on complex strategic tasks.
  • FRACTAM raises hidden intent detection and inference accuracy on the MISID benchmark.
  • Perceptual accuracy stays robust while inference improves under the new framework.
  • The two-tier annotation supports evidence-based causal tracking across extended multimodal discourse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring and evidence-chain technique could be tested on real-world negotiation transcripts or security interview recordings to check generalization beyond games.
  • If the two-tier scheme proves reliable, it could be adapted to label long video or audio archives for intent in other high-stakes domains.
  • The emphasis on pure unimodal facts before fusion suggests a route to reduce hallucination in any long-context multimodal model, not only intent recognition.

Load-bearing premise

High-stakes social strategy games accurately mirror the structure and cues of real-world extended deceptive narratives and the two-tier annotation scheme reliably captures causal intent.

What would settle it

If FRACTAM applied to MISID produces no measurable gain in hidden intent detection accuracy over unmodified MLLMs, or if inter-annotator agreement on the causal-tracking tier proves low, the performance claims would be falsified.
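One concrete way to run the first half of that test: a paired bootstrap over per-item correctness for the same model with and without FRACTAM. The sketch below is illustrative; the correctness vectors are placeholders, not reported results.

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_p(base: np.ndarray, treated: np.ndarray, n_resamples: int = 10_000) -> float:
    # base/treated: 0-1 correctness per test item, aligned so index i is
    # the same hidden-intent item under both conditions.
    idx = rng.integers(0, len(base), size=(n_resamples, len(base)))
    gains = treated[idx].mean(axis=1) - base[idx].mean(axis=1)
    # Fraction of resamples where FRACTAM fails to beat the baseline:
    # a one-sided bootstrap p-value for "no measurable gain".
    return float((gains <= 0).mean())

# Placeholder correctness vectors for 200 items (invented, for shape only).
base = (rng.random(200) < 0.55).astype(float)
treated = (rng.random(200) < 0.62).astype(float)
print(paired_bootstrap_p(base, treated))
```

A low value would count against the null of no gain; the second half of the test, low inter-annotator agreement, is made concrete in the rebuttal section below.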

Figures

Figures reproduced from arXiv: 2604.12700 by Dayou Zhang, Fangxin Wang, Muyang Chen, Rongrong Zhang, Shufang Lin, Xiabing Zhou.

Figure 1. An overview of the MISID benchmark. (Top) A multi-participant strategic dialogue timeline exhibiting hidden tactics.
Figure 3. Emotion category distribution (left) and speech …
Figure 4. Overall architecture of the FRACTAM framework. The pipeline standardizes multimodal inputs into objective text, …
Original abstract

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a ``Decouple-Anchor-Reason'' paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MISID, a multimodal multi-turn dataset sourced from high-stakes social strategy games, featuring a fine-grained two-tier multi-dimensional annotation scheme for long-context discourse analysis and evidence-based causal tracking of deceptive intents. It evaluates state-of-the-art MLLMs on the dataset, identifying deficiencies such as text-prior visual hallucination, impaired cross-modal synergy, and limited causal cue chaining. The authors propose FRACTAM, a Decouple-Anchor-Reason baseline framework that extracts unimodal factual representations, uses two-stage retrieval for factual anchoring, and builds explicit cross-modal evidence chains, with experiments claiming improved hidden intent detection and inference while preserving perceptual accuracy.

Significance. If the annotations are shown to be reliable proxies for causal structure, MISID would provide a valuable new benchmark for multimodal intent recognition in extended strategic interactions, addressing a gap in existing single-turn or simple-dialogue datasets. The FRACTAM framework offers a practical paradigm for reducing text bias in MLLMs on complex tasks. Dataset release and reproducible baseline experiments are positive contributions that could spur further work in behavioral analysis and HCI.

major comments (1)
  1. [Dataset construction and annotation] The two-tier annotation scheme (described in the abstract and dataset construction) is load-bearing for all empirical claims, yet the manuscript reports no inter-annotator agreement metrics, adjudication protocol, or external validation against known ground-truth intents. Without these, the reported MLLM deficiencies and FRACTAM gains on hidden-intent detection cannot be interpreted as reflecting true causal structure rather than annotator noise or bias.
minor comments (1)
  1. [Abstract] The abstract states that FRACTAM "enhances mainstream models' performance" but does not name the specific metrics (e.g., F1, accuracy) or statistical significance tests used in the "extensive experiments."
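To make the minor comment concrete, this is roughly what naming the metric would pin down: a from-scratch macro-F1 over intent labels. The label set and predictions below are invented for illustration.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    # Average per-class F1, so rare intent classes weigh as much as common ones.
    f1s = []
    for c in set(gold) | set(pred):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["deceive", "ally", "deceive", "probe"]
pred = ["deceive", "deceive", "deceive", "probe"]
print(round(macro_f1(gold, pred), 3))  # 0.6
```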

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, particularly on the critical role of annotation reliability for the MISID dataset. We address the major comment below and commit to revisions that strengthen the manuscript.

point-by-point responses
  1. Referee: The two-tier annotation scheme (described in the abstract and dataset construction) is load-bearing for all empirical claims, yet the manuscript reports no inter-annotator agreement metrics, adjudication protocol, or external validation against known ground-truth intents. Without these, the reported MLLM deficiencies and FRACTAM gains on hidden-intent detection cannot be interpreted as reflecting true causal structure rather than annotator noise or bias.

    Authors: We agree that the lack of reported inter-annotator agreement (IAA) metrics and a detailed adjudication protocol is a notable gap, given the centrality of the two-tier scheme to all empirical claims. The manuscript describes the annotation process at a high level but does not quantify agreement or fully specify how disagreements were resolved. In the revised version, we will expand the dataset construction section to include IAA metrics (such as Fleiss' kappa across annotators for both tiers), a complete description of the multi-annotator workflow and adjudication protocol, and any available cross-checks against game logs or other observable evidence. These additions will support interpreting the MLLM deficiencies and FRACTAM improvements as reflecting genuine model limitations rather than annotation artifacts. On external validation against independent ground-truth intents, we note that the deceptive intents in this benchmark are inherently inferred from the multimodal interactions and evidence chains; the annotations constitute the primary ground truth by design, and no separate external oracle exists beyond the provided game data.

    revision: yes
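Since the rebuttal commits to Fleiss' kappa, here is the computation it names, self-contained; the toy counts are invented, not MISID annotations.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts[i, j] = number of annotators who put item i in category j;
    # every row must sum to the same number of raters n.
    n = counts.sum(axis=1)[0]
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    p_bar = p_i.mean()                                     # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()                # category marginals
    p_e = (p_j ** 2).sum()                                 # chance agreement
    return float((p_bar - p_e) / (1 - p_e))

# Toy check: 3 annotators, 4 items, 3 intent categories.
toy = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]])
print(round(fleiss_kappa(toy), 3))  # ~0.268
```

Values near zero would indicate agreement no better than chance on the causal-tracking tier, which is exactly the failure mode the falsification criterion above points at.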

Circularity Check

0 steps flagged

No circularity: empirical dataset release and baseline method with independent evaluation

full rationale

The paper introduces the MISID dataset from high-stakes games, describes a two-tier annotation scheme for causal tracking, evaluates MLLMs on deficiencies like text-prior hallucination, and proposes FRACTAM via a Decouple-Anchor-Reason paradigm as an empirical baseline. No mathematical derivations, equations, fitted parameters, or predictions are present that reduce to self-defined inputs. No self-citations appear in the abstract or described content, and the central claims rest on direct experimental comparisons rather than tautological reductions or imported uniqueness theorems. The work is self-contained as a dataset contribution plus method proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, mathematical axioms, or invented physical entities; the paper rests on standard assumptions about multimodal data utility and the representativeness of game-based deception scenarios.

pith-pipeline@v0.9.0 · 5554 in / 1179 out tokens · 22199 ms · 2026-05-10T14:37:09.792968+00:00 · methodology

