pith. sign in

arxiv: 2606.12018 · v1 · pith:X6GJ3I5Ynew · submitted 2026-06-10 · 💻 cs.AI

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Pith reviewed 2026-06-27 09:42 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent frameworkknowledge distillationsocial intelligence reasoningtest-time adaptationlong-tail eventsLoRA fine-tuningmultimodal large language model
0
0 comments X

The pith

A multi-agent framework on a lightweight multimodal model reaches state-of-the-art social reasoning with roughly 30 percent of the training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a collaborative multi-agent system built on a lightweight multimodal large language model to perform social intelligence reasoning. Both training and inference stages incorporate knowledge distillation, multi-modal inputs are localized, and long-tail events are converted into explicit formatted text so they are not lost amid more common signals. Distillation-enhanced test-time adaptation with LoRA fine-tuning is applied across event extraction, chain-of-thought prompting, and self-reflection steps. Evaluations on multiple benchmarks show the approach outperforms other open and closed models. A reader would care because the results suggest social reasoning can be made more data-efficient through structured extraction and adaptation rather than scale alone.

Core claim

The framework integrates multi-agent collaboration, precise localization of multi-modal social data, extraction of long-tail events into formatted text, and distillation-enhanced test-time adaptation with LoRA fine-tuning on a lightweight MLLM, enabling state-of-the-art results on social intelligence reasoning benchmarks while using only around 30 percent of the training data from IntentTrain.

What carries the argument

The MODF-SIR multi-agent omni-modal distilled framework that augments a lightweight MLLM with knowledge distillation, long-tail event text extraction, and distillation-enhanced test-time adaptation across the full reasoning pipeline.

If this is right

  • Long-tail social events can be preserved and used effectively by converting them to explicit text before tokenization.
  • Test-time adaptation applied to extraction, chain-of-thought, and self-reflection steps improves instance-level reasoning without full retraining.
  • State-of-the-art performance on social benchmarks is achievable with substantially less training data than standard approaches.
  • Multi-agent division of labor allows the system to localize relevant multi-modal inputs and handle noise from head events.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the gains trace mainly to explicit text formatting of rare events, the same step could be tested on other multimodal tasks that suffer from imbalanced signal strength.
  • The low data requirement opens the possibility of deploying similar systems on smaller curated datasets for domains where full-scale training data is scarce.
  • Adding self-reflection inside the adapted pipeline may point to iterative refinement as a general lever for improving social or commonsense reasoning.

Load-bearing premise

The combination of multi-agent collaboration, long-tail event extraction to text, distillation-enhanced test-time adaptation, and LoRA fine-tuning produces genuine gains in social reasoning rather than benchmark-specific improvements.

What would settle it

An ablation study that removes the multi-agent structure, the long-tail event text conversion, and the distillation-enhanced TTA while keeping the base lightweight model and training data fixed, then checks whether benchmark scores remain comparable or drop sharply.

Figures

Figures reproduced from arXiv: 2606.12018 by Bimei Wang, Bin Hu, Hong Peng, Jisheng Dang, Qi Tian, Shang Ma, Tat-Seng Chua, Wencan Zhang, Yifan Zhang.

Figure 1
Figure 1. Figure 1: The motivation of our method. Traditional method employs a black-box reasoning paradigm, which may cause problems such as hallucinations. Our method employs multi￾agent strategy, visualizing the reasoning steps. the omni-modal data. This conditional routing mechanism ensures the optimal allocation of computational resources. (iii) Long-tail events are inherently subtle and transient. Consequently, conducti… view at source ↗
Figure 2
Figure 2. Figure 2: The overall workflow of our MODF-SIR. Given a video, audio and query. MODF-SIR would activate different agents based on different situations and perform step-by-step reasoning. alignment between queries and visual evidence, limiting their ability to provide precise and interpretable reasoning, espe￾cially in long and complex videos. The task of temporal grounding has addressed this limitation. Existing met… view at source ↗
Figure 3
Figure 3. Figure 3: The AKD Router Agent activates other agents based [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of our MODF-SIR. ELT Retriever Agent would first collect the information from the video, audio and query. The AKD Router Agent would first determine which mode to use. In this sample, AKD Router Agent decide to use OMLT Reasoner Agent directly. OMLT Reasoner Agent generates responses, and TTA Reviser scores these responses. If the score is below the satisfaction threshold, LoRA fine-tuning is… view at source ↗
Figure 5
Figure 5. Figure 5: T means the TTA Reviser’s maximum iterations [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
read the original abstract

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state-of-the-art results. Codes are available at https://github.com/eeee-sys/MODF-SIR, demo is available at https://huggingface.co/spaces/Harry-1234/MODF-SIR, LoRA is available at https://huggingface.co/Harry-1234/MODF-SIR and the dataset for training router is available at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight Multimodal Large Language Model (MLLM) for social intelligence reasoning. Training and inference are augmented by knowledge distillation; multi-modal social data is localized, long-tail events are extracted and rendered as explicit formatted text to avoid overshadowing during tokenization, and the pipeline incorporates distillation-enhanced Test-Time Adaptation (TTA) with Chain-of-Thought prompting, self-reflection, and LoRA fine-tuning for instance-level reasoning. The authors claim state-of-the-art results across multiple benchmarks while using only around 30% of the training data from IntentTrain.

Significance. If the empirical claims hold and the gains are shown to arise from the architecture rather than benchmark-specific tuning, the combination of multi-agent collaboration, explicit long-tail event handling, and distillation-enhanced TTA on a lightweight MLLM could provide an efficient, data-frugal approach to social reasoning tasks, with potential impact on multi-agent MLLM systems.

major comments (2)
  1. [Abstract] Abstract: the central claim of achieving state-of-the-art results 'across multiple benchmarks' is unsupported by any named benchmarks, baseline models, metrics, error bars, ablation studies, or held-out validation details, rendering the contribution impossible to evaluate.
  2. [Abstract] Abstract: no evidence is supplied that performance improvements derive from multi-agent collaboration, long-tail event extraction, or distillation-enhanced TTA rather than data selection, prompt engineering, or test-time adaptation tuned to the (unspecified) benchmarks.
minor comments (1)
  1. [Abstract] Abstract: the term 'omni-modal' in the title is not defined or distinguished from the 'multi-modal' usage in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting issues with the abstract. We agree that the abstract requires revision to include specific details on benchmarks, baselines, metrics, and evidence for component contributions. The full manuscript contains these elements in the experiments and ablations sections, but we will update the abstract and clarify attributions in a revised version.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of achieving state-of-the-art results 'across multiple benchmarks' is unsupported by any named benchmarks, baseline models, metrics, error bars, ablation studies, or held-out validation details, rendering the contribution impossible to evaluate.

    Authors: We agree the abstract is too high-level and omits these specifics, making evaluation difficult from the abstract alone. The full manuscript reports results on named benchmarks including IntentTrain and others, with comparisons to open-source and proprietary baselines, specific metrics, ablation studies, and held-out validation. Error bars appear in experimental figures. We will revise the abstract to name the benchmarks, list key baselines and metrics, reference the ablation studies, and note the use of 30% training data from IntentTrain. revision: yes

  2. Referee: [Abstract] Abstract: no evidence is supplied that performance improvements derive from multi-agent collaboration, long-tail event extraction, or distillation-enhanced TTA rather than data selection, prompt engineering, or test-time adaptation tuned to the (unspecified) benchmarks.

    Authors: The manuscript includes dedicated ablation studies and analyses in the experiments section that isolate the contributions of multi-agent collaboration, long-tail event extraction/rendering as formatted text, and distillation-enhanced TTA with LoRA and self-reflection. These show gains beyond basic TTA or prompt engineering. However, the abstract does not reference this evidence. We will revise the abstract to briefly note these components and their demonstrated roles, and ensure the main text more explicitly contrasts against alternative explanations such as data selection. No new experiments are required as the ablations already address this. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks

full rationale

The paper proposes an engineering framework (multi-agent collaboration, long-tail event extraction to text, distillation-enhanced TTA, LoRA) and reports empirical SOTA results on benchmarks using ~30% IntentTrain data. No equations, derivations, or mathematical claims exist that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. All load-bearing claims rest on external benchmark evaluations, which are falsifiable outside the paper. No self-citations are invoked as uniqueness theorems or to justify ansatzes. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5834 in / 1045 out tokens · 17181 ms · 2026-06-27T09:42:23.257503+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 14 linked inside Pith

  1. [1]

    Machine theory of mind,

    N. Rabinowitz, F. Perbet, F. Song, C. Zhang, S. A. Eslami, and M. Botvinick, “Machine theory of mind,” inInternational conference on machine learning. PMLR, 2018, pp. 4218–4227

  2. [2]

    Un- derstanding and sharing intentions: The origins of cultural cognition,

    M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll, “Un- derstanding and sharing intentions: The origins of cultural cognition,” Behavioral and brain sciences, vol. 28, no. 5, pp. 675–691, 2005

  3. [3]

    De- constructing and reconstructing theory of mind,

    S. M. Schaafsma, D. W. Pfaff, R. P. Spunt, and R. Adolphs, “De- constructing and reconstructing theory of mind,”Trends in cognitive sciences, vol. 19, no. 2, pp. 65–72, 2015

  4. [4]

    The neural basis of mentalizing,

    C. D. Frith and U. Frith, “The neural basis of mentalizing,”Neuron, vol. 50, no. 4, pp. 531–534, 2006

  5. [5]

    Mtag: Modal-temporal attention graph for unaligned human multimodal language sequences,

    J. Yang, Y . Wang, R. Yi, Y . Zhu, A. Rehman, A. Zadeh, S. Poria, and L.-P. Morency, “Mtag: Modal-temporal attention graph for unaligned human multimodal language sequences,” inProceedings of the 2021 conference of the North American chapter of the association for com- putational linguistics: human language technologies, 2021, pp. 1009– 1021

  6. [6]

    The multimodal facilitation effect in human communication,

    L. Drijvers and J. Holler, “The multimodal facilitation effect in human communication,”Psychonomic Bulletin & Review, vol. 30, no. 2, pp. 792–801, 2023

  7. [7]

    Kahneman,Thinking, fast and slow

    D. Kahneman,Thinking, fast and slow. macmillan, 2011

  8. [8]

    Judging llm-as-a-judge with mt-bench and chatbot arena,

    L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,”Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

  9. [9]

    Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,

    J. Hong, S. Yan, J. Cai, X. Jiang, Y . Hu, and W. Xie, “Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,” arXiv preprint arXiv:2502.04326, 2025. 10

  10. [10]

    Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,

    Z. Zhou, R. Wang, and Z. Wu, “Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,”arXiv preprint arXiv:2505.17862, 2025

  11. [11]

    Humanomniv2: From understanding to omni- modal reasoning with context,

    Q. Yang, S. Yao, W. Chen, S. Fu, D. Bai, J. Zhao, B. Sun, B. Yin, X. Wei, and J. Zhou, “Humanomniv2: From understanding to omni- modal reasoning with context,”arXiv preprint arXiv:2506.21277, 2025

  12. [12]

    Gpt-4o system card,

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

  13. [13]

    Gemini 2.5 Pro,

    Google, “Gemini 2.5 Pro,” https://deepmind.google/technologies/gemini/ pro/, 2025

  14. [14]

    Show and tell: Lessons learned from the 2015 mscoco image captioning challenge,

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 mscoco image captioning challenge,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 652–663, 2016

  15. [15]

    Tgif-qa: Toward spatio- temporal reasoning in visual question answering,

    Y . Jang, Y . Song, Y . Yu, Y . Kim, and G. Kim, “Tgif-qa: Toward spatio- temporal reasoning in visual question answering,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2758–2766

  16. [16]

    A survey on video moment localization,

    M. Liu, L. Nie, Y . Wang, M. Wang, and Y . Rui, “A survey on video moment localization,”ACM Computing Surveys, vol. 55, no. 9, pp. 1–37, 2023

  17. [17]

    Self-consistency improves chain of thought reasoning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdh- ery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,”arXiv preprint arXiv:2203.11171, 2022

  18. [18]

    Re- flexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Re- flexion: Language agents with verbal reinforcement learning,”Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023

  19. [19]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  20. [20]

    Modularized self-reflected video reasoner for multimodal llm with application to video question answering,

    Z. Song, X. Wang, Z. Qian, H. Chen, L. Huang, H. Xue, and W. Zhu, “Modularized self-reflected video reasoner for multimodal llm with application to video question answering,” inForty-second International Conference on Machine Learning, 2025

  21. [21]

    From verbatim to gist: Distilling pyramidal multimodal memory via semantic information bottleneck for long-horizon video agents,

    N. Lian, Y . Wang, H. Yao, J. Wang, B. Chen, Y . Wang, M. Zhang, and S.-T. Xia, “From verbatim to gist: Distilling pyramidal multimodal memory via semantic information bottleneck for long-horizon video agents,”arXiv preprint arXiv:2603.01455, 2026

  22. [22]

    Where llm agents fail and how they can learn from failures,

    K. Zhu, Z. Liu, B. Li, M. Tian, Y . Yang, J. Zhang, P. Han, Q. Xie, F. Cui, W. Zhanget al., “Where llm agents fail and how they can learn from failures,”arXiv preprint arXiv:2509.25370, 2025

  23. [23]

    From denoising to refining: A corrective framework for vision-language dif- fusion model,

    Y . Ji, T. Wang, Y . Ge, Z. Liu, S. Yang, Y . Shan, and P. Luo, “From denoising to refining: A corrective framework for vision-language dif- fusion model,”arXiv preprint arXiv:2510.19871, 2025

  24. [24]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  25. [25]

    An empirical study of catastrophic forgetting in large language models during con- tinual fine-tuning,

    Y . Luo, Z. Yang, F. Meng, Y . Li, J. Zhou, and Y . Zhang, “An empirical study of catastrophic forgetting in large language models during con- tinual fine-tuning,”IEEE Transactions on Audio, Speech and Language Processing, 2025

  26. [26]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  27. [27]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,

    A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 2630–2640

  28. [28]

    Tall: Temporal activity local- ization via language query,

    J. Gao, C. Sun, Z. Yang, and R. Nevatia, “Tall: Temporal activity local- ization via language query,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 5267–5275

  29. [29]

    Merlot: Multimodal neural script knowledge models,

    R. Zellers, X. Lu, J. Hessel, Y . Yu, J. S. Park, J. Cao, A. Farhadi, and Y . Choi, “Merlot: Multimodal neural script knowledge models,” Advances in neural information processing systems, vol. 34, pp. 23 634– 23 651, 2021

  30. [30]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  31. [31]

    Qwen3-omni technical report,

    J. Xu, Z. Guo, H. Hu, Y . Chu, X. Wang, J. He, Y . Wang, X. Shi, T. He, X. Zhuet al., “Qwen3-omni technical report,”arXiv preprint arXiv:2509.17765, 2025

  32. [32]

    Shortcut learning in deep neural networks,

    R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,”Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020

  33. [33]

    Winoground: Probing vision and language models for visio- linguistic compositionality,

    T. Thrush, R. Jiang, M. Bartolo, A. Singh, A. Williams, D. Kiela, and C. Ross, “Winoground: Probing vision and language models for visio- linguistic compositionality,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2022, pp. 5238–5248

  34. [34]

    Ov-mer: Towards open-vocabulary multimodal emotion recognition,

    Z. Lian, H. Sun, L. Sun, H. Chen, L. Chen, H. Gu, Z. Wen, S. Chen, S. Zhang, H. Yaoet al., “Ov-mer: Towards open-vocabulary multimodal emotion recognition,”arXiv preprint arXiv:2410.01495, 2024

  35. [35]

    Deep long-tailed learning: A survey,

    Y . Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng, “Deep long-tailed learning: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 9, pp. 10 795–10 816, 2023

  36. [36]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

  37. [37]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models,

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wuet al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

  38. [38]

    The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stuttering, activity, & mosquitoes,

    B. Schuller, A. Batliner, S. Amiriparian, C. Bergler, M. Gerczuk, N. Holz, P. Larrouy-Maestri, S. Bayerl, K. Riedhammer, A. Mallol- Ragoltaet al., “The acm multimedia 2022 computational paralinguistics challenge: V ocalisations, stuttering, activity, & mosquitoes,” inProceed- ings of the 30th ACM International Conference on Multimedia, 2022, pp. 7120–7124

  39. [39]

    Darwin, deception, and facial expression,

    P. Ekman, “Darwin, deception, and facial expression,”Annals of the new York Academy of sciences, vol. 1000, no. 1, pp. 205–221, 2003

  40. [40]

    The” something something

    R. Goyal, S. Ebrahimi Kahou, V . Michalski, J. Materzynska, S. West- phal, H. Kim, V . Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitaget al., “The” something something” video database for learning and evaluat- ing visual common sense,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 5842–5850

  41. [41]

    Slowfast networks for video recognition,

    C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211

  42. [42]

    Multimodal transformer for unaligned multimodal language sequences,

    Y .-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 6558–6569

  43. [43]

    Let’s verify step by step,

    H. Lightman, V . Kosaraju, Y . Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” inThe twelfth international conference on learning represen- tations, 2023

  44. [44]

    Concrete problems in ai safety,

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man ´e, “Concrete problems in ai safety,”arXiv preprint arXiv:1606.06565, 2016

  45. [45]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action,

    J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi, “Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 439–26 455

  46. [46]

    Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms,

    Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhaoet al., “Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms,”arXiv preprint arXiv:2406.07476, 2024

  47. [47]

    Ola: Pushing the frontiers of omni-modal language model,

    Z. Liu, Y . Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y . Rao, “Ola: Pushing the frontiers of omni-modal language model,”arXiv preprint arXiv:2502.04328, 2025

  48. [48]

    Openai o1 system card,

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carneyet al., “Openai o1 system card,”arXiv preprint arXiv:2412.16720, 2024

  49. [49]

    Minicpm-v: A gpt-4v level mllm on your phone,

    Y . Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. Heet al., “Minicpm-v: A gpt-4v level mllm on your phone,”arXiv preprint arXiv:2408.01800, 2024

  50. [50]

    Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,

    C. Fu, H. Lin, X. Wang, Y .-F. Zhang, Y . Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Liet al., “Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,”arXiv preprint arXiv:2501.01957, 2025

  51. [51]

    Introducing the next generation of claude,

    A. Anthropic, “Introducing the next generation of claude,”https://www. anthropic. com/news/claude-3-family, 2024

  52. [52]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,

    G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wanget al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024