
arxiv: 2508.10016 · v4 · pith:JV7LWPYNnew · submitted 2025-08-06 · 💻 cs.CL

Training-Free Multimodal Large Language Model Orchestration

Pith reviewed 2026-05-25 07:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords training-free · multimodal orchestration · LLM controller · modality experts · cross-modal memory · omni-modal systems · intent inference

The pith

An unmodified LLM can orchestrate separate modality experts into a multimodal system without joint training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLM Orchestration as a framework that combines off-the-shelf experts for different modalities by using an existing language model to decide routing and sequencing. The controller outputs explicit control tokens based on user intent, stores compressed evidence in a shared text memory for reuse across turns, and relies on a unified layer to manage interactions including streaming and interruptions. This setup is claimed to deliver competitive results on standard multimodal benchmarks while keeping overhead low and allowing new experts to be added without retraining. A reader would care because current multimodal systems often demand expensive end-to-end training to align modalities, and this method avoids that requirement entirely.

Core claim

LLM Orchestration integrates off-the-shelf modality experts into a unified input-output system through three components: an LLM controller that infers intent from multimodal context and emits explicit control tokens for expert selection and sequencing, a text-centric cross-modal memory that compresses evidence into structured records for retrieval, and a unified interaction layer that executes routing decisions to support modality transitions and interruption-aware dialogue, all without any additional gradient-based training.

What carries the argument

The LLM controller that infers user intent and emits explicit control tokens to select and sequence modality experts.
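
The routing step can be sketched in a few lines. This is only an illustration under stated assumptions: the `<NAME>` token syntax, the `parse_control_tokens` and `dispatch` helpers, and the stub experts are all hypothetical, since this page does not give the paper's actual token format or expert interfaces. Each selected expert is simply called in the emitted order on the same payload.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed token syntax "<NAME>"; the paper's real control-token format is not shown here.
CONTROL_TOKEN = re.compile(r"<([A-Z_]+)>")


def parse_control_tokens(controller_output: str) -> Tuple[List[str], str]:
    """Return the ordered expert names and the remaining response text."""
    names = CONTROL_TOKEN.findall(controller_output)
    text = CONTROL_TOKEN.sub("", controller_output).strip()
    return names, text


def dispatch(
    controller_output: str,
    experts: Dict[str, Callable[[str], str]],
    payload: str,
) -> List[str]:
    """Call each selected expert, in the emitted order, on the same payload."""
    names, _ = parse_control_tokens(controller_output)
    results = []
    for name in names:
        if name not in experts:
            # An unknown token is a routing error, surfaced explicitly for auditing.
            raise KeyError(f"unknown expert token: {name}")
        results.append(experts[name](payload))
    return results


# Hypothetical stub experts standing in for real ASR / vision models.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "ASR": lambda x: f"transcript({x})",
    "VISION": lambda x: f"caption({x})",
}
```

Because the routing decision is plain text, a failed lookup points directly at the token the controller emitted, which is the auditability property the paper emphasizes.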

If this is right

  • Strong performance across diverse multimodal benchmarks under standard evaluation constraints.
  • Low orchestration overhead relative to end-to-end trained systems.
  • Modular upgradeability that supports swapping or adding experts without retraining the controller.
  • A practical route to omni-modal systems that avoids the data and compute costs of joint training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit token-based routing could make error tracing easier in deployed systems than in fully trained multimodal models.
  • New modalities could be incorporated simply by registering additional experts and updating the controller's selection vocabulary.
  • Text compression in the memory might constrain performance on very long multimodal sessions due to context length limits.
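
The second extension above (registering experts and updating the selection vocabulary) might look like the following sketch. The `ExpertRegistry` class, its method names, and the prompt wording are assumptions made for illustration; nothing here is taken from the paper.

```python
from typing import Callable, Dict


class ExpertRegistry:
    """Hypothetical registry that keeps experts and their prompt descriptions in sync."""

    def __init__(self) -> None:
        self._experts: Dict[str, Callable[[str], str]] = {}
        self._descriptions: Dict[str, str] = {}

    def register(self, token: str, description: str, fn: Callable[[str], str]) -> None:
        """Add an expert; refuse to silently overwrite an existing token."""
        if token in self._experts:
            raise ValueError(f"token already registered: {token}")
        self._experts[token] = fn
        self._descriptions[token] = description

    def get(self, token: str) -> Callable[[str], str]:
        return self._experts[token]

    def vocabulary_prompt(self) -> str:
        """Render the selection vocabulary that would be placed in the controller prompt."""
        lines = [f"<{t}>: {d}" for t, d in sorted(self._descriptions.items())]
        return "Available control tokens:\n" + "\n".join(lines)
```

Under this design, adding a modality changes only the registry and the rendered prompt; the controller's weights stay untouched, which is the point of a training-free setup.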

Load-bearing premise

An off-the-shelf LLM can reliably infer user intent from multimodal context and emit correct control tokens for expert selection and sequencing without any training.

What would settle it

A set of multimodal benchmark examples where the controller repeatedly chooses the wrong expert or sequence, causing overall accuracy to fall well below that of jointly trained multimodal models.

Figures

Figures reproduced from arXiv: 2508.10016 by Jiayi Ji, Rongrong Ji, Tat-Seng Chua, Tianyu Xie, Wang Chen, Xiawu Zheng, Yuexiao Ma, Yuhang Wu.

Figure 1. Illustrates the training procedures of VITA (a) and our Training-Free Multimodal Large Language Model …
Figure 2. Overview of the MLLM Orchestration framework, featuring core components such as the Central Controller …
Figure 3. Performance comparison on Video-MME benchmark. Our orchestration mechanism achieves consistent …
Figure 4. TTS processing architecture comparison showing significant improvements in both speed and stability with …
Original abstract

Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes LLM Orchestration, a training-free framework for multimodal systems that uses an off-the-shelf LLM controller to infer user intent and emit explicit control tokens for expert selection/sequencing, a text-centric cross-modal memory for compressing and retrieving multimodal evidence, and a unified interaction layer for routing, streaming, and interruption-aware dialogue. It claims this integrates modality experts into a unified input-output system, achieving strong performance across diverse multimodal benchmarks with low orchestration overhead and modular upgradeability, as an alternative to costly joint training.

Significance. If the central mechanism holds, the approach would offer a practical, extensible alternative to end-to-end multimodal alignment by avoiding gradient-based integration costs while preserving auditability and upgradeability. The modular design and emphasis on protocol-constrained routing are potentially valuable for omni-modal assistants. However, without isolated controller metrics, it remains unverified whether the reported results come from the orchestration itself rather than from benchmark leniency or the downstream components.

major comments (1)
  1. [Abstract and §3] (LLM controller description): The central claim that an untrained LLM controller reliably infers multimodal intent and emits correct explicit control tokens for expert selection/sequencing is load-bearing for the training-free advantage, yet the manuscript provides no quantitative isolation of controller token accuracy, failure modes, sensitivity to prompt/LLM choice, or error rates. Without these metrics, benchmark results cannot be attributed to the orchestration mechanism rather than other factors.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment identifies a valid gap in isolating the controller's contribution, which we address below by committing to additional analysis in the revision.

Point-by-point responses
  1. Referee: [Abstract and §3] (LLM controller description): The central claim that an untrained LLM controller reliably infers multimodal intent and emits correct explicit control tokens for expert selection/sequencing is load-bearing for the training-free advantage, yet the manuscript provides no quantitative isolation of controller token accuracy, failure modes, sensitivity to prompt/LLM choice, or error rates. Without these metrics, benchmark results cannot be attributed to the orchestration mechanism rather than other factors.

    Authors: We acknowledge that the manuscript does not include isolated quantitative metrics on controller token accuracy, failure modes, or sensitivity to prompt/LLM choice. The reported results focus on end-to-end benchmark performance to demonstrate the overall viability of the training-free framework. To better attribute performance to the orchestration mechanism, we will add a dedicated analysis in the revised §3 and experiments section. This will include controller token prediction accuracy on a set of multimodal intent examples, error categorization, and sensitivity tests across LLM variants and prompt formulations.

    Revision: yes
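
One way the promised controller analysis could be computed is sketched below. The error categories, the `categorize` and `controller_report` helpers, and the exact-match definition of accuracy are assumptions chosen for illustration, not the authors' protocol.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def categorize(pred: Sequence[str], gold: Sequence[str]) -> str:
    """Assign one hypothetical error category to a predicted token sequence."""
    if list(pred) == list(gold):
        return "correct"
    if sorted(pred) == sorted(gold):
        return "wrong_order"      # right experts, wrong sequencing
    if set(pred) < set(gold):
        return "missing_expert"   # strict subset of the required experts
    if set(pred) > set(gold):
        return "extra_expert"     # strict superset of the required experts
    return "wrong_expert"         # anything else, including substitutions


def controller_report(
    pairs: List[Tuple[Sequence[str], Sequence[str]]]
) -> Tuple[float, Dict[str, int]]:
    """Return exact-match sequence accuracy and a count per error category."""
    counts = Counter(categorize(pred, gold) for pred, gold in pairs)
    total = sum(counts.values())
    accuracy = counts["correct"] / total if total else 0.0
    return accuracy, dict(counts)
```

Running the same report per prompt variant and per controller LLM would give the sensitivity comparison the referee asks for, independent of downstream expert quality.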

Circularity Check

0 steps flagged

No circularity in framework description or claims


The paper describes a training-free orchestration system built from off-the-shelf components (LLM controller, memory, interaction layer) and reports benchmark results. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations are present; the central claims rest on empirical evaluation rather than internal construction from the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the central claim rests on the domain assumption that an unmodified LLM can perform reliable intent inference and token emission for routing, plus the assumption that modality experts remain effective when called via external protocol.

axioms (2)
  • domain assumption: An off-the-shelf LLM can infer user intent and emit correct control tokens for expert selection without training.
    Invoked in the description of the LLM controller component.
  • domain assumption: Text-centric compression of multimodal evidence preserves enough information for lightweight retrieval across dialogue turns.
    Invoked in the cross-modal memory component.
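
The second assumption can be made concrete with a small sketch of a text-centric memory that reuses stored evidence instead of calling an expert again. The record fields, the `TextMemory` class, and the `describe` helper are hypothetical; the page does not specify the paper's record schema or retrieval method.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MemoryRecord:
    """Hypothetical structured record: a text summary of one piece of evidence."""
    turn: int
    modality: str
    source_id: str
    summary: str


@dataclass
class TextMemory:
    records: List[MemoryRecord] = field(default_factory=list)

    def write(self, record: MemoryRecord) -> None:
        self.records.append(record)

    def lookup(self, source_id: str, modality: str) -> Optional[MemoryRecord]:
        """Return the most recent record for this source and modality, if any."""
        for record in reversed(self.records):
            if record.source_id == source_id and record.modality == modality:
                return record
        return None


def describe(
    memory: TextMemory,
    expert: Callable[[str], str],
    source_id: str,
    modality: str,
    turn: int,
) -> str:
    """Reuse a stored summary when available; otherwise call the expert once and store it."""
    cached = memory.lookup(source_id, modality)
    if cached is not None:
        return cached.summary
    summary = expert(source_id)
    memory.write(MemoryRecord(turn, modality, source_id, summary))
    return summary
```

Whether a short text summary retains enough detail for later turns is exactly what the axiom takes for granted; the sketch only shows where that loss would occur.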

pith-pipeline@v0.9.0 · 5732 in / 1288 out tokens · 29237 ms · 2026-05-25T07:59:19.233560+00:00 · methodology


