arXiv: 2506.04565 · v2 · submitted 2025-06-05 · 💻 cs.MA · cs.CL

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

Pith reviewed 2026-05-19 11:45 UTC · model grok-4.3

classification 💻 cs.MA cs.CL
keywords Compound AI Systems · Large Language Models · Retrieval-Augmented Generation · LLM Agents · Multimodal LLMs · Orchestration · Taxonomy · System Integration

The pith

Compound AI Systems integrate LLMs with retrievers, agents, tools, and orchestrators to surpass standalone model limits in memory, reasoning, and multimodal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey defines Compound AI Systems as combinations of large language models with external specialized modules to handle tasks that require ongoing context, real-time information, and coordinated actions. It proposes a taxonomy organized around component roles and how those components are sequenced or managed together. The work then examines four main patterns in use today, compares their practical trade-offs, and flags open problems in making such systems scale and work together reliably. A reader would care because it supplies a shared language and starting map for moving from isolated models to assembled intelligent pipelines.

Core claim

The paper claims that Compound AI Systems overcome the inherent constraints of standalone large language models by deliberately composing them with retrievers for external knowledge, agents for sequential decision making, tools for action execution, and orchestrators for workflow control, and that a taxonomy based on these roles and orchestration patterns can organize the currently scattered literature and guide future system design.
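
The composition described above can be illustrated with a minimal sketch. It is not taken from the survey: the `Orchestrator` class, the `tool:arg` routing syntax, the word-overlap retriever, and the `stub_llm` function are all hypothetical stand-ins chosen only to show how a retriever, a tool, and a workflow controller might sit around a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only echoes the prompt it was given."""
    return f"ANSWER based on: {prompt}"


@dataclass
class Orchestrator:
    """Hypothetical workflow controller that picks between a tool and retrieval."""
    corpus: List[str]
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def run(self, query: str) -> str:
        # Route to a tool when the query uses the made-up "tool:arg" syntax.
        name, sep, arg = query.partition(":")
        if sep and name in self.tools:
            return self.tools[name](arg)
        # Otherwise ground the model call with retrieved context.
        context = " ".join(retrieve(query, self.corpus))
        return stub_llm(f"{context} | {query}")


system = Orchestrator(
    corpus=["RAG adds retrieved passages to the prompt.", "Agents plan multi-step actions."],
    tools={"upper": str.upper},
)
print(system.run("upper:hello"))
print(system.run("what does RAG add"))
```

In a real system the routing decision would typically be made by the model or by a learned policy rather than by string matching; the fixed rule here only keeps the sketch self-contained.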

What carries the argument

A multi-dimensional taxonomy organized by component roles and orchestration strategies, used to classify and compare systems across the four paradigms of retrieval-augmented generation, LLM agents, multimodal LLMs, and explicit orchestration.
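
One way to picture such a taxonomy is as a small record type with one field per dimension. The sketch below is an assumption-laden illustration, not the survey's own schema: the four paradigms and four component roles are taken from the text above, while the `SystemEntry` type, the free-form `orchestration` labels, and the `system-a`/`system-b`/`system-c` entries are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Paradigm(Enum):
    RAG = "retrieval-augmented generation"
    AGENT = "LLM agents"
    MLLM = "multimodal LLMs"
    ORCHESTRATION = "orchestration"


class Role(Enum):
    RETRIEVER = "retriever"
    AGENT = "agent"
    TOOL = "tool"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class SystemEntry:
    name: str
    paradigm: Paradigm
    roles: FrozenSet[Role]
    orchestration: str  # free-form placeholder; the survey's categories are not reproduced here


def group_by_paradigm(entries: List[SystemEntry]) -> Dict[Paradigm, List[str]]:
    """Collect system names under the paradigm each one is filed under."""
    groups: Dict[Paradigm, List[str]] = defaultdict(list)
    for e in entries:
        groups[e.paradigm].append(e.name)
    return dict(groups)


# Hypothetical entries, named generically to avoid implying real classifications.
entries = [
    SystemEntry("system-a", Paradigm.RAG, frozenset({Role.RETRIEVER}), "fixed pipeline"),
    SystemEntry("system-b", Paradigm.AGENT, frozenset({Role.AGENT, Role.TOOL}), "model-driven loop"),
    SystemEntry("system-c", Paradigm.RAG, frozenset({Role.RETRIEVER, Role.TOOL}), "fixed pipeline"),
]
groups = group_by_paradigm(entries)
print(groups[Paradigm.RAG])
```

Keeping roles as a set rather than a single value reflects the idea that one system can combine several component roles, which is what makes cross-paradigm comparison possible.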

If this is right

  • Representative systems in each paradigm can be directly compared on design trade-offs such as latency versus accuracy.
  • Standardized evaluation methods become feasible once systems are placed in the same taxonomy.
  • Identified challenges in scalability, interoperability, and coordination point to concrete next research steps.
  • Practitioners gain a map for choosing which components to add when building task-specific pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could be tested by applying it to new systems released after the survey to measure how well it accommodates innovation.
  • Connections between orchestration strategies and existing workflow engines in software engineering may speed up adoption.
  • If coordination challenges remain unsolved, hybrid human-AI oversight layers may become a necessary addition to the described architectures.

Load-bearing premise

The existing collection of Compound AI Systems is too scattered and lacks any shared framework, so a new taxonomy is needed before the field can advance systematically.

What would settle it

A later survey or benchmark study that shows the proposed taxonomy fails to group real systems into coherent categories or that a simpler existing classification already captures all major design choices without gaps.

Figures

Figures reproduced from arXiv: 2506.04565 by Guiling Wang, Jiayi Chen, Junyi Ye.

Figure 1: Taxonomy of Retrieval-Augmented Generation (RAG) systems. The diagram categorizes the core components into three ...
Figure 2: A structured overview of LLM agents across three dimensions: application scenarios (e.g., general-purpose, embodied), agent ...
Figure 3: Overview of Multimodal Large Language Models. The diagram categorizes MLLMs by architecture, fusion strategy, modality, ...
Figure 4: An overview of the Compound AI System orchestration layer, highlighting the relationship between its structure (e.g., ...
Original abstract

Compound AI Systems (CAIS) are an emerging paradigm that integrates large language models (LLMs) with external components, including retrievers, agents, tools, and orchestrators, to overcome the limitations of standalone models in tasks requiring memory, reasoning, real-time grounding, and multimodal understanding. These systems enable more capable and context-aware behaviors by composing multiple specialized modules into cohesive workflows. Despite growing adoption in both academia and industry, the CAIS landscape remains fragmented and lacks a unified framework for analysis, taxonomy, and evaluation. In this survey, we define the concept of CAIS, propose a multi-dimensional taxonomy based on component roles and orchestration strategies, and analyze four foundational paradigms: Retrieval-Augmented Generation (RAG), LLM Agents, Multimodal LLMs (MLLMs), and Orchestration. We review representative systems, compare design trade-offs, and summarize evaluation methodologies across these paradigms. Finally, we identify key challenges - including scalability, interoperability, benchmarking, and coordination - and outline promising directions for future research. This survey aims to provide researchers and practitioners with a comprehensive foundation for understanding, developing, and advancing the next generation of system-level artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript surveys Compound AI Systems (CAIS), defining them as integrations of LLMs with external components (retrievers, agents, tools, orchestrators) to overcome standalone LLM limitations in memory, reasoning, real-time grounding, and multimodal tasks. It proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four paradigms (RAG, LLM Agents, MLLMs, Orchestration), analyzes representative systems and trade-offs, summarizes evaluation methodologies, and outlines challenges including scalability, interoperability, benchmarking, and coordination.

Significance. If the taxonomy and synthesis hold, the work provides a useful organizational framework for an emerging, fragmented area of system-level AI. It synthesizes literature across paradigms without advancing unsubstantiated mathematical claims and highlights actionable open problems for researchers and practitioners.

minor comments (3)
  1. The abstract states that the CAIS landscape 'lacks a unified framework' and that the proposed taxonomy addresses this, but the manuscript should include an explicit comparison (e.g., in the introduction or related-work section) to prior surveys on RAG, agents, or multimodal systems to substantiate the novelty of the multi-dimensional taxonomy.
  2. The four paradigms are listed without an accompanying table or diagram that maps each to the taxonomy dimensions (component roles and orchestration strategies); adding such a summary table would improve clarity and allow readers to see the taxonomy in action.
  3. Evaluation methodologies are summarized across paradigms, but the manuscript would benefit from a brief discussion of common pitfalls (e.g., contamination in RAG benchmarks or agent trajectory evaluation) to strengthen the practical guidance.
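
The contamination pitfall named in comment 3 can be made concrete with a small sketch. This is one simple heuristic, assumed here for illustration and not a method described in the manuscript or the report: it flags benchmark items whose word trigrams largely reappear in a reference text (for example, a retrieval corpus or known training data). The function names, the trigram size, and the 0.5 threshold are all hypothetical choices.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, reference: str, n: int = 3) -> float:
    """Fraction of the item's n-grams that also occur in the reference text."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # items shorter than n tokens cannot be scored
    return len(item_grams & ngrams(reference, n)) / len(item_grams)


def flag_contaminated(items: List[str], reference: str, threshold: float = 0.5) -> List[str]:
    """Return the items whose overlap with the reference meets the threshold."""
    return [it for it in items if overlap_ratio(it, reference) >= threshold]


reference = "the capital of france is paris and it lies on the seine"
items = ["The capital of France is Paris", "Which river runs through Vienna"]
print(flag_contaminated(items, reference))
```

Exact n-gram overlap misses paraphrased leakage, so a check like this is at best a lower bound on contamination.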

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive and positive review of our survey on Compound AI Systems. The summary accurately reflects the scope, taxonomy, and contributions of the manuscript. We appreciate the recommendation for minor revision and will incorporate improvements to enhance clarity and completeness.

Circularity Check

0 steps flagged

No significant circularity

This survey paper defines CAIS as integrations of LLMs with external components to address capability gaps and proposes a multi-dimensional taxonomy based on component roles and orchestration strategies. It catalogs four paradigms (RAG, LLM Agents, MLLMs, Orchestration) by reviewing representative systems, trade-offs, and evaluations drawn from external prior literature. No equations, fitted parameters, predictions, or derivations appear; the central organizational claim that the field lacks a unified framework is addressed directly by the survey's synthesis rather than reducing to any self-referential input, self-citation chain, or ansatz. The argument remains self-contained as a literature review without load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that synthesizes existing concepts from the literature on AI systems rather than introducing new mathematical derivations, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5735 in / 1058 out tokens · 67245 ms · 2026-05-19T11:45:59.245336+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uncertainty Propagation in LLM-Based Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

  2. A Survey of Context Engineering for Large Language Models

    cs.CL 2025-07 accept novelty 4.0

    The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

Reference graph

Works this paper leans on

232 extracted references · 232 canonical work pages · cited by 2 Pith papers · 57 internal anchors
