From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

Guiling Wang; Jiayi Chen; Junyi Ye

arXiv: 2506.04565 · v2 · submitted 2025-06-05 · cs.MA · cs.CL

Pith reviewed 2026-05-19 11:45 UTC · model grok-4.3

keywords: Compound AI Systems, Large Language Models, Retrieval-Augmented Generation, LLM Agents, Multimodal LLMs, Orchestration, Taxonomy, System Integration

The pith

Compound AI Systems integrate LLMs with retrievers, agents, tools, and orchestrators to surpass standalone model limits in memory, reasoning, and multimodal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey defines Compound AI Systems as combinations of large language models with external specialized modules to handle tasks that require ongoing context, real-time information, and coordinated actions. It proposes a taxonomy organized around component roles and how those components are sequenced or managed together. The work then examines four main patterns in use today, compares their practical trade-offs, and flags open problems in making such systems scale and work together reliably. A reader would care because it supplies a shared language and starting map for moving from isolated models to assembled intelligent pipelines.

Core claim

The paper claims that Compound AI Systems overcome the inherent constraints of standalone large language models by deliberately composing them with retrievers for external knowledge, agents for sequential decision making, tools for action execution, and orchestrators for workflow control, and that a taxonomy based on these roles and orchestration patterns can organize the currently scattered literature and guide future system design.
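The composition described above can be illustrated with a toy sketch. It is not taken from the paper: `stub_llm`, `KeywordRetriever`, and `Orchestrator` are hypothetical names, the LLM is a stub that makes no model call, and the workflow is a fixed retrieve, call tools, then generate sequence with no agent loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only reports the prompt size."""
    return f"answer based on {len(prompt.splitlines())} prompt lines"


@dataclass
class KeywordRetriever:
    """Toy retriever (assumption): ranks documents by word overlap with the query."""
    documents: List[str]

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(doc.lower().split())), doc)
            for doc in self.documents
        ]
        # sorted() is stable, so ties keep their original document order.
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked[:k] if score > 0]


@dataclass
class Orchestrator:
    """Toy workflow controller: retrieve, call every tool, then ask the LLM."""
    llm: Callable[[str], str]
    retriever: KeywordRetriever
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def run(self, query: str) -> str:
        context = self.retriever.retrieve(query)
        tool_outputs = [f"{name}: {tool(query)}" for name, tool in self.tools.items()]
        # One prompt line per retrieved document, per tool output, plus the query.
        prompt = "\n".join([*context, *tool_outputs, query])
        return self.llm(prompt)


if __name__ == "__main__":
    orchestrator = Orchestrator(
        llm=stub_llm,
        retriever=KeywordRetriever(["RAG adds retrieved passages to the prompt"]),
        tools={"word_count": lambda q: str(len(q.split()))},
    )
    print(orchestrator.run("how does RAG use retrieved passages"))
```

Swapping `stub_llm` for a real model client, or replacing the fixed sequence in `run` with a loop that lets the model choose the next step, would move the sketch toward the agent-style systems the survey covers.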

What carries the argument

A multi-dimensional taxonomy organized by component roles and orchestration strategies, used to classify and compare systems across the four paradigms of retrieval-augmented generation, LLM agents, multimodal LLMs, and explicit orchestration.
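One way to picture a taxonomy with these two dimensions is as a small record per system. The sketch below is illustrative only: the role names follow the components listed on this page, while the `Strategy` values, the `SystemProfile` type, and the system names in the usage note are placeholders rather than categories or systems from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Role(Enum):
    """Component roles named in the survey's definition of a compound system."""
    RETRIEVER = "retriever"
    AGENT = "agent"
    TOOL = "tool"
    ORCHESTRATOR = "orchestrator"


class Strategy(Enum):
    """Placeholder orchestration strategies (assumed, not taken from the paper)."""
    FIXED_PIPELINE = "fixed_pipeline"
    MODEL_DRIVEN = "model_driven"


@dataclass(frozen=True)
class SystemProfile:
    """Position of one system along the two taxonomy dimensions."""
    name: str
    roles: FrozenSet[Role]
    strategy: Strategy


def group_by_strategy(systems: List[SystemProfile]) -> Dict[Strategy, List[str]]:
    """Group system names by orchestration strategy, keeping input order."""
    groups: Dict[Strategy, List[str]] = {}
    for system in systems:
        groups.setdefault(system.strategy, []).append(system.name)
    return groups
```

Classifying a few hypothetical systems, for example `SystemProfile("system_a", frozenset({Role.RETRIEVER}), Strategy.FIXED_PIPELINE)`, and calling `group_by_strategy` on the list gives the kind of side-by-side grouping that a shared taxonomy is meant to enable.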

If this is right

Representative systems in each paradigm can be directly compared on design trade-offs such as latency versus accuracy.
Standardized evaluation methods become feasible once systems are placed in the same taxonomy.
Identified challenges in scalability, interoperability, and coordination point to concrete next research steps.
Practitioners gain a map for choosing which components to add when building task-specific pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The taxonomy could be tested by applying it to new systems released after the survey to measure how well it accommodates innovation.
Connections between orchestration strategies and existing workflow engines in software engineering may speed up adoption.
If coordination challenges remain unsolved, hybrid human-AI oversight layers may become a necessary addition to the described architectures.

Load-bearing premise

The existing collection of Compound AI Systems is too scattered and lacks any shared framework, so a new taxonomy is needed before the field can advance systematically.

What would settle it

A later survey or benchmark study that shows the proposed taxonomy fails to group real systems into coherent categories or that a simpler existing classification already captures all major design choices without gaps.

Figures

Figures reproduced from arXiv: 2506.04565 by Guiling Wang, Jiayi Chen, Junyi Ye.

**Figure 1.** Taxonomy of Retrieval-Augmented Generation (RAG) systems. The diagram categorizes the core components into three ...

**Figure 2.** A structured overview of LLM agents across three dimensions: application scenarios (e.g., general-purpose, embodied), agent ...

**Figure 3.** Overview of Multimodal Large Language Models. The diagram categorizes MLLMs by architecture, fusion strategy, modality, ...

**Figure 4.** An overview of the Compound AI System orchestration layer, highlighting the relationship between its structure (e.g., ...

Original abstract

Compound AI Systems (CAIS) are an emerging paradigm that integrates large language models (LLMs) with external components, including retrievers, agents, tools, and orchestrators, to overcome the limitations of standalone models in tasks requiring memory, reasoning, real-time grounding, and multimodal understanding. These systems enable more capable and context-aware behaviors by composing multiple specialized modules into cohesive workflows. Despite growing adoption in both academia and industry, the CAIS landscape remains fragmented and lacks a unified framework for analysis, taxonomy, and evaluation. In this survey, we define the concept of CAIS, propose a multi-dimensional taxonomy based on component roles and orchestration strategies, and analyze four foundational paradigms: Retrieval-Augmented Generation (RAG), LLM Agents, Multimodal LLMs (MLLMs), and Orchestration. We review representative systems, compare design trade-offs, and summarize evaluation methodologies across these paradigms. Finally, we identify key challenges - including scalability, interoperability, benchmarking, and coordination - and outline promising directions for future research. This survey aims to provide researchers and practitioners with a comprehensive foundation for understanding, developing, and advancing the next generation of system-level artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This survey organizes compound AI systems with a taxonomy but stays mostly at the level of synthesis and review.

The main thing to know about this paper is that it surveys the emerging area of Compound AI Systems, which combine LLMs with other components like retrievers and agents, and it introduces a taxonomy to bring some order to the field.

The paper does well in covering the four key paradigms: Retrieval-Augmented Generation, LLM Agents, Multimodal LLMs, and Orchestration. It reviews representative systems for each, discusses design trade-offs, and summarizes how people evaluate these setups. This kind of overview can save time for readers trying to navigate the literature. The section on challenges and future directions is straightforward and points to real issues like interoperability and benchmarking.

What is actually new is the multi-dimensional taxonomy based on component roles and orchestration strategies. It offers a way to categorize these systems that synthesizes ideas from separate lines of work. The definition of CAIS as integrations to overcome standalone LLM limits is clear and sets up the rest of the discussion.

On the softer side, the taxonomy could use more rigor. It stays at a conceptual level without providing detailed classification rules or applying it systematically to many examples. The analysis of trade-offs is good but doesn't include much quantitative comparison or discussion of why some integrations succeed where others don't. As a survey it draws from existing papers, so the value is in the organization rather than new findings.

This paper is aimed at AI researchers and practitioners who are building or studying systems that integrate multiple AI components. A reader new to the area or looking for research directions would find it helpful. It deserves a serious referee because good surveys help standardize terminology in fast-growing fields like this one. My recommendation is to send it for peer review, with feedback focused on making the taxonomy more precise and perhaps adding more critical evaluation of the reviewed systems.

Referee Report

0 major / 3 minor

Summary. The manuscript surveys Compound AI Systems (CAIS), defining them as integrations of LLMs with external components (retrievers, agents, tools, orchestrators) to overcome standalone LLM limitations in memory, reasoning, real-time grounding, and multimodal tasks. It proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four paradigms (RAG, LLM Agents, MLLMs, Orchestration), analyzes representative systems and trade-offs, summarizes evaluation methodologies, and outlines challenges including scalability, interoperability, benchmarking, and coordination.

Significance. If the taxonomy and synthesis hold, the work provides a useful organizational framework for an emerging, fragmented area of system-level AI. It synthesizes literature across paradigms without advancing unsubstantiated mathematical claims and highlights actionable open problems for researchers and practitioners.

minor comments (3)

The abstract states that the CAIS landscape 'lacks a unified framework' and that the proposed taxonomy addresses this, but the manuscript should include an explicit comparison (e.g., in the introduction or related-work section) to prior surveys on RAG, agents, or multimodal systems to substantiate the novelty of the multi-dimensional taxonomy.
The four paradigms are listed without an accompanying table or diagram that maps each to the taxonomy dimensions (component roles and orchestration strategies); adding such a summary table would improve clarity and allow readers to see the taxonomy in action.
Evaluation methodologies are summarized across paradigms, but the manuscript would benefit from a brief discussion of common pitfalls (e.g., contamination in RAG benchmarks or agent trajectory evaluation) to strengthen the practical guidance.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive and positive review of our survey on Compound AI Systems. The summary accurately reflects the scope, taxonomy, and contributions of the manuscript. We appreciate the recommendation for minor revision and will incorporate improvements to enhance clarity and completeness.

Circularity Check

0 steps flagged

No significant circularity

This survey paper defines CAIS as integrations of LLMs with external components to address capability gaps and proposes a multi-dimensional taxonomy based on component roles and orchestration strategies. It catalogs four paradigms (RAG, LLM Agents, MLLMs, Orchestration) by reviewing representative systems, trade-offs, and evaluations drawn from external prior literature. No equations, fitted parameters, predictions, or derivations appear; the central organizational claim that the field lacks a unified framework is addressed directly by the survey's synthesis rather than reducing to any self-referential input, self-citation chain, or ansatz. The argument remains self-contained as a literature review without load-bearing steps that collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that synthesizes existing concepts from the literature on AI systems rather than introducing new mathematical derivations, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5735 in / 1058 out tokens · 67245 ms · 2026-05-19T11:45:59.245336+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We define CAIS as modular and extensible architectures that integrate LLMs with specialized external components... propose a multi-dimensional taxonomy based on component roles and orchestration strategies"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "analyze four foundational paradigms: Retrieval-Augmented Generation (RAG), LLM Agents, Multimodal LLMs (MLLMs), and Orchestration"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uncertainty Propagation in LLM-Based Systems
cs.SE 2026-04 unverdicted novelty 7.0

This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

A Survey of Context Engineering for Large Language Models
cs.CL 2025-07 accept novelty 4.0

The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

Reference graph

Works this paper leans on

232 extracted references · 232 canonical work pages · cited by 2 Pith papers · 57 internal anchors

[1]

Arkadeep Acharya, Brijraj Singh, and Naoyuki Onoe. 2023. LLM Based Generation of Item-Description for Recommendation System. InProceedings of the 17th ACM Conference on Recommender Systems (Singapore, Singapore) (RecSys ’23). Association for Computing Machinery, New York, NY, USA, 1204–1207. https://doi.org/10.1145/3604915.3610647

work page doi:10.1145/3604915.3610647 2023
[2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022), 23716–23736

work page 2022
[4]

Anthropic. 2024. Model Context Protocol. https://www.anthropic.com/news/model-context-protocol. Accessed: 2025-03-29

work page 2024
[5]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136 (2023)

work page arXiv 2023
[7]

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[8]

Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilović, et al. 2019. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development 63, 4/5 (2019), 4–1

work page 2019
[9]

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing . 1533–1544

work page 2013
[10]

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In ICML, Vol. 2. 4

work page 2021
[11]

Multimodal Blog. 2024. How to Chunk Documents for Retrieval-Augmented Generation (RAG) . https://www.multimodal.dev/post/how-to-chunk- documents-for-rag?utm_source=chatgpt.com Accessed: 2024-12-14

work page 2024
[12]

Bloomberg Intelligence. 2023. Generative AI to Become a $1.3 Trillion Market by 2032, Research Finds. https://www.bloomberg.com/company/ press/generative-ai-to-become-a-1-3-trillion-market-by-2032-research-finds/. Accessed: 2025-05-30

work page 2023
[13]

Daniil A Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332 (2023)

work page internal anchor Pith review arXiv 2023
[14]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

work page 2020
[15]

Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. 2024. Honeybee: Locality-enhanced projector for multimodal llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 13817–13827

work page 2024
[16]

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. 2024. Allava: Harnessing gpt4v-synthesized data for lite vision-language models. arXiv preprint arXiv:2402.11684 (2024)

work page internal anchor Pith review arXiv 2024
[18]

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2025. Sharegpt4v: Improving large multi-modal models with better captions. In European Conference on Computer Vision . Springer, 370–387

work page 2025
[19]

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al

work page
[20]

IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1505–1518

Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1505–1518

work page 2022
[21]

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. 2022. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491 (2022). Manuscript submitted to ACM 28 Chen et al

work page arXiv 2022
[22]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al . 2023. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848 2, 4 (2023), 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Xiang Chen, Lei Li, Ningyu Zhang, Xiaozhuan Liang, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2022. Decoupling knowledge from memorization: Retrieval-augmented prompt learning. Advances in Neural Information Processing Systems 35 (2022), 23908–23922

work page 2022
[24]

Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, and Rui Yan. 2024. Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems 36 (2024)

work page 2024
[25]

Wikipedia contributors. 2024. Claude (language model). https://en.wikipedia.org/wiki/Claude_(language_model). Accessed: 2024-12-24

work page 2024
[26]

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M Voorhees, and Ian Soboroff. 2021. TREC deep learning track: Reusable test collections in the large data regime. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 2369–2375

work page 2021
[27]

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval . 719–729

work page 2024
[28]

Hongwei Cui, Yuyang Du, Qun Yang, Yulin Shao, and Soung Chang Liew. 2024. Llmind: Orchestrating ai and iot with llm for complex task execution. IEEE Communications Magazine (2024)

work page 2024
[29]

Zhuyun Dai, Vincent Y Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B Hall, and Ming-Wei Chang. 2022. Promptagator: Few-shot dense retrieval from 8 examples. arXiv preprint arXiv:2209.11755 (2022)

work page arXiv 2022
[30]

Irene de Zarzà, Joachim de Curtò, Gemma Roig, and Carlos T Calafate. 2023. Llm adaptive pid control for b5g truck platooning systems. Sensors 23, 13 (2023), 5899

work page 2023
[31]

deepset.ai Team. 2024. Haystack GitHub Repository. https://github.com/deepset-ai/haystack. Accessed: 2024-12-24

work page 2024
[32]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36 (2023), 28091–28114

work page 2023
[33]

DataStax Documentation. 2024. Introduction to Indexing in Retrieval-Augmented Generation (RAG) . https://docs.datastax.com/en/ragstack/intro-to- rag/indexing.html?utm_source=chatgpt.com Accessed: 2024-12-14

work page 2024
[34]

Jian Dong, Wei Bao, Xiaoqi Cao, Yang Xu, Yuze Yang, Binbin Li, Qi Zhang, and Heng Ye. 2025. AISBench: an performance benchmark for AI server systems. The Journal of Supercomputing 81, 2 (2025), 1–24

work page 2025
[35]

Alexey Dosovitskiy. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[36]

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning

work page 2023
[38]

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[39]

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations . 150–158

work page 2024
[41]

Jonathan Evertz, Merlin Chlosta, Lea Schönherr, and Thorsten Eisenhofer. 2024. Whispers in the Machine: Confidentiality in LLM-integrated Systems. arXiv preprint arXiv:2402.06922 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. 2019. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[43]

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. 2022. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems 35 (2022), 18343–18362

work page 2022
[44]

Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards retrieval-augmented large language models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining . 6491–6501

work page 2024
[45]

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. 2025. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review. arXiv preprint arXiv:2504.19678 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Xu Wen, Rui Ren, Chen Zheng, Xiwen He, Hainan Ye, et al. 2018. Bigdatabench: A scalable and unified big data and ai benchmark suite. arXiv preprint arXiv:1802.08254 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[47]

Manas Gaur, Kalpa Gunaratna, Vijay Srinivasan, and Hongxia Jin. 2022. Iseeq: Information seeking question generation using dynamic meta- information retrieval and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 36. 10672–10680

work page 2022
[48]

Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. 2023. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041 (2023). Manuscript submitted to ACM From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems 29

work page arXiv 2023
[49]

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 9 (2021), 346–361

work page 2021
[50]

Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. 2024. Agentquest: A modular benchmark framework to measure progress and improve llm agents. arXiv preprint arXiv:2404.06411 (2024)

work page arXiv 2024
[51]

Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Rajaram Naik, Pengshan Cai, and Alfio Gliozzo. 2022. Re2G: Retrieve, rerank, generate. arXiv preprint arXiv:2207.06300 (2022)

work page arXiv 2022
[52]

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790 (2023)

work page arXiv 2023
[53]

Anirudh Goyal, Abram Friesen, Andrea Banino, Theophane Weber, Nan Rosemary Ke, Adria Puigdomenech Badia, Arthur Guez, Mehdi Mirza, Peter C Humphreys, Ksenia Konyushova, et al. 2022. Retrieval-augmented reinforcement learning. In International Conference on Machine Learning . PMLR, 7740–7765

work page 2022
[54]

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Hiroaki Hayashi, Prashant Budania, Peng Wang, Chris Ackerson, Raj Neervannan, and Graham Neubig. 2021. Wikiasp: A dataset for multi-domain aspect-based summarization. Transactions of the Association for Computational Linguistics 9 (2021), 211–225

work page 2021
[56]

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[57]

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 3, 4 (2023), 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. 2024. Bliva: A simple multimodal llm for better handling of text-rich visual questions. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 38. 2256–2264

work page 2024
[59]

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, et al. 2024. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507 (2024)

work page arXiv 2024
[60]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, and Alireza Fathi. 2023. Reveal: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 23369–23379. https://doi.org/10.1109/CVPR5272...

work page doi:10.1109/cvpr52729.2023.02238 2023
[61]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems (2023)

work page 2023
[62]

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al

work page
[63]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[64]

Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[65]

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research 24, 251 (2023), 1–43

work page 2023
[66]

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983 (2023)

work page arXiv 2023
[67]

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017)

[68]

Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. 2024. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. Advances in Neural Information Processing Systems 36 (2024)

[69]

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, Kentaro Inui, et al. 2023. Realtime qa: What’s the answer right now? Advances in neural information processing systems 36 (2023), 49025–49043

[70]

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024 (2022)

[71]

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. Language models can solve computer tasks. Advances in Neural Information Processing Systems 36 (2023), 39648–39677

[72]

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. Ocr-free document understanding transformer. In European Conference on Computer Vision. Springer, 498–517

[73]

Diederik P Kingma, Max Welling, et al. 2019. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning 12, 4 (2019), 307–392

[74]

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics 6 (2018), 317–328

[75]

Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. 2024. Generating images with multimodal language models. Advances in Neural Information Processing Systems 36 (2024)

[76]

Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. 2023. Grounding language models to images for multimodal inputs and outputs. In International Conference on Machine Learning. PMLR, 17283–17300

[77]

Roman Koshkin, Katsuhito Sudoh, and Satoshi Nakamura. 2024. Transllama: Llm-based simultaneous translation system. arXiv preprint arXiv:2402.04636 (2024)

[78]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466

[79]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

[80]

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115 (2022)

