pith. machine review for the scientific record.

arxiv: 2507.13334 · v2 · submitted 2025-07-17 · 💻 cs.CL

Recognition: no theorem link

A Survey of Context Engineering for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords context engineering · large language models · retrieval-augmented generation · context management · multi-agent systems · LLM capabilities · prompt optimization · research survey

The pith

Context engineering optimizes inputs for LLMs but exposes their weakness in producing sophisticated long-form outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey defines Context Engineering as the systematic optimization of contextual information for large language models, going beyond basic prompting to include retrieval, processing, and management. It breaks the field into foundational components and shows how they integrate into larger systems such as retrieval-augmented generation, memory architectures, tool use, and multi-agent setups. By reviewing over 1400 papers, the authors identify a core asymmetry: models handle complex contexts effectively when inputs are engineered well, yet they remain limited when asked to produce equally complex long-form outputs. This matters because LLM performance depends directly on the quality of supplied context, and the output gap restricts applications that require extended reasoning or generation. The survey frames closing this gap as a priority for advancing context-aware AI.

Core claim

The paper establishes Context Engineering as a formal discipline that optimizes information payloads for LLMs. It decomposes the discipline into components of context retrieval and generation, processing, and management, then shows their architectural integration in retrieval-augmented generation, memory systems, tool-integrated reasoning, and multi-agent systems. Analysis of the literature reveals a fundamental asymmetry in model capabilities: current LLMs, when augmented by advanced context engineering, show strong proficiency at understanding complex contexts but exhibit pronounced limitations in generating equally sophisticated long-form outputs.

What carries the argument

The definition of Context Engineering as the systematic optimization of information payloads, organized through a taxonomy of retrieval/generation, processing, and management components that integrate into RAG, memory, and multi-agent architectures; it is this taxonomy-wide reading of the literature that surfaces the understanding-generation asymmetry.
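
The three-component decomposition composes naturally into a pipeline: retrieve relevant material, process it to fit a budget, manage a bounded window across turns, then assemble the payload. A minimal sketch with toy stand-ins for each stage (every function name and scoring rule here is illustrative, not from the paper):

```python
import re

# Illustrative sketch of the survey's three-component decomposition.
# All names and heuristics are hypothetical, not APIs from the paper.

def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, corpus):
    """Context retrieval: rank passages by a toy relevance score."""
    return sorted(corpus, key=lambda doc: len(words(query) & words(doc)),
                  reverse=True)[:3]

def process(passages, budget=200):
    """Context processing: truncate the joined passages to a word budget."""
    return " ".join(" ".join(passages).split()[:budget])

def manage(history, new_context, max_turns=5):
    """Context management: keep a bounded window of prior context."""
    history.append(new_context)
    return history[-max_turns:]

def build_payload(query, corpus, history):
    """Integration: assemble the final information payload for the LLM."""
    window = manage(history, process(retrieve(query, corpus)))
    return "\n".join(window) + f"\n\nQuestion: {query}"

corpus = ["RAG augments generation with retrieval.",
          "Memory systems persist context across turns.",
          "Multi-agent systems share context between agents."]
payload = build_payload("How does retrieval help generation?", corpus, [])
print(payload)
```

Each stage maps to one branch of the taxonomy; the system-level architectures the survey covers (RAG, memory, multi-agent) differ mainly in which stage they elaborate.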

If this is right

  • Techniques in retrieval-augmented generation and memory systems can reliably improve model handling of complex inputs.
  • Multi-agent and tool-integrated systems gain reliability when context management coordinates information across agents.
  • Future model development should target the generation side of the asymmetry to enable longer, more coherent outputs.
  • The proposed taxonomy supplies a shared structure for designing new context-aware applications.
  • Addressing the gap would expand practical uses of LLMs in tasks requiring extended creative or analytical output.
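
If context management is what coordinates information across agents, its simplest realization is a bounded shared scratchpad that every agent reads and writes. A hypothetical sketch, not an architecture from the paper:

```python
# Hypothetical shared-scratchpad sketch of cross-agent context
# management; illustrative only, not a design from the survey.

class SharedContext:
    """A bounded scratchpad mediating what each agent sees."""
    def __init__(self, max_items=4):
        self.items = []
        self.max_items = max_items

    def post(self, agent, note):
        self.items.append((agent, note))
        self.items = self.items[-self.max_items:]  # keep a bounded window

    def view(self):
        return "\n".join(f"[{agent}] {note}" for agent, note in self.items)

ctx = SharedContext()
ctx.post("planner", "Goal: summarize the report.")
ctx.post("retriever", "Found 3 relevant sections.")
ctx.post("writer", "Drafting summary from sections 1-3.")
print(ctx.view())
```

The bounded window is the load-bearing detail: without it, shared context grows until it crowds out the task, which is exactly the failure mode context management is meant to prevent.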

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training objectives may need explicit weighting toward output generation quality rather than input comprehension alone.
  • Direct benchmarks that score input understanding depth against output sophistication could quantify the asymmetry more precisely.
  • The asymmetry may extend to multimodal settings, where models process rich inputs but struggle to generate detailed outputs.
  • Hybrid workflows could route generation tasks to humans or specialized modules while models manage context.

Load-bearing premise

The claim that the asymmetry is the defining research priority assumes the authors' selection and interpretation of over 1400 papers accurately reflects the full state of the field without bias.

What would settle it

A controlled comparison showing that current LLMs, equipped with the best context engineering techniques, produce long-form outputs whose complexity and structure match the depth of their input understanding would falsify the asymmetry claim.
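
One way to operationalize that comparison is a paired evaluation: score the same items for input understanding and for output sophistication, then report the difference. A sketch with placeholder scores (`score_understanding` and `score_generation` stand in for real metrics such as long-context QA accuracy and discourse-structure measures; the numbers below are invented):

```python
# Skeleton of a paired asymmetry test. Scores are placeholders for
# real benchmark metrics; nothing here comes from the paper.

def asymmetry_gap(items, score_understanding, score_generation):
    """Mean understanding score minus mean generation score.
    A gap near zero on matched items would undercut the asymmetry claim."""
    u = [score_understanding(x) for x in items]
    g = [score_generation(x) for x in items]
    return sum(u) / len(u) - sum(g) / len(g)

# Toy stand-ins: pretend per-item scores were already computed.
items = [{"u": 0.92, "g": 0.61},
         {"u": 0.88, "g": 0.55},
         {"u": 0.95, "g": 0.70}]
gap = asymmetry_gap(items, lambda x: x["u"], lambda x: x["g"])
print(round(gap, 3))  # positive gap = understanding outpaces generation
```

The falsification condition above corresponds to this gap vanishing under the best available context engineering.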

read the original abstract

The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Context Engineering as a formal discipline for optimizing contextual information payloads for LLMs, extending beyond prompt engineering. It presents a taxonomy of foundational components (context retrieval and generation, processing, and management) and their integrations into architectures such as RAG, memory systems, tool-integrated reasoning, and multi-agent systems. Drawing on a synthesis of over 1400 papers, the central claim is that LLMs show strong proficiency in understanding complex contexts but pronounced limitations in generating sophisticated long-form outputs, making this asymmetry a defining priority for future research.

Significance. If the taxonomy is comprehensive and the asymmetry observation is representative of the literature, the survey supplies a unified technical roadmap that can orient both researchers and practitioners working on context-aware AI systems. The explicit identification of the understanding-generation gap, grounded in the reviewed body of work, offers a clear focal point for subsequent efforts to balance LLM capabilities.

major comments (2)
  1. [research gap / concluding section] The section discussing the research gap and future priorities: the claim that current models exhibit 'pronounced limitations in generating equally sophisticated, long-form outputs' is presented as emerging directly from the literature synthesis, yet the manuscript does not aggregate or cite specific quantitative benchmarks (e.g., performance deltas on long-form generation tasks versus context-understanding tasks) that would make the asymmetry diagnosis more concrete and testable.
  2. [introduction / methodology overview] The description of the paper-selection process (implicit in the >1400-paper claim): without an explicit statement of search strategy, inclusion/exclusion criteria, or coverage across sub-areas (e.g., proportion of papers on generation versus retrieval), the representativeness of the synthesis—and therefore the robustness of the asymmetry diagnosis—remains difficult to evaluate.
minor comments (3)
  1. [taxonomy section] The taxonomy diagram (if present) or its textual description would benefit from explicit labels or arrows clarifying the data-flow relationships among the three foundational components and the three system-level integrations.
  2. [throughout] Terminology consistency: 'context retrieval and generation' is sometimes written with a slash and sometimes as separate items; standardize phrasing across sections for readability.
  3. [references] A small number of citations appear to be repeated or listed without distinguishing primary from secondary sources; a brief note on citation selection criteria would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey. We agree that both points raised can strengthen the manuscript and will incorporate revisions to address them explicitly. The changes will enhance transparency without altering the core contributions or taxonomy.

read point-by-point responses
  1. Referee: [research gap / concluding section] The section discussing the research gap and future priorities: the claim that current models exhibit 'pronounced limitations in generating equally sophisticated, long-form outputs' is presented as emerging directly from the literature synthesis, yet the manuscript does not aggregate or cite specific quantitative benchmarks (e.g., performance deltas on long-form generation tasks versus context-understanding tasks) that would make the asymmetry diagnosis more concrete and testable.

    Authors: We appreciate this suggestion. The asymmetry observation synthesizes patterns across the reviewed works, where context-understanding benchmarks (e.g., long-context QA and retrieval) consistently show high performance while long-form generation tasks reveal coherence and consistency challenges. In revision, we will expand the concluding section with targeted citations to representative benchmarks, including performance deltas from papers on Needle-in-a-Haystack tests versus long-form writing or summarization evaluations. This will make the claim more concrete and testable while remaining within survey scope. revision: yes
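
The Needle-in-a-Haystack protocol the response cites follows a simple recipe: plant a fact at a known depth in filler text and test whether the model recalls it. A minimal sketch with a toy stand-in for the model call (the harness, not the model, is the point):

```python
# Toy Needle-in-a-Haystack harness. toy_model is a placeholder for
# an LLM call; a real run would sweep depth and context length and
# chart recall accuracy.

def build_haystack(needle, filler_sentences, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def toy_model(context, question):
    """Stand-in for an LLM: succeeds iff the needle survives in context."""
    return "the magic number is 7" in context

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "Remember: the magic number is 7."
results = {d: toy_model(build_haystack(needle, filler, d),
                        "What is the magic number?")
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
print(results)
```

Contrasting near-ceiling recall on this kind of test with scores on long-form writing benchmarks is the "performance delta" the referee asks the authors to aggregate.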

  2. Referee: [introduction / methodology overview] The description of the paper-selection process (implicit in the >1400-paper claim): without an explicit statement of search strategy, inclusion/exclusion criteria, or coverage across sub-areas (e.g., proportion of papers on generation versus retrieval), the representativeness of the synthesis—and therefore the robustness of the asymmetry diagnosis—remains difficult to evaluate.

    Authors: We agree that an explicit methodology description will improve evaluability. We will add a dedicated 'Survey Methodology' subsection (or appendix) detailing the search strategy (keywords across arXiv, ACL, and NeurIPS from 2018–2024), inclusion/exclusion criteria (peer-reviewed empirical or technical papers on context techniques), and approximate coverage breakdowns by category (retrieval, processing, management, and integrated systems). This addition will directly support assessment of the synthesis. revision: yes
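
The promised methodology subsection amounts to a reproducible filter over candidate papers. A hedged sketch of what those criteria could look like in code (field names and values are illustrative, taken only from the rebuttal's own description):

```python
from collections import Counter

# Illustrative inclusion filter mirroring the criteria the authors
# list: 2018-2024, target venues, context-technique focus. All field
# names and the candidate records are hypothetical.

def include(paper):
    venues = {"arXiv", "ACL", "NeurIPS"}
    topical = {"retrieval", "processing", "management", "integrated systems"}
    return (2018 <= paper["year"] <= 2024
            and paper["venue"] in venues
            and paper["category"] in topical)

candidates = [
    {"year": 2023, "venue": "ACL", "category": "retrieval"},
    {"year": 2016, "venue": "ACL", "category": "retrieval"},   # too early
    {"year": 2024, "venue": "NeurIPS", "category": "management"},
]
kept = [p for p in candidates if include(p)]
print(Counter(p["category"] for p in kept))
```

Publishing the filter alongside category counts would let readers check whether generation-side work is underrepresented relative to retrieval, which is the referee's representativeness concern.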

Circularity Check

0 steps flagged

No significant circularity

full rationale

This survey synthesizes over 1400 prior works into a taxonomy of context retrieval, processing, management, and system integrations such as RAG and multi-agent setups. The claimed asymmetry between strong context understanding and weaker long-form generation is presented as an observational conclusion from that literature review, with no equations, fitted parameters, formal derivations, or predictions that reduce to the paper's own inputs by construction. All load-bearing steps are descriptive summaries of external research; no self-citation chains or ansatzes are invoked to force the central gap diagnosis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because this is a survey, the central claims rest on the completeness and representativeness of the reviewed literature; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5546 in / 1179 out tokens · 34343 ms · 2026-05-13T20:54:26.041174+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  3. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  4. From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.

  5. Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair

    cs.AR 2026-04 unverdicted novelty 7.0

    Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.

  6. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

  7. PREPING: Building Agent Memory without Tasks

    cs.AI 2026-05 unverdicted novelty 6.0

    Preping builds agent memory via proposer-guided synthetic practice and selective validation, matching offline/online methods at 2-3x lower deployment cost.

  8. S^2tory: Story Spine Distillation for Movie Script Summarization

    cs.CL 2026-05 unverdicted novelty 6.0

    S^2tory uses narratological theory and a Narrative Expert Agent to identify plot nuclei in movie scripts for high-fidelity summarization at 3.5x compression, with strong zero-shot generalization to books.

  9. CL-bench Life: Can Language Models Learn from Real-Life Context?

    cs.CL 2026-04 unverdicted novelty 6.0

    CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

  10. From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

    cs.CR 2026-04 unverdicted novelty 6.0

    Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...

  11. AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.

  12. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  13. Contexty: Capturing and Organizing In-situ Thoughts for Context-Aware AI Support

    cs.HC 2026-04 unverdicted novelty 6.0

    Contexty captures users' cognitive traces as editable snippets and organizes them to enable more effective, user-controlled context-aware AI collaboration during complex tasks.

  14. VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

    cs.CV 2026-04 unverdicted novelty 6.0

    VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

  15. Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A small recency window of 3-5 prior ADRs as context produces higher-fidelity LLM-generated Architecture Decision Records than no context, full history, or retrieval-augmented selection in typical sequential workflows.

  16. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  17. ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop

    cs.CV 2026-04 unverdicted novelty 6.0

    ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.

  18. Reflective Context Learning: Studying the Optimization Primitives of Context Space

    cs.LG 2026-04 unverdicted novelty 6.0

    Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...

  19. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 5.0

    Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...

  20. VIP-COP: Context Optimization for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...

  21. Towards Agentic Investigation of Security Alerts

    cs.CR 2026-04 unverdicted novelty 5.0

    An agentic LLM workflow with overview queries, query selection, evidence extraction, and verdict generation achieves significantly higher accuracy on security alert investigation than direct LLM use.

  22. Human-Inspired Context-Selective Multimodal Memory for Social Robots

    cs.AI 2026-04 unverdicted novelty 5.0

    A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

  23. CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

    cs.CL 2026-04 unverdicted novelty 5.0

    CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factua...

  24. Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...

  25. Context Collapse: Barriers to Adoption for Generative AI in Workplace Settings

    cs.CY 2026-04 unverdicted novelty 5.0

    Expert interviews demonstrate that context in generative AI workplace use collapses or rots over time, limiting tool effectiveness and revealing pitfalls in computational context approaches.

  26. Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants

    cs.SE 2026-04 unverdicted novelty 4.0

    Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.

Reference graph

Works this paper leans on

292 extracted references · 292 canonical work pages · cited by 25 Pith papers · 12 internal anchors

  1. [1]

    https:// agent-network-protocol.com/specs/communication.html

    Anp-agent communication meta-protocol specification(draft). https:// agent-network-protocol.com/specs/communication.html. [Online; accessed 17- July-2025]

  2. [2]

    S. A. Automating human evaluation of dialogue systems.North American Chapter of the Association for Computational Linguistics, 2022

  3. [3]

    Qaraqe, and E

    Samir Abdaljalil, Hasan Kurban, Khalid A. Qaraqe, and E. Serpedin. Theorem-of-thought: A multi- agent framework for abductive, deductive, and inductive reasoning in language models. arXiv preprint, 2025

  4. [4]

    Rankify: A comprehensive python toolkit for retrieval, re-ranking, and retrieval-augmented genera- tion, arXiv preprint arXiv:2502.02464, 2025

    Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. Rankify: A comprehensive python toolkit for retrieval, re-ranking, and retrieval-augmented genera- tion, arXiv preprint arXiv:2502.02464, 2025. URLhttps://arxiv.org/abs/2502.02464v3

  5. [5]

    Bhargav, M

    Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matt Stallone, Rameswar Panda, Yara Rizk, G. Bhargav, M. Crouse, Chulaka Gunasekara, S. Ikbal, Sachin Joshi, Hima P. Karanam, Vineet Kumar, Asim Munawar, S. Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, P. Venkateswaran, Merve Unuvar, David Cox, S. Roukos, Luis A...

  6. [6]

    Acharya, Karthigeyan Kuppan, and Divya Bhaskaracharya

    D. Acharya, Karthigeyan Kuppan, and Divya Bhaskaracharya. Agentic ai: Autonomous intelligence for complex goals—a comprehensive survey.IEEE Access, 2025

  7. [7]

    Tallyqa: Answering complex counting questions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. AAAI Conference on Artificial Intelligence, 2018

  8. [8]

    Star attention: Efficient llm inference over long sequences, arXiv preprint arXiv:2411.17116, 2024

    Shantanu Acharya, Fei Jia, and Boris Ginsburg. Star attention: Efficient llm inference over long sequences, arXiv preprint arXiv:2411.17116, 2024. URLhttps://arxiv.org/abs/2411. 17116v3. 59

  9. [9]

    Preprint, arXiv:2502.08820

    Emre Can Acikgoz, Jeremy Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tur, and Gokhan Tur. Can a single model master both multi-turn conversations and tool use? coalm: A unified conversational agentic language model, arXiv preprint arXiv:2502.08820, 2025. URLhttps://arxiv.org/abs/2502.08820v3

  10. [10]

    A desideratum for conversational agents: Capabilities, challenges, and future directions, arXiv preprint arXiv:2504.16939, 2025

    Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani- Tur, and Gokhan Tur. A desideratum for conversational agents: Capabilities, challenges, and future directions, arXiv preprint arXiv:2504.16939, 2025. URLhttps://arxiv.org/abs/2504. 16939v1

  11. [11]

    Anum Afzal, Juraj Vladika, Gentrit Fazlija, Andrei Staradubets, and Florian Matthes. Towards opti- mizing a retrieval augmented generation using large language model on academic data.International Conference on Natural Language Processing and Information Retrieval, 2024

  12. [12]

    Azad, and P

    Ankush Agarwal, Sakharam Gawade, A. Azad, and P. Bhattacharyya. Kitlm: Domain-specific knowledge integration into language models for question answering.ICON, 2023

  13. [13]

    Large scale knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training

    Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Large scale knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. arXiv preprint, 2020

  14. [14]

    Hegselmann, Hunter Lang, Yoon Kim, and D

    Monica Agrawal, S. Hegselmann, Hunter Lang, Yoon Kim, and D. Sontag. Large language models are few-shot clinical information extractors.Conference on Empirical Methods in Natural Language Processing, 2022

  15. [15]

    Mcp bridge: A lightweight, llm-agnostic restful proxy for model context protocol servers.arXiv preprint arXiv:2504.08999,

    Arash Ahmadi, S. Sharif, and Yaser Mohammadi Banadaki. Mcp bridge: A lightweight, llm-agnostic restful proxy for model context protocol servers, arXiv preprint arXiv:2504.08999, 2025. URL https://arxiv.org/abs/2504.08999v1

  16. [16]

    Ainslie, J

    J. Ainslie, J. Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr’on, and Sumit K. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.Conference on Empirical Methods in Natural Language Processing, 2023

  17. [17]

    Multi-agent system concepts theory and application phases

    Adel Al-Jumaily. Multi-agent system concepts theory and application phases. arXiv preprint, 2006

  18. [18]

    Position interpolation improves alibi extrapolation

    Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, and Joel Hestness. Position interpolation improves alibi extrapolation. arXiv preprint, 2023

  19. [19]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, A. Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricar...

  20. [20]

    Albrecht and P

    Stefano V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems.Artificial Intelligence, 2017

  21. [21]

    Understanding the Challenges and Opportunities of Generative AI Apps: An Empirical Study

    Buthayna AlMulla, Maram Assi, and Safwat Hassan. Understanding the challenges and promises of developing generative ai apps: An empirical study, arXiv preprint arXiv:2506.16453, 2025. URL https://arxiv.org/abs/2506.16453v2. 60

  22. [22]

    Alsuhaibani, Christian D

    Reem S. Alsuhaibani, Christian D. Newman, M. J. Decker, Michael L. Collard, and Jonathan I. Maletic. On the naming of methods: A survey of professional developers.International Conference on Software Engineering, 2021

  23. [23]

    Giorgini, A

    Francesco Alzetta, P. Giorgini, A. Najjar, M. Schumacher, and Davide Calvaresi. In-time explainability in multi-agent systems: Challenges, opportunities, and roadmap.EXTRAAMAS@AAMAS, 2020

  24. [24]

    Lüth, Paul F

    Kenza Amara, Lukas Klein, Carsten T. Lüth, Paul F. Jäger, Hendrik Strobelt, and Mennatallah El- Assady. Why context matters in vqa and reasoning: Semantic interventions for vlm input modalities, arXiv preprint arXiv:2410.01690v1, 2024. URLhttps://arxiv.org/abs/2410.01690v1

  25. [25]

    Prompt design and engineering: Introduction and advanced methods, arXiv preprint arXiv:2401.14423, 2024

    Xavier Amatriain. Prompt design and engineering: Introduction and advanced methods, arXiv preprint arXiv:2401.14423, 2024. URLhttps://arxiv.org/abs/2401.14423v4

  26. [26]

    Dawn: Designing distributed agents in a worldwide network, arXiv preprint arXiv:2410.22339, 2024

    Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, and Mahesh Viswanathan. Dawn: Designing distributed agents in a worldwide network, arXiv preprint arXiv:2410.22339, 2024. URL https://arxiv.org/abs/2410.22339v3

  27. [27]

    Why does the effective context length of llms fall short?International Conference on Learning Representations, 2024

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short?International Conference on Learning Representations, 2024

  28. [28]

    Thread: A logic-based data organization paradigm for how-to question answering with retrieval augmented generation.arXiv preprint arXiv:2406.13372, 2024

    Kaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Shuzheng Si, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, et al. Thread: A logic-based data organization paradigm for how-to question answering with retrieval augmented generation.arXiv preprint arXiv:2406.13372, 2024

  29. [29]

    Nissist: An incident mitigation copilot based on troubleshooting guides

    Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, et al. Nissist: An incident mitigation copilot based on troubleshooting guides. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2024), pages 4471–4474, 2024

  30. [30]

    Ultraif: Advancing instruction following from the wild

    Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, and Baobao Chang. Ultraif: Advancing instruction following from the wild. pages 7930–7957, 2025

  31. [31]

    Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, and Paul Hongsuck Seo. Lcirc: A recurrent compression approach for efficient long-form context and query dependent modeling in llms.North American Chapter of the Association for Computational Linguistics, 2025

  32. [32]

    Dynamic context pruning for efficient and interpretable autoregressive transformers.Neural Information Processing Systems, 2023

    Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurélien Lucchi, and Thomas Hof- mann. Dynamic context pruning for efficient and interpretable autoregressive transformers.Neural Information Processing Systems, 2023

  33. [33]

    Anderson, M

    John R. Anderson, M. Matessa, and C. Lebiere. Act-r: A theory of higher level cognition and its relation to visual attention.Hum. Comput. Interact., 1997

  34. [34]

    Language models as agent models.Conference on Empirical Methods in Natural Language Processing, 2022

    Jacob Andreas. Language models as agent models.Conference on Empirical Methods in Natural Language Processing, 2022

  35. [35]

    Baldoni, and Leonardo Querzoni

    Leonardo Aniello, R. Baldoni, and Leonardo Querzoni. Adaptive online scheduling in storm.Dis- tributed Event-Based Systems, 2013. 61

  36. [36]

    Arigraph: Learning knowledge graph world models with episodic memory for llm agents

    Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, M. Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for llm agents, arXiv preprint arXiv:2407.04363, 2024. URLhttps://arxiv.org/abs/2407.04363v3

  37. [37]

    Introducing the model context protocol, November 2024

    Anthropic. Introducing the model context protocol, November 2024. URL https://www. anthropic.com/news/model-context-protocol. [Online; accessed 17-July-2025]

  38. [38]

    RM Aratchige and Dr. Wmks Ilmini. Llms working in harmony: A survey on the technological aspects of building effective llm-based multi agent systems, arXiv preprint arXiv:2504.01963, 2025. URL https://arxiv.org/abs/2504.01963v1

  39. [39]

    Leo Ardon, Daniel Furelos-Blanco, and A. Russo. Learning reward machines in cooperative multi-agent tasks. AAMAS Workshops, 2023

  40. [40]

    K. Armeni, C. Honey, and Tal Linzen. Characterizing verbatim short-term memory in neural language models. Conference on Computational Natural Language Learning, 2022

  41. [41]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. International Conference on Learning Representations, 2023

  42. [42]

    Hikaru Asano, Tadashi Kozuno, and Yukino Baba. Self iterative label refinement via robust unlabeled learning, arXiv preprint arXiv:2502.12565, 2025. URL https://arxiv.org/abs/2502.12565v1

  43. [43]

    Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, and Bing Xiang. Bifurcated attention: Accelerating massively parallel decoding with shared prefixes in llms, arXiv preprint arXiv:2403.08845, 2024. URL https://arxi...

  44. [44]

    Avinash Ayalasomayajula, Rui Guo, Jingbo Zhou, Sujan Kumar Saha, and Farimah Farahmandi. Lasp: Llm assisted security property generation for soc verification. Workshop on Machine Learning for CAD, 2024

  45. [45]

    Simon A. Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. arXiv preprint, 2025

  46. [46]

    Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bozorgpour, A. Kazerouni, I. Rekik, and D. Merhof. Foundational models in medical imaging: A comprehensive survey and future vision, arXiv preprint arXiv:2310.18689, 2023. URL https://arxiv.org/abs/2310.18689v1

  47. [47]

    Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. Transformers for tabular data representation: A survey of models and applications. Transactions of the Association for Computational Linguistics, 2023

  48. [48]

    Jinheon Baek, N. Chandrasekaran, Silviu Cucerzan, Allen Herring, and S. Jauhar. Knowledge-augmented large language models for personalized contextual query suggestion. The Web Conference, 2023

  49. [49]

    Tianyi Bai, Hao Liang, Binwang Wan, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Conghui He, Binhang Yuan, and Wentao Zhang. A survey of multimodal large language model from a data-centric perspective, arXiv preprint arXiv:2405.16640v2, 2024. URL https://arxiv.org/abs/2405.16640v2

  50. [50]

    Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, and Jackie Chi Kit Cheung. Citrus: Chunked instruction-aware state eviction for long sequence modeling. Conference on Empirical Methods in Natural Language Processing, 2024

  51. [51]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  52. [52]

    Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, O. R. Terrades, and Josep Lladós. Globaldoc: A cross-modal vision-language framework for real-world document image retrieval and classification. IEEE Workshop/Winter Conference on Applications of Computer Vision, 2023

  53. [53]

    Jayachandu Bandlamudi, K. Mukherjee, Prerna Agarwal, Sampath Dechu, Siyu Huo, Vatche Isahagian, Vinod Muthusamy, N. Purushothaman, and Renuka Sindhgatta. Towards hybrid automation by bootstrapping conversational interfaces for IT operation tasks. AAAI Conference on Artificial Intelligence, 2023

  54. [54]

    Jayachandu Bandlamudi, Kushal Mukherjee, Prerna Agarwal, Ritwik Chaudhuri, R. Pimplikar, Sampath Dechu, Alex Straley, Anbumunee Ponniah, and Renuka Sindhgatta. Building conversational artifacts to enable digital assistant for apis and rpas. AAAI Conference on Artificial Intelligence, 2024

  55. [55]

    Keqin Bao, Jizhi Zhang, Xinyu Lin, Yang Zhang, Wenjie Wang, and Fuli Feng. Large language models for recommendation: Past, present, and future. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

  56. [56]

    Sara Di Bartolomeo, Giorgio Severi, V. Schetinger, and Cody Dunne. Ask and you shall receive (a graph drawing): Testing chatgpt’s potential to apply graph layout algorithms. Eurographics Conference on Visualization, 2023

  57. [57]

    Saikat Barua. Exploring autonomous agents through the lens of large language models: A review, arXiv preprint arXiv:2404.04442, 2024. URL https://arxiv.org/abs/2404.04442v1

  58. [58]

    Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford, M. Crouse, Kiran Kate, Sadhana Kumaravel, Saurabh Goyal, Asim Munawar, Yara Rizk, Xin Wang, Luis A. Lastras, and P. Kapanipathi. Nestful: A benchmark for evaluating llms on nested sequences of api calls, arXiv preprint arXiv:2409.03797, 2024. URL https://arxiv.org/abs/2409.03797v3

  59. [59]

    Amin Beheshti. Natural language-oriented programming (nlop): Towards democratizing software creation. 2024 IEEE International Conference on Software Services Engineering (SSE), 2024

  60. [60]

    Azadeh Beiranvand and S. M. Vahidipour. Integrating structural and semantic signals in text-attributed graphs with bigtex, arXiv preprint arXiv:2504.12474, 2025. URL https://arxiv.org/abs/2504.12474v2

  61. [61]

    Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba. International Conference on Learning Representations, 2024

  62. [62]

    Assaf Ben-Kish, Itamar Zimerman, M. J. Mirza, James R. Glass, Leonid Karlinsky, and Raja Giryes. Overflow prevention enhances long-context recurrent llms. arXiv preprint, 2025

  63. [63]

    M. Benna and Stefano Fusi. Complex synapses as efficient memory systems. BMC Neuroscience, 2015

  64. [64]

    M. Benna and Stefano Fusi. Computational principles of biological memory, arXiv preprint arXiv:1507.07580, 2015. URL https://arxiv.org/abs/1507.07580v1

  65. [65]

    Shelly Bensal, Umar Jamil, Christopher Bryant, M. Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, and Waseem Alshikh. Reflect, retry, reward: Self-improving llms via reinforcement learning, arXiv preprint arXiv:2505.24726, 2025. URL https://arxiv.org/abs/2505.24726v1

  66. [66]

    Idoia Berges, J. Bermúdez, A. Goñi, and A. Illarramendi. Semantic web technology for agent communication protocols. Extended Semantic Web Conference, 2008

  67. [67]

    Gaurav Beri and Vaishnavi Srivastava. Advanced techniques in prompt engineering for large language models: A comprehensive study. 2024 IEEE 4th International Conference on ICT in Business Industry & Government (ICTBIG), 2024

  68. [68]

    Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-range transformers with unlimited length input. Neural Information Processing Systems, 2023

  69. [69]

    Maciej Besta, Nils Blach, Aleš Kubíček, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, H. Niewiadomski, P. Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. AAAI Conference on Artificial Intelligence, 2023

  70. [70]

    Gregor Betz and Kyle Richardson. Judgment aggregation, discursive dilemma and reflective equilibrium: Neural language models as self-improving doxastic agents. Frontiers in Artificial Intelligence, 2022

  71. [71]

    L. Bezalel, Eyal Orgad, and Amir Globerson. Teaching models to improve on tape. AAAI Conference on Artificial Intelligence, 2024

  72. [72]

    Umang Bhatt, Sanyam Kapoor, Mihir Upadhyay, Ilia Sucholutsky, Francesco Quinzan, Katherine M. Collins, Adrian Weller, Andrew Gordon Wilson, and Muhammad Bilal Zafar. When should we orchestrate multiple agents?, arXiv preprint arXiv:2503.13577, 2025. URL https://arxiv.org/abs/2503.13577v1

  73. [73]

    Baolong Bi, Shaohan Huang, Yiwei Wang, Tianchi Yang, Zihan Zhang, Haizhen Huang, Lingrui Mei, Junfeng Fang, Zehao Li, Furu Wei, et al. Context-dpo: Aligning language models for context-faithfulness. ACL 2025, 2024

  74. [74]

    Baolong Bi, Shenghua Liu, Lingrui Mei, Yiwei Wang, Pengliang Ji, and Xueqi Cheng. Decoding by contrasting knowledge: Enhancing llms’ confidence on edited facts. ACL 2025, 2024

  75. [75]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, and Xueqi Cheng. Lpnl: Scalable link prediction with large language models. ACL 2024, 2024

  76. [76]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Junfeng Fang, and Xueqi Cheng. Struedit: Structured outputs enable the fast and accurate knowledge editing for large language models. 2024

  77. [77]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Yilong Xu, and Xueqi Cheng. Adaptive token biaser: Knowledge editing via biasing key entities. EMNLP 2024, 2024

  78. [78]

    Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, and Xueqi Cheng. Refinex: Learning to refine pre-training data at scale from expert-guided programs. 2025

  79. [79]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Junfeng Fang, Hongcheng Gao, Shiyu Ni, and Xueqi Cheng. Is factuality enhancement a free lunch for llms? better factuality can lead to worse context-faithfulness. ICLR 2025, 2025

  80. [80]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, and Xueqi Cheng. Parameters vs. context: Fine-grained control of knowledge reliance in language models. 2025

Showing first 80 references.