pith. machine review for the scientific record.

arxiv: 2507.13334 · v2 · submitted 2025-07-17 · 💻 cs.CL

Recognition: no theorem link

A Survey of Context Engineering for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords context engineering · large language models · retrieval-augmented generation · context management · multi-agent systems · LLM capabilities · prompt optimization · research survey

The pith

Context engineering optimizes inputs for LLMs but exposes their weakness in producing sophisticated long-form outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey defines Context Engineering as the systematic optimization of contextual information for large language models, going beyond basic prompting to include retrieval, processing, and management. It breaks the field into foundational components and shows how they integrate into larger systems such as retrieval-augmented generation, memory architectures, tool use, and multi-agent setups. By reviewing over 1400 papers, the authors identify a core asymmetry: models handle complex contexts effectively when inputs are engineered well, yet they remain limited when asked to produce equally complex long-form outputs. This matters because LLM performance depends directly on the quality of supplied context, and the output gap restricts applications that require extended reasoning or generation. The survey frames closing this gap as a priority for advancing context-aware AI.

Core claim

The paper establishes Context Engineering as a formal discipline that optimizes information payloads for LLMs. It decomposes the discipline into components of context retrieval and generation, processing, and management, then shows their architectural integration in retrieval-augmented generation, memory systems, tool-integrated reasoning, and multi-agent systems. Analysis of the literature reveals a fundamental asymmetry in model capabilities: current LLMs, when augmented by advanced context engineering, show strong proficiency at understanding complex contexts but exhibit pronounced limitations in generating equally sophisticated long-form outputs.

What carries the argument

The definition of Context Engineering as the systematic optimization of information payloads, organized through a taxonomy of retrieval/generation, processing, and management components that integrate into RAG, memory, and multi-agent architectures; it is this taxonomy-wide reading of the literature that surfaces the understanding-generation asymmetry.
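
The three-component decomposition composes naturally into a pipeline: retrieve relevant material, process it to fit a budget, manage a bounded window across turns, then assemble the payload. A minimal sketch with toy stand-ins for each stage (every function name and scoring rule here is illustrative, not from the paper):

```python
import re

# Illustrative sketch of the survey's three-component decomposition.
# All names and heuristics are hypothetical, not APIs from the paper.

def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, corpus):
    """Context retrieval: rank passages by a toy relevance score."""
    return sorted(corpus, key=lambda doc: len(words(query) & words(doc)),
                  reverse=True)[:3]

def process(passages, budget=200):
    """Context processing: truncate the joined passages to a word budget."""
    return " ".join(" ".join(passages).split()[:budget])

def manage(history, new_context, max_turns=5):
    """Context management: keep a bounded window of prior context."""
    history.append(new_context)
    return history[-max_turns:]

def build_payload(query, corpus, history):
    """Integration: assemble the final information payload for the LLM."""
    window = manage(history, process(retrieve(query, corpus)))
    return "\n".join(window) + f"\n\nQuestion: {query}"

corpus = ["RAG augments generation with retrieval.",
          "Memory systems persist context across turns.",
          "Multi-agent systems share context between agents."]
payload = build_payload("How does retrieval help generation?", corpus, [])
print(payload)
```

Each stage maps to one branch of the taxonomy; the system-level architectures the survey covers (RAG, memory, multi-agent) differ mainly in which stage they elaborate.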

If this is right

  • Techniques in retrieval-augmented generation and memory systems can reliably improve model handling of complex inputs.
  • Multi-agent and tool-integrated systems gain reliability when context management coordinates information across agents.
  • Future model development should target the generation side of the asymmetry to enable longer, more coherent outputs.
  • The proposed taxonomy supplies a shared structure for designing new context-aware applications.
  • Addressing the gap would expand practical uses of LLMs in tasks requiring extended creative or analytical output.
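
If context management is what coordinates information across agents, its simplest realization is a bounded shared scratchpad that every agent reads and writes. A hypothetical sketch, not an architecture from the paper:

```python
# Hypothetical shared-scratchpad sketch of cross-agent context
# management; illustrative only, not a design from the survey.

class SharedContext:
    """A bounded scratchpad mediating what each agent sees."""
    def __init__(self, max_items=4):
        self.items = []
        self.max_items = max_items

    def post(self, agent, note):
        self.items.append((agent, note))
        self.items = self.items[-self.max_items:]  # keep a bounded window

    def view(self):
        return "\n".join(f"[{agent}] {note}" for agent, note in self.items)

ctx = SharedContext()
ctx.post("planner", "Goal: summarize the report.")
ctx.post("retriever", "Found 3 relevant sections.")
ctx.post("writer", "Drafting summary from sections 1-3.")
print(ctx.view())
```

The bounded window is the load-bearing detail: without it, shared context grows until it crowds out the task, which is exactly the failure mode context management is meant to prevent.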

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training objectives may need explicit weighting toward output generation quality rather than input comprehension alone.
  • Direct benchmarks that score input understanding depth against output sophistication could quantify the asymmetry more precisely.
  • The asymmetry may extend to multimodal settings, where models process rich inputs but struggle to generate detailed outputs.
  • Hybrid workflows could route generation tasks to humans or specialized modules while models manage context.

Load-bearing premise

The claim that the asymmetry is the defining research priority assumes the authors' selection and interpretation of over 1400 papers accurately reflects the full state of the field without bias.

What would settle it

A controlled comparison showing that current LLMs, equipped with the best context engineering techniques, produce long-form outputs whose complexity and structure match the depth of their input understanding would falsify the asymmetry claim.
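
One way to operationalize that comparison is a paired evaluation: score the same items for input understanding and for output sophistication, then report the difference. A sketch with placeholder scores (`score_understanding` and `score_generation` stand in for real metrics such as long-context QA accuracy and discourse-structure measures; the numbers below are invented):

```python
# Skeleton of a paired asymmetry test. Scores are placeholders for
# real benchmark metrics; nothing here comes from the paper.

def asymmetry_gap(items, score_understanding, score_generation):
    """Mean understanding score minus mean generation score.
    A gap near zero on matched items would undercut the asymmetry claim."""
    u = [score_understanding(x) for x in items]
    g = [score_generation(x) for x in items]
    return sum(u) / len(u) - sum(g) / len(g)

# Toy stand-ins: pretend per-item scores were already computed.
items = [{"u": 0.92, "g": 0.61},
         {"u": 0.88, "g": 0.55},
         {"u": 0.95, "g": 0.70}]
gap = asymmetry_gap(items, lambda x: x["u"], lambda x: x["g"])
print(round(gap, 3))  # positive gap = understanding outpaces generation
```

The falsification condition above corresponds to this gap vanishing under the best available context engineering.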

read the original abstract

The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Context Engineering as a formal discipline for optimizing contextual information payloads for LLMs, extending beyond prompt engineering. It presents a taxonomy of foundational components (context retrieval and generation, processing, and management) and their integrations into architectures such as RAG, memory systems, tool-integrated reasoning, and multi-agent systems. Drawing on a synthesis of over 1400 papers, the central claim is that LLMs show strong proficiency in understanding complex contexts but pronounced limitations in generating sophisticated long-form outputs, making this asymmetry a defining priority for future research.

Significance. If the taxonomy is comprehensive and the asymmetry observation is representative of the literature, the survey supplies a unified technical roadmap that can orient both researchers and practitioners working on context-aware AI systems. The explicit identification of the understanding-generation gap, grounded in the reviewed body of work, offers a clear focal point for subsequent efforts to balance LLM capabilities.

major comments (2)
  1. [research gap / concluding section] The section discussing the research gap and future priorities: the claim that current models exhibit 'pronounced limitations in generating equally sophisticated, long-form outputs' is presented as emerging directly from the literature synthesis, yet the manuscript does not aggregate or cite specific quantitative benchmarks (e.g., performance deltas on long-form generation tasks versus context-understanding tasks) that would make the asymmetry diagnosis more concrete and testable.
  2. [introduction / methodology overview] The description of the paper-selection process (implicit in the >1400-paper claim): without an explicit statement of search strategy, inclusion/exclusion criteria, or coverage across sub-areas (e.g., proportion of papers on generation versus retrieval), the representativeness of the synthesis—and therefore the robustness of the asymmetry diagnosis—remains difficult to evaluate.
minor comments (3)
  1. [taxonomy section] The taxonomy diagram (if present) or its textual description would benefit from explicit labels or arrows clarifying the data-flow relationships among the three foundational components and the three system-level integrations.
  2. [throughout] Terminology consistency: 'context retrieval and generation' is sometimes written with a slash and sometimes as separate items; standardize phrasing across sections for readability.
  3. [references] A small number of citations appear to be repeated or listed without distinguishing primary from secondary sources; a brief note on citation selection criteria would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey. We agree that both points raised can strengthen the manuscript and will incorporate revisions to address them explicitly. The changes will enhance transparency without altering the core contributions or taxonomy.

read point-by-point responses
  1. Referee: [research gap / concluding section] The section discussing the research gap and future priorities: the claim that current models exhibit 'pronounced limitations in generating equally sophisticated, long-form outputs' is presented as emerging directly from the literature synthesis, yet the manuscript does not aggregate or cite specific quantitative benchmarks (e.g., performance deltas on long-form generation tasks versus context-understanding tasks) that would make the asymmetry diagnosis more concrete and testable.

    Authors: We appreciate this suggestion. The asymmetry observation synthesizes patterns across the reviewed works, where context-understanding benchmarks (e.g., long-context QA and retrieval) consistently show high performance while long-form generation tasks reveal coherence and consistency challenges. In revision, we will expand the concluding section with targeted citations to representative benchmarks, including performance deltas from papers on Needle-in-a-Haystack tests versus long-form writing or summarization evaluations. This will make the claim more concrete and testable while remaining within survey scope. revision: yes
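
The Needle-in-a-Haystack protocol the response cites follows a simple recipe: plant a fact at a known depth in filler text and test whether the model recalls it. A minimal sketch with a toy stand-in for the model call (the harness, not the model, is the point):

```python
# Toy Needle-in-a-Haystack harness. toy_model is a placeholder for
# an LLM call; a real run would sweep depth and context length and
# chart recall accuracy.

def build_haystack(needle, filler_sentences, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def toy_model(context, question):
    """Stand-in for an LLM: succeeds iff the needle survives in context."""
    return "the magic number is 7" in context

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "Remember: the magic number is 7."
results = {d: toy_model(build_haystack(needle, filler, d),
                        "What is the magic number?")
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
print(results)
```

Contrasting near-ceiling recall on this kind of test with scores on long-form writing benchmarks is the "performance delta" the referee asks the authors to aggregate.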

  2. Referee: [introduction / methodology overview] The description of the paper-selection process (implicit in the >1400-paper claim): without an explicit statement of search strategy, inclusion/exclusion criteria, or coverage across sub-areas (e.g., proportion of papers on generation versus retrieval), the representativeness of the synthesis—and therefore the robustness of the asymmetry diagnosis—remains difficult to evaluate.

    Authors: We agree that an explicit methodology description will improve evaluability. We will add a dedicated 'Survey Methodology' subsection (or appendix) detailing the search strategy (keywords across arXiv, ACL, and NeurIPS from 2018–2024), inclusion/exclusion criteria (peer-reviewed empirical or technical papers on context techniques), and approximate coverage breakdowns by category (retrieval, processing, management, and integrated systems). This addition will directly support assessment of the synthesis. revision: yes
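
The promised methodology subsection amounts to a reproducible filter over candidate papers. A hedged sketch of what those criteria could look like in code (field names and values are illustrative, taken only from the rebuttal's own description):

```python
from collections import Counter

# Illustrative inclusion filter mirroring the criteria the authors
# list: 2018-2024, target venues, context-technique focus. All field
# names and the candidate records are hypothetical.

def include(paper):
    venues = {"arXiv", "ACL", "NeurIPS"}
    topical = {"retrieval", "processing", "management", "integrated systems"}
    return (2018 <= paper["year"] <= 2024
            and paper["venue"] in venues
            and paper["category"] in topical)

candidates = [
    {"year": 2023, "venue": "ACL", "category": "retrieval"},
    {"year": 2016, "venue": "ACL", "category": "retrieval"},   # too early
    {"year": 2024, "venue": "NeurIPS", "category": "management"},
]
kept = [p for p in candidates if include(p)]
print(Counter(p["category"] for p in kept))
```

Publishing the filter alongside category counts would let readers check whether generation-side work is underrepresented relative to retrieval, which is the referee's representativeness concern.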

Circularity Check

0 steps flagged

No significant circularity

full rationale

This survey synthesizes over 1400 prior works into a taxonomy of context retrieval, processing, management, and system integrations such as RAG and multi-agent setups. The claimed asymmetry between strong context understanding and weaker long-form generation is presented as an observational conclusion from that literature review, with no equations, fitted parameters, formal derivations, or predictions that reduce to the paper's own inputs by construction. All load-bearing steps are descriptive summaries of external research; no self-citation chains or ansatzes are invoked to force the central gap diagnosis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because this is a survey, the central claims rest on the completeness and representativeness of the reviewed literature; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5546 in / 1179 out tokens · 34343 ms · 2026-05-13T20:54:26.041174+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  3. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  4. From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.

  5. Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair

    cs.AR 2026-04 unverdicted novelty 7.0

    Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.

  6. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

  7. PREPING: Building Agent Memory without Tasks

    cs.AI 2026-05 unverdicted novelty 6.0

    Preping builds agent memory via proposer-guided synthetic practice and selective validation, matching offline/online methods at 2-3x lower deployment cost.

  8. S^2tory: Story Spine Distillation for Movie Script Summarization

    cs.CL 2026-05 unverdicted novelty 6.0

    S^2tory uses narratological theory and a Narrative Expert Agent to identify plot nuclei in movie scripts for high-fidelity summarization at 3.5x compression, with strong zero-shot generalization to books.

  9. CL-bench Life: Can Language Models Learn from Real-Life Context?

    cs.CL 2026-04 unverdicted novelty 6.0

    CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

  10. From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

    cs.CR 2026-04 unverdicted novelty 6.0

    Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...

  11. AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.

  12. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  13. Contexty: Capturing and Organizing In-situ Thoughts for Context-Aware AI Support

    cs.HC 2026-04 unverdicted novelty 6.0

    Contexty captures users' cognitive traces as editable snippets and organizes them to enable more effective, user-controlled context-aware AI collaboration during complex tasks.

  14. VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

    cs.CV 2026-04 unverdicted novelty 6.0

    VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

  15. Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A small recency window of 3-5 prior ADRs as context produces higher-fidelity LLM-generated Architecture Decision Records than no context, full history, or retrieval-augmented selection in typical sequential workflows.

  16. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  17. ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop

    cs.CV 2026-04 unverdicted novelty 6.0

    ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.

  18. Reflective Context Learning: Studying the Optimization Primitives of Context Space

    cs.LG 2026-04 unverdicted novelty 6.0

    Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...

  19. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 5.0

    Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...

  20. VIP-COP: Context Optimization for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...

  21. Towards Agentic Investigation of Security Alerts

    cs.CR 2026-04 unverdicted novelty 5.0

    An agentic LLM workflow with overview queries, query selection, evidence extraction, and verdict generation achieves significantly higher accuracy on security alert investigation than direct LLM use.

  22. Human-Inspired Context-Selective Multimodal Memory for Social Robots

    cs.AI 2026-04 unverdicted novelty 5.0

    A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

  23. CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning

    cs.CL 2026-04 unverdicted novelty 5.0

    CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factua...

  24. Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...

  25. Context Collapse: Barriers to Adoption for Generative AI in Workplace Settings

    cs.CY 2026-04 unverdicted novelty 5.0

    Expert interviews demonstrate that context in generative AI workplace use collapses or rots over time, limiting tool effectiveness and revealing pitfalls in computational context approaches.

  26. Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants

    cs.SE 2026-04 unverdicted novelty 4.0

    Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.

Reference graph

Works this paper leans on

292 extracted references · 292 canonical work pages · cited by 25 Pith papers · 12 internal anchors

  1. [1]

    https:// agent-network-protocol.com/specs/communication.html

    Anp-agent communication meta-protocol specification(draft). https:// agent-network-protocol.com/specs/communication.html. [Online; accessed 17- July-2025]

  2. [2]

    S. A. Automating human evaluation of dialogue systems.North American Chapter of the Association for Computational Linguistics, 2022

  3. [3]

    Qaraqe, and E

    Samir Abdaljalil, Hasan Kurban, Khalid A. Qaraqe, and E. Serpedin. Theorem-of-thought: A multi- agent framework for abductive, deductive, and inductive reasoning in language models. arXiv preprint, 2025

  4. [4]

    Rankify: A comprehensive python toolkit for retrieval, re-ranking, and retrieval-augmented genera- tion, arXiv preprint arXiv:2502.02464, 2025

    Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. Rankify: A comprehensive python toolkit for retrieval, re-ranking, and retrieval-augmented genera- tion, arXiv preprint arXiv:2502.02464, 2025. URLhttps://arxiv.org/abs/2502.02464v3

  5. [5]

    Bhargav, M

    Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matt Stallone, Rameswar Panda, Yara Rizk, G. Bhargav, M. Crouse, Chulaka Gunasekara, S. Ikbal, Sachin Joshi, Hima P. Karanam, Vineet Kumar, Asim Munawar, S. Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, P. Venkateswaran, Merve Unuvar, David Cox, S. Roukos, Luis A...

  6. [6]

    Acharya, Karthigeyan Kuppan, and Divya Bhaskaracharya

    D. Acharya, Karthigeyan Kuppan, and Divya Bhaskaracharya. Agentic ai: Autonomous intelligence for complex goals—a comprehensive survey.IEEE Access, 2025

  7. [7]

    Tallyqa: Answering complex counting questions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. AAAI Conference on Artificial Intelligence, 2018

  8. [8]

    Star attention: Efficient llm inference over long sequences, arXiv preprint arXiv:2411.17116, 2024

    Shantanu Acharya, Fei Jia, and Boris Ginsburg. Star attention: Efficient llm inference over long sequences, arXiv preprint arXiv:2411.17116, 2024. URLhttps://arxiv.org/abs/2411. 17116v3. 59

  9. [9]

    Preprint, arXiv:2502.08820

    Emre Can Acikgoz, Jeremy Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tur, and Gokhan Tur. Can a single model master both multi-turn conversations and tool use? coalm: A unified conversational agentic language model, arXiv preprint arXiv:2502.08820, 2025. URLhttps://arxiv.org/abs/2502.08820v3

  10. [10]

    A desideratum for conversational agents: Capabilities, challenges, and future directions, arXiv preprint arXiv:2504.16939, 2025

    Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani- Tur, and Gokhan Tur. A desideratum for conversational agents: Capabilities, challenges, and future directions, arXiv preprint arXiv:2504.16939, 2025. URLhttps://arxiv.org/abs/2504. 16939v1

  11. [11]

    Anum Afzal, Juraj Vladika, Gentrit Fazlija, Andrei Staradubets, and Florian Matthes. Towards opti- mizing a retrieval augmented generation using large language model on academic data.International Conference on Natural Language Processing and Information Retrieval, 2024

  12. [12]

    Azad, and P

    Ankush Agarwal, Sakharam Gawade, A. Azad, and P. Bhattacharyya. Kitlm: Domain-specific knowledge integration into language models for question answering.ICON, 2023

  13. [13]

    Large scale knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training

    Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Large scale knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. arXiv preprint, 2020

  14. [14]

    Hegselmann, Hunter Lang, Yoon Kim, and D

    Monica Agrawal, S. Hegselmann, Hunter Lang, Yoon Kim, and D. Sontag. Large language models are few-shot clinical information extractors.Conference on Empirical Methods in Natural Language Processing, 2022

  15. [15]

    Mcp bridge: A lightweight, llm-agnostic restful proxy for model context protocol servers.arXiv preprint arXiv:2504.08999,

    Arash Ahmadi, S. Sharif, and Yaser Mohammadi Banadaki. Mcp bridge: A lightweight, llm-agnostic restful proxy for model context protocol servers, arXiv preprint arXiv:2504.08999, 2025. URL https://arxiv.org/abs/2504.08999v1

  16. [16]

    Ainslie, J

    J. Ainslie, J. Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr’on, and Sumit K. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.Conference on Empirical Methods in Natural Language Processing, 2023

  17. [17]

    Multi-agent system concepts theory and application phases

    Adel Al-Jumaily. Multi-agent system concepts theory and application phases. arXiv preprint, 2006

  18. [18]

    Position interpolation improves alibi extrapolation

    Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, and Joel Hestness. Position interpolation improves alibi extrapolation. arXiv preprint, 2023

  19. [19]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, A. Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricar...

  20. [20]

    Albrecht and P

    Stefano V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems.Artificial Intelligence, 2017

  21. [21]

    Understanding the Challenges and Opportunities of Generative AI Apps: An Empirical Study

    Buthayna AlMulla, Maram Assi, and Safwat Hassan. Understanding the challenges and promises of developing generative ai apps: An empirical study, arXiv preprint arXiv:2506.16453, 2025. URL https://arxiv.org/abs/2506.16453v2. 60

  22. [22]

    Alsuhaibani, Christian D

    Reem S. Alsuhaibani, Christian D. Newman, M. J. Decker, Michael L. Collard, and Jonathan I. Maletic. On the naming of methods: A survey of professional developers.International Conference on Software Engineering, 2021

  23. [23]

    Giorgini, A

    Francesco Alzetta, P. Giorgini, A. Najjar, M. Schumacher, and Davide Calvaresi. In-time explainability in multi-agent systems: Challenges, opportunities, and roadmap.EXTRAAMAS@AAMAS, 2020

  24. [24]

    Lüth, Paul F

    Kenza Amara, Lukas Klein, Carsten T. Lüth, Paul F. Jäger, Hendrik Strobelt, and Mennatallah El- Assady. Why context matters in vqa and reasoning: Semantic interventions for vlm input modalities, arXiv preprint arXiv:2410.01690v1, 2024. URLhttps://arxiv.org/abs/2410.01690v1

  25. [25]

    Prompt design and engineering: Introduction and advanced methods, arXiv preprint arXiv:2401.14423, 2024

    Xavier Amatriain. Prompt design and engineering: Introduction and advanced methods, arXiv preprint arXiv:2401.14423, 2024. URLhttps://arxiv.org/abs/2401.14423v4

  26. [26]

    Dawn: Designing distributed agents in a worldwide network, arXiv preprint arXiv:2410.22339, 2024

    Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, and Mahesh Viswanathan. Dawn: Designing distributed agents in a worldwide network, arXiv preprint arXiv:2410.22339, 2024. URL https://arxiv.org/abs/2410.22339v3

  27. [27]

    Why does the effective context length of llms fall short?International Conference on Learning Representations, 2024

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short?International Conference on Learning Representations, 2024

  28. [28]

    Thread: A logic-based data organization paradigm for how-to question answering with retrieval augmented generation.arXiv preprint arXiv:2406.13372, 2024

    Kaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Shuzheng Si, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, et al. Thread: A logic-based data organization paradigm for how-to question answering with retrieval augmented generation.arXiv preprint arXiv:2406.13372, 2024

  29. [29]

    Nissist: An incident mitigation copilot based on troubleshooting guides

    Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, et al. Nissist: An incident mitigation copilot based on troubleshooting guides. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2024), pages 4471–4474, 2024

  30. [30]

    Ultraif: Advancing instruction following from the wild

    Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, and Baobao Chang. Ultraif: Advancing instruction following from the wild. pages 7930–7957, 2025

  31. [31]

    Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, and Paul Hongsuck Seo. Lcirc: A recurrent compression approach for efficient long-form context and query dependent modeling in llms.North American Chapter of the Association for Computational Linguistics, 2025

  32. [32]

    Dynamic context pruning for efficient and interpretable autoregressive transformers.Neural Information Processing Systems, 2023

    Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurélien Lucchi, and Thomas Hof- mann. Dynamic context pruning for efficient and interpretable autoregressive transformers.Neural Information Processing Systems, 2023

  33. [33]

    Anderson, M

    John R. Anderson, M. Matessa, and C. Lebiere. Act-r: A theory of higher level cognition and its relation to visual attention.Hum. Comput. Interact., 1997

  34. [34]

    Language models as agent models.Conference on Empirical Methods in Natural Language Processing, 2022

    Jacob Andreas. Language models as agent models.Conference on Empirical Methods in Natural Language Processing, 2022

  35. [35]

    Baldoni, and Leonardo Querzoni

    Leonardo Aniello, R. Baldoni, and Leonardo Querzoni. Adaptive online scheduling in storm.Dis- tributed Event-Based Systems, 2013. 61

  36. [36]

    Arigraph: Learning knowledge graph world models with episodic memory for llm agents

    Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, M. Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for llm agents, arXiv preprint arXiv:2407.04363, 2024. URLhttps://arxiv.org/abs/2407.04363v3

  37. [37]

    Introducing the model context protocol, November 2024

    Anthropic. Introducing the model context protocol, November 2024. URL https://www. anthropic.com/news/model-context-protocol. [Online; accessed 17-July-2025]

  38. [38]

    RM Aratchige and Dr. Wmks Ilmini. Llms working in harmony: A survey on the technological aspects of building effective llm-based multi agent systems, arXiv preprint arXiv:2504.01963, 2025. URL https://arxiv.org/abs/2504.01963v1

  39. [39]

    Leo Ardon, Daniel Furelos-Blanco, and A. Russo. Learning reward machines in cooperative multi-agent tasks. AAMAS Workshops, 2023

  40. [40]

    K. Armeni, C. Honey, and Tal Linzen. Characterizing verbatim short-term memory in neural language models. Conference on Computational Natural Language Learning, 2022

  41. [41]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. International Conference on Learning Representations, 2023

  42. [42]

    Hikaru Asano, Tadashi Kozuno, and Yukino Baba. Self iterative label refinement via robust unlabeled learning, arXiv preprint arXiv:2502.12565, 2025. URL https://arxiv.org/abs/2502.12565v1

  43. [43]

    Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, and Bing Xiang. Bifurcated attention: Accelerating massively parallel decoding with shared prefixes in llms, arXiv preprint arXiv:2403.08845, 2024. URL https://arxi...

  44. [44]

    Avinash Ayalasomayajula, Rui Guo, Jingbo Zhou, Sujan Kumar Saha, and Farimah Farahmandi. Lasp: Llm assisted security property generation for soc verification. Workshop on Machine Learning for CAD, 2024

  45. [45]

    Simon A. Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. arXiv preprint, 2025

  46. [46]

    Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bozorgpour, A. Kazerouni, I. Rekik, and D. Merhof. Foundational models in medical imaging: A comprehensive survey and future vision, arXiv preprint arXiv:2310.18689, 2023. URL https://arxiv.org/abs/2310.18689v1

  47. [47]

    Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. Transformers for tabular data representation: A survey of models and applications. Transactions of the Association for Computational Linguistics, 2023

  48. [48]

    Jinheon Baek, N. Chandrasekaran, Silviu Cucerzan, Allen Herring, and S. Jauhar. Knowledge-augmented large language models for personalized contextual query suggestion. The Web Conference, 2023

  49. [49]

    Tianyi Bai, Hao Liang, Binwang Wan, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Conghui He, Binhang Yuan, and Wentao Zhang. A survey of multimodal large language model from a data-centric perspective, arXiv preprint arXiv:2405.16640v2, 2024. URL https://arxiv.org/abs/2405.16640v2

  50. [50]

    Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, and Jackie Chi Kit Cheung. Citrus: Chunked instruction-aware state eviction for long sequence modeling. Conference on Empirical Methods in Natural Language Processing, 2024

  51. [51]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  52. [52]

    Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, O. R. Terrades, and Josep Lladós. Globaldoc: A cross-modal vision-language framework for real-world document image retrieval and classification. IEEE Workshop/Winter Conference on Applications of Computer Vision, 2023

  53. [53]

    Jayachandu Bandlamudi, K. Mukherjee, Prerna Agarwal, Sampath Dechu, Siyu Huo, Vatche Isahagian, Vinod Muthusamy, N. Purushothaman, and Renuka Sindhgatta. Towards hybrid automation by bootstrapping conversational interfaces for IT operation tasks. AAAI Conference on Artificial Intelligence, 2023

  54. [54]

    Jayachandu Bandlamudi, Kushal Mukherjee, Prerna Agarwal, Ritwik Chaudhuri, R. Pimplikar, Sampath Dechu, Alex Straley, Anbumunee Ponniah, and Renuka Sindhgatta. Building conversational artifacts to enable digital assistant for apis and rpas. AAAI Conference on Artificial Intelligence, 2024

  55. [55]

    Keqin Bao, Jizhi Zhang, Xinyu Lin, Yang Zhang, Wenjie Wang, and Fuli Feng. Large language models for recommendation: Past, present, and future. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024

  56. [56]

    Sara Di Bartolomeo, Giorgio Severi, V. Schetinger, and Cody Dunne. Ask and you shall receive (a graph drawing): Testing chatgpt’s potential to apply graph layout algorithms. Eurographics Conference on Visualization, 2023

  57. [57]

    Saikat Barua. Exploring autonomous agents through the lens of large language models: A review, arXiv preprint arXiv:2404.04442, 2024. URL https://arxiv.org/abs/2404.04442v1

  58. [58]

    Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford, M. Crouse, Kiran Kate, Sadhana Kumaravel, Saurabh Goyal, Asim Munawar, Yara Rizk, Xin Wang, Luis A. Lastras, and P. Kapanipathi. Nestful: A benchmark for evaluating llms on nested sequences of api calls, arXiv preprint arXiv:2409.03797, 2024. URL https://arxiv.org/abs/2409.03797v3

  59. [59]

    Amin Beheshti. Natural language-oriented programming (nlop): Towards democratizing software creation. 2024 IEEE International Conference on Software Services Engineering (SSE), 2024

  60. [60]

    Azadeh Beiranvand and S. M. Vahidipour. Integrating structural and semantic signals in text-attributed graphs with bigtex, arXiv preprint arXiv:2504.12474, 2025. URL https://arxiv.org/abs/2504.12474v2

  61. [61]

    Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba. International Conference on Learning Representations, 2024

  62. [62]

    Assaf Ben-Kish, Itamar Zimerman, M. J. Mirza, James R. Glass, Leonid Karlinsky, and Raja Giryes. Overflow prevention enhances long-context recurrent llms. arXiv preprint, 2025

  63. [63]

    M. Benna and Stefano Fusi. Complex synapses as efficient memory systems. BMC Neuroscience, 2015

  64. [64]

    M. Benna and Stefano Fusi. Computational principles of biological memory, arXiv preprint arXiv:1507.07580, 2015. URL https://arxiv.org/abs/1507.07580v1

  65. [65]

    Shelly Bensal, Umar Jamil, Christopher Bryant, M. Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, and Waseem Alshikh. Reflect, retry, reward: Self-improving llms via reinforcement learning, arXiv preprint arXiv:2505.24726, 2025. URL https://arxiv.org/abs/2505.24726v1

  66. [66]

    Idoia Berges, J. Bermúdez, A. Goñi, and A. Illarramendi. Semantic web technology for agent communication protocols. Extended Semantic Web Conference, 2008

  67. [67]

    Gaurav Beri and Vaishnavi Srivastava. Advanced techniques in prompt engineering for large language models: A comprehensive study. 2024 IEEE 4th International Conference on ICT in Business Industry & Government (ICTBIG), 2024

  68. [68]

    Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-range transformers with unlimited length input. Neural Information Processing Systems, 2023

  69. [69]

    Maciej Besta, Nils Blach, Aleš Kubíček, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, H. Niewiadomski, P. Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. AAAI Conference on Artificial Intelligence, 2023

  70. [70]

    Gregor Betz and Kyle Richardson. Judgment aggregation, discursive dilemma and reflective equilibrium: Neural language models as self-improving doxastic agents. Frontiers in Artificial Intelligence, 2022

  71. [71]

    L. Bezalel, Eyal Orgad, and Amir Globerson. Teaching models to improve on tape. AAAI Conference on Artificial Intelligence, 2024

  72. [72]

    Umang Bhatt, Sanyam Kapoor, Mihir Upadhyay, Ilia Sucholutsky, Francesco Quinzan, Katherine M. Collins, Adrian Weller, Andrew Gordon Wilson, and Muhammad Bilal Zafar. When should we orchestrate multiple agents?, arXiv preprint arXiv:2503.13577, 2025. URL https://arxiv.org/abs/2503.13577v1

  73. [73]

    Baolong Bi, Shaohan Huang, Yiwei Wang, Tianchi Yang, Zihan Zhang, Haizhen Huang, Lingrui Mei, Junfeng Fang, Zehao Li, Furu Wei, et al. Context-dpo: Aligning language models for context-faithfulness. ACL 2025, 2024

  74. [74]

    Baolong Bi, Shenghua Liu, Lingrui Mei, Yiwei Wang, Pengliang Ji, and Xueqi Cheng. Decoding by contrasting knowledge: Enhancing llms’ confidence on edited facts. ACL 2025, 2024

  75. [75]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, and Xueqi Cheng. Lpnl: Scalable link prediction with large language models. ACL 2024, 2024

  76. [76]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Junfeng Fang, and Xueqi Cheng. Struedit: Structured outputs enable the fast and accurate knowledge editing for large language models. 2024

  77. [77]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Yilong Xu, and Xueqi Cheng. Adaptive token biaser: Knowledge editing via biasing key entities. EMNLP 2024, 2024

  78. [78]

    Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, and Xueqi Cheng. Refinex: Learning to refine pre-training data at scale from expert-guided programs. 2025

  79. [79]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Junfeng Fang, Hongcheng Gao, Shiyu Ni, and Xueqi Cheng. Is factuality enhancement a free lunch for llms? better factuality can lead to worse context-faithfulness. ICLR 2025, 2025

  80. [80]

    Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, and Xueqi Cheng. Parameters vs. context: Fine-grained control of knowledge reliance in language models. 2025

Showing first 80 references.