pith. machine review for the scientific record.

arxiv: 2604.08224 · v1 · submitted 2026-04-09 · 💻 cs.SE · cs.MA

Recognition: unknown

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

classification 💻 cs.SE cs.MA
keywords LLM agents · externalization · memory · skills · protocols · harness · cognitive artifacts · agent infrastructure

The pith

LLM agents advance primarily by externalizing cognitive tasks into memory, skills, protocols, and harness systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review argues that the real progress in LLM agents comes from reorganizing the runtime environment around the model rather than altering its weights. Memory externalizes state across time, skills externalize procedures, protocols externalize interaction structure, and harness engineering coordinates them, turning hard internal computations into forms the model can handle reliably. The paper traces this as a shift from weights to context to harness, offering a unifying framework based on cognitive artifacts. This matters because it reframes agent design as an infrastructure problem: even current models can be made more capable without waiting for bigger ones.

Core claim

The paper argues that agent infrastructure matters because it transforms hard cognitive burdens into forms that the model can solve more reliably. Memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering unifies them into governed execution. This yields a systems-level view of why practical agent progress depends on better external cognitive infrastructure alongside stronger models, covering the trade-off between parametric and externalized capability and directions such as self-evolving harnesses.
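The four mechanisms named above can be sketched as a toy harness. This is an illustrative reading of the framework, not code from the paper; every name here (`Memory`, `SKILLS`, `harness`, `toy_model`) is hypothetical.

```python
# Toy sketch of the review's four externalization mechanisms: memory holds
# state across calls, skills hold procedures, a message schema plays the role
# of a protocol, and the harness coordinates them into governed execution.

class Memory:
    """Externalized state: persists across model invocations."""
    def __init__(self):
        self._store = {}
    def write(self, key, value):
        self._store[key] = value
    def read(self, key, default=None):
        return self._store.get(key, default)

SKILLS = {}  # externalized procedural expertise: named, reusable routines

def skill(name):
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("summarize")
def summarize(text: str) -> str:
    return text[:40] + ("..." if len(text) > 40 else "")

def make_message(role: str, content: str) -> dict:
    """Externalized interaction structure: every turn follows one schema."""
    return {"role": role, "content": content}

def harness(model, task: str, memory: Memory) -> str:
    """Coordinates memory, skills, and protocol; the model only decides."""
    history = memory.read("history", [])
    history.append(make_message("user", task))
    action = model(history)                # model proposes, harness disposes
    if action["skill"] in SKILLS:          # governance: only registered skills run
        result = SKILLS[action["skill"]](action["arg"])
    else:
        result = f"refused: unknown skill {action['skill']!r}"
    history.append(make_message("assistant", result))
    memory.write("history", history)       # state survives this invocation
    return result

# Stand-in for an LLM: always proposes the "summarize" skill.
def toy_model(history):
    return {"skill": "summarize", "arg": history[-1]["content"]}

mem = Memory()
out = harness(toy_model, "Externalization moves cognition into infrastructure.", mem)
```

The point of the sketch is the division of labor: the model emits only a decision, while persistence, procedure, schema, and governance all live outside its weights.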

What carries the argument

The unifying lens of externalization as cognitive artifacts, with memory, skills, protocols, and harness as the four key mechanisms that offload and restructure cognitive demands.

If this is right

  • Agent development will focus more on designing reliable external modules that models can use effectively.
  • Evaluation metrics will shift to assess coordination and governance provided by the harness.
  • New agent systems may incorporate self-evolving harnesses that adapt infrastructure dynamically.
  • Shared infrastructure layers could allow different models to benefit from common externalized components.
  • Long-term co-evolution between models and their external environments will become a key research area.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that benchmarks for agents should include controlled experiments varying only the harness to measure its isolated impact.
  • Interoperability standards for protocols could accelerate adoption across different LLM platforms.
  • The framework implies potential limits to pure scaling of models without corresponding infrastructure advances.
  • Open challenges in governance may require new frameworks for auditing externalized components in agents.
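The first bullet above suggests varying only the harness to measure its isolated impact. A minimal version of that ablation, with an entirely hypothetical fixed model and two harnesses that differ only in whether they supply external memory, might look like:

```python
# Hypothetical harness-ablation sketch: hold the model fixed, vary only the
# external infrastructure, and compare success rates. The model, tasks, and
# harnesses are all illustrative stand-ins.

def toy_model(prompt: str, context: str) -> str:
    # Fixed "LLM": succeeds only when the needed fact appears in its context.
    return "42" if "answer=42" in context else "unknown"

def bare_harness(task: str) -> str:
    return toy_model(task, context="")        # no externalized memory

def memory_harness(task: str) -> str:
    memory = "answer=42"                      # external store injects state
    return toy_model(task, context=memory)

def run_trials(harness, tasks) -> float:
    hits = sum(harness(t) == "42" for t in tasks)
    return hits / len(tasks)

tasks = ["recall the stored answer"] * 20
bare_score = run_trials(bare_harness, tasks)      # model alone
memory_score = run_trials(memory_harness, tasks)  # same model, richer harness
isolated_impact = memory_score - bare_score       # harness contribution
```

Because the model is identical in both conditions, any difference in score is attributable to the harness, which is the controlled comparison the editorial note calls for.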

Load-bearing premise

That the progression from weights to context to harness, viewed through cognitive artifacts, provides a unifying and accurate explanation for why agent systems improve in practice.

What would settle it

Finding a set of major agent capabilities that improved substantially through model fine-tuning or scaling alone, without corresponding advances in memory, skills, protocols, or harness design.

read the original abstract

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the surrounding harness that makes these modules reliable in practice. This paper reviews that shift through the lens of externalization. Drawing on the idea of cognitive artifacts, we argue that agent infrastructure matters not merely because it adds auxiliary components, but because it transforms hard cognitive burdens into forms that the model can solve more reliably. Under this view, memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering serves as the unification layer that coordinates them into governed execution. We trace a historical progression from weights to context to harness, analyze memory, skills, and protocols as three distinct but coupled forms of externalization, and examine how they interact inside a larger agent system. We further discuss the trade-off between parametric and externalized capability, identify emerging directions such as self-evolving harnesses and shared agent infrastructure, and discuss open challenges in evaluation, governance, and the long-term co-evolution of models and external infrastructure. The result is a systems-level framework for explaining why practical agent progress increasingly depends not only on stronger models, but on better external cognitive infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reviews the development of LLM agents as a process of externalizing cognitive burdens from model weights into runtime infrastructure. It frames this shift using cognitive artifacts: memory externalizes state across time, skills externalize procedural knowledge, protocols externalize interaction structure, and harness engineering coordinates these into reliable execution. The manuscript traces a historical progression from weights to context to harness, analyzes the three externalization forms and their interactions, discusses parametric vs. externalized capability trade-offs, and identifies open challenges in evaluation, governance, and co-evolution of models with infrastructure.

Significance. If the externalization lens holds, the paper supplies a coherent systems-level framework that organizes disparate agent-engineering practices and explains why infrastructure improvements often yield more reliable gains than scale alone. It synthesizes trends across memory, skills, and protocols literature into a single narrative, which could help researchers and practitioners prioritize harness design and shared infrastructure. The review format itself is a strength, as it avoids new empirical claims while highlighting falsifiable directions such as self-evolving harnesses.

minor comments (3)
  1. The historical progression section would benefit from an explicit timeline or table summarizing key milestones (e.g., early ReAct-style agents vs. later tool-use harnesses) to make the weights-to-context-to-harness arc easier to follow.
  2. In the trade-off discussion, clarify whether the parametric/externalized distinction is treated as a strict dichotomy or a continuum; the current phrasing risks implying zero-sum dynamics without addressing hybrid approaches that retain some parametric capability.
  3. The open challenges subsection on evaluation could reference specific existing benchmarks (e.g., AgentBench or WebArena) and note how the externalization view would change what those benchmarks measure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of the manuscript, which correctly identifies the externalization lens as the central organizing principle. The recommendation for minor revision is noted, and we will incorporate any editorial or presentational improvements in the revised version.

Circularity Check

0 steps flagged

No significant circularity: conceptual review without derivations or fitted predictions

full rationale

The paper is a literature review that organizes existing trends in LLM agent design under an interpretive framework of externalization (memory for state, skills for procedures, protocols for interaction, harness for coordination). It draws on the idea of cognitive artifacts but presents no new equations, quantitative predictions, parameter fits, or first-principles derivations. The central claim—that external infrastructure transforms cognitive burdens into more solvable forms—is interpretive and cites prior work without reducing any result to a self-referential definition or fitted input renamed as prediction. No load-bearing self-citation chains, uniqueness theorems, or ansatzes are invoked in a way that creates circularity. The acknowledged trade-offs further keep the framing non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the established concept of cognitive artifacts from prior literature and standard assumptions in AI systems research without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption: cognitive artifacts from prior literature can be productively applied to frame externalization in LLM agents.
    The abstract explicitly draws on this idea to argue that infrastructure transforms cognitive burdens.

pith-pipeline@v0.9.0 · 5619 in / 1087 out tokens · 65152 ms · 2026-05-10T17:35:29.425409+00:00 · methodology

discussion (0)


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  2. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  3. Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference a...

  4. CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

    cs.LG 2026-05 unverdicted novelty 6.0

    CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...

  5. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  6. MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation

    cs.SD 2026-04 unverdicted novelty 6.0

    MeloTune implements learned per-listener Personal Arousal Functions and mesh memory protocols on mobile devices to predict affective trajectories and enable peer-coupled proactive music selection, reporting 96.6% patt...

  7. Harness Engineering as Categorical Architecture

    cs.PL 2026-05 unverdicted novelty 5.0

    Categorical Architecture triple (G, Know, Phi) supplies the formal theory for composing LLM agent harnesses with structurally preserved certificates.

  8. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

    cs.AI 2026-05 unverdicted novelty 4.0

    Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

  9. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  10. Memory as Metabolism: A Design for Companion Knowledge Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...

Reference graph

Works this paper leans on

200 extracted references · 129 canonical work pages · cited by 10 Pith papers · 52 internal anchors
