pith. machine review for the scientific record.

arxiv: 2604.08224 · v1 · submitted 2026-04-09 · 💻 cs.SE · cs.MA

Recognition: unknown

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

classification 💻 cs.SE cs.MA
keywords LLM agents · externalization · memory · skills · protocols · harness · cognitive artifacts · agent infrastructure

The pith

LLM agents advance primarily by externalizing cognitive tasks into memory, skills, protocols, and harness systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review argues that the real progress in LLM agents comes from reorganizing the runtime environment around the model rather than altering its weights. Memory externalizes state across time, skills externalize procedures, protocols externalize interaction structure, and harness engineering coordinates them, turning hard internal computations into forms the model can handle reliably. The paper traces this as a shift from weights to context to harness, offering a unifying framework based on cognitive artifacts. This matters because it reframes agent design as an infrastructure problem: even current models can be made more capable without waiting for bigger ones.

Core claim

The paper argues that agent infrastructure matters because it transforms hard cognitive burdens into forms that the model can solve more reliably. Memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering unifies them into governed execution. This yields a systems-level view of why practical agent progress depends on better external cognitive infrastructure alongside stronger models, covering the trade-off between parametric and externalized capability and directions such as self-evolving harnesses.
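The four mechanisms named above can be sketched as a toy harness. This is an illustrative reading of the framework, not code from the paper; every name here (`Memory`, `SKILLS`, `harness`, `toy_model`) is hypothetical.

```python
# Toy sketch of the review's four externalization mechanisms: memory holds
# state across calls, skills hold procedures, a message schema plays the role
# of a protocol, and the harness coordinates them into governed execution.

class Memory:
    """Externalized state: persists across model invocations."""
    def __init__(self):
        self._store = {}
    def write(self, key, value):
        self._store[key] = value
    def read(self, key, default=None):
        return self._store.get(key, default)

SKILLS = {}  # externalized procedural expertise: named, reusable routines

def skill(name):
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("summarize")
def summarize(text: str) -> str:
    return text[:40] + ("..." if len(text) > 40 else "")

def make_message(role: str, content: str) -> dict:
    """Externalized interaction structure: every turn follows one schema."""
    return {"role": role, "content": content}

def harness(model, task: str, memory: Memory) -> str:
    """Coordinates memory, skills, and protocol; the model only decides."""
    history = memory.read("history", [])
    history.append(make_message("user", task))
    action = model(history)                # model proposes, harness disposes
    if action["skill"] in SKILLS:          # governance: only registered skills run
        result = SKILLS[action["skill"]](action["arg"])
    else:
        result = f"refused: unknown skill {action['skill']!r}"
    history.append(make_message("assistant", result))
    memory.write("history", history)       # state survives this invocation
    return result

# Stand-in for an LLM: always proposes the "summarize" skill.
def toy_model(history):
    return {"skill": "summarize", "arg": history[-1]["content"]}

mem = Memory()
out = harness(toy_model, "Externalization moves cognition into infrastructure.", mem)
```

The point of the sketch is the division of labor: the model emits only a decision, while persistence, procedure, schema, and governance all live outside its weights.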

What carries the argument

The unifying lens of externalization as cognitive artifacts, with memory, skills, protocols, and harness as the four key mechanisms that offload and restructure cognitive demands.

If this is right

  • Agent development will focus more on designing reliable external modules that models can use effectively.
  • Evaluation metrics will shift to assess coordination and governance provided by the harness.
  • New agent systems may incorporate self-evolving harnesses that adapt infrastructure dynamically.
  • Shared infrastructure layers could allow different models to benefit from common externalized components.
  • Long-term co-evolution between models and their external environments will become a key research area.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that benchmarks for agents should include controlled experiments varying only the harness to measure its isolated impact.
  • Interoperability standards for protocols could accelerate adoption across different LLM platforms.
  • The framework implies potential limits to pure scaling of models without corresponding infrastructure advances.
  • Open challenges in governance may require new frameworks for auditing externalized components in agents.
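The first bullet above suggests varying only the harness to measure its isolated impact. A minimal version of that ablation, with an entirely hypothetical fixed model and two harnesses that differ only in whether they supply external memory, might look like:

```python
# Hypothetical harness-ablation sketch: hold the model fixed, vary only the
# external infrastructure, and compare success rates. The model, tasks, and
# harnesses are all illustrative stand-ins.

def toy_model(prompt: str, context: str) -> str:
    # Fixed "LLM": succeeds only when the needed fact appears in its context.
    return "42" if "answer=42" in context else "unknown"

def bare_harness(task: str) -> str:
    return toy_model(task, context="")        # no externalized memory

def memory_harness(task: str) -> str:
    memory = "answer=42"                      # external store injects state
    return toy_model(task, context=memory)

def run_trials(harness, tasks) -> float:
    hits = sum(harness(t) == "42" for t in tasks)
    return hits / len(tasks)

tasks = ["recall the stored answer"] * 20
bare_score = run_trials(bare_harness, tasks)      # model alone
memory_score = run_trials(memory_harness, tasks)  # same model, richer harness
isolated_impact = memory_score - bare_score       # harness contribution
```

Because the model is identical in both conditions, any difference in score is attributable to the harness, which is the controlled comparison the editorial note calls for.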

Load-bearing premise

That the progression from weights to context to harness, viewed through cognitive artifacts, provides a unifying and accurate explanation for why agent systems improve in practice.

What would settle it

Finding a set of major agent capabilities that improved substantially through model fine-tuning or scaling alone, without corresponding advances in memory, skills, protocols, or harness design.

read the original abstract

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the surrounding harness that makes these modules reliable in practice. This paper reviews that shift through the lens of externalization. Drawing on the idea of cognitive artifacts, we argue that agent infrastructure matters not merely because it adds auxiliary components, but because it transforms hard cognitive burdens into forms that the model can solve more reliably. Under this view, memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering serves as the unification layer that coordinates them into governed execution. We trace a historical progression from weights to context to harness, analyze memory, skills, and protocols as three distinct but coupled forms of externalization, and examine how they interact inside a larger agent system. We further discuss the trade-off between parametric and externalized capability, identify emerging directions such as self-evolving harnesses and shared agent infrastructure, and discuss open challenges in evaluation, governance, and the long-term co-evolution of models and external infrastructure. The result is a systems-level framework for explaining why practical agent progress increasingly depends not only on stronger models, but on better external cognitive infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reviews the development of LLM agents as a process of externalizing cognitive burdens from model weights into runtime infrastructure. It frames this shift using cognitive artifacts: memory externalizes state across time, skills externalize procedural knowledge, protocols externalize interaction structure, and harness engineering coordinates these into reliable execution. The manuscript traces a historical progression from weights to context to harness, analyzes the three externalization forms and their interactions, discusses parametric vs. externalized capability trade-offs, and identifies open challenges in evaluation, governance, and co-evolution of models with infrastructure.

Significance. If the externalization lens holds, the paper supplies a coherent systems-level framework that organizes disparate agent-engineering practices and explains why infrastructure improvements often yield more reliable gains than scale alone. It synthesizes trends across memory, skills, and protocols literature into a single narrative, which could help researchers and practitioners prioritize harness design and shared infrastructure. The review format itself is a strength, as it avoids new empirical claims while highlighting falsifiable directions such as self-evolving harnesses.

minor comments (3)
  1. The historical progression section would benefit from an explicit timeline or table summarizing key milestones (e.g., early ReAct-style agents vs. later tool-use harnesses) to make the weights-to-context-to-harness arc easier to follow.
  2. In the trade-off discussion, clarify whether the parametric/externalized distinction is treated as a strict dichotomy or a continuum; the current phrasing risks implying zero-sum dynamics without addressing hybrid approaches that retain some parametric capability.
  3. The open challenges subsection on evaluation could reference specific existing benchmarks (e.g., AgentBench or WebArena) and note how the externalization view would change what those benchmarks measure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of the manuscript, which correctly identifies the externalization lens as the central organizing principle. The recommendation for minor revision is noted, and we will incorporate any editorial or presentational improvements in the revised version.

Circularity Check

0 steps flagged

No significant circularity: conceptual review without derivations or fitted predictions

full rationale

The paper is a literature review that organizes existing trends in LLM agent design under an interpretive framework of externalization (memory for state, skills for procedures, protocols for interaction, harness for coordination). It draws on the idea of cognitive artifacts but presents no new equations, quantitative predictions, parameter fits, or first-principles derivations. The central claim—that external infrastructure transforms cognitive burdens into more solvable forms—is interpretive and cites prior work without reducing any result to a self-referential definition or fitted input renamed as prediction. No load-bearing self-citation chains, uniqueness theorems, or ansatzes are invoked in a way that creates circularity. The acknowledged trade-offs further keep the framing non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the established concept of cognitive artifacts from prior literature and standard assumptions in AI systems research without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption: cognitive artifacts from prior literature can be productively applied to frame externalization in LLM agents.
    The abstract explicitly draws on this idea to argue that infrastructure transforms cognitive burdens.

pith-pipeline@v0.9.0 · 5619 in / 1087 out tokens · 65152 ms · 2026-05-10T17:35:29.425409+00:00 · methodology

discussion (0)


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  2. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  3. Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference a...

  4. CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

    cs.LG 2026-05 unverdicted novelty 6.0

    CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...

  5. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  6. MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation

    cs.SD 2026-04 unverdicted novelty 6.0

    MeloTune implements learned per-listener Personal Arousal Functions and mesh memory protocols on mobile devices to predict affective trajectories and enable peer-coupled proactive music selection, reporting 96.6% patt...

  7. Harness Engineering as Categorical Architecture

    cs.PL 2026-05 unverdicted novelty 5.0

    Categorical Architecture triple (G, Know, Phi) supplies the formal theory for composing LLM agent harnesses with structurally preserved certificates.

  8. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

    cs.AI 2026-05 unverdicted novelty 4.0

    Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

  9. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  10. Memory as Metabolism: A Design for Companion Knowledge Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...

Reference graph

Works this paper leans on

200 extracted references · 129 canonical work pages · cited by 10 Pith papers · 52 internal anchors
