
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

2026 · cs.SE · arXiv 2604.08224

Canonical reference. 100% of citing Pith papers cite this work as background.

17 Pith papers citing it

Background 100% of classified citations


abstract

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the surrounding harness that makes these modules reliable in practice. This paper reviews that shift through the lens of externalization. Drawing on the idea of cognitive artifacts, we argue that agent infrastructure matters not merely because it adds auxiliary components, but because it transforms hard cognitive burdens into forms that the model can solve more reliably. Under this view, memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering serves as the unification layer that coordinates them into governed execution. We trace a historical progression from weights to context to harness, analyze memory, skills, and protocols as three distinct but coupled forms of externalization, and examine how they interact inside a larger agent system. We further discuss the trade-off between parametric and externalized capability, identify emerging directions such as self-evolving harnesses and shared agent infrastructure, and discuss open challenges in evaluation, governance, and the long-term co-evolution of models and external infrastructure. The result is a systems-level framework for explaining why practical agent progress increasingly depends not only on stronger models, but on better external cognitive infrastructure.

citation-role summary

background 8

citation-polarity summary

background 8

representative citing papers

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.

SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

Mem-π: Adaptive Memory through Learning When and What to Generate

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

stat.CO · 2026-05-18 · unverdicted · novelty 6.0

AI4BayesCode generates validated modular stateful MCMC samplers from natural language Bayesian model descriptions via LLM translation, modular blocks, and recursive stateful composition.

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference acceptance rates.

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation

cs.SD · 2026-04-12 · unverdicted · novelty 6.0

MeloTune implements learned per-listener Personal Arousal Functions and mesh memory protocols on mobile devices to predict affective trajectories and enable peer-coupled proactive music selection, reporting 96.6% pattern accuracy in deployment.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

cs.CL · 2026-05-18 · unverdicted · novelty 5.0

SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.

Harness Engineering as Categorical Architecture

cs.PL · 2026-05-12 · unverdicted · novelty 5.0

Categorical Architecture triple (G, Know, Phi) supplies the formal theory for composing LLM agent harnesses with structurally preserved certificates.

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

cs.LG · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

cs.IR · 2026-05-08 · unverdicted · novelty 5.0 · 2 refs

A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

cs.AI · 2026-05-11 · unverdicted · novelty 4.0

Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

Memory as Metabolism: A Design for Companion Knowledge Systems

cs.AI · 2026-04-13 · unverdicted · novelty 4.0

This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to update dominant interpretations in personal LLM wikis.

citing papers explorer

Showing 17 of 17 citing papers.

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction · cs.CV · 2026-05-20 · unverdicted · none · ref 70
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents · cs.CL · 2026-05-16 · unverdicted · none · ref 6
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory · cs.CL · 2026-05-15 · unverdicted · none · ref 1
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory · cs.AI · 2026-05-11 · unverdicted · none · ref 55
Mem-π: Adaptive Memory through Learning When and What to Generate · cs.CL · 2026-05-20 · unverdicted · none · ref 60
AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers · stat.CO · 2026-05-18 · unverdicted · none · ref 72
Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents · cs.CL · 2026-05-11 · unverdicted · none · ref 30
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models · cs.LG · 2026-05-08 · unverdicted · none · ref 27
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence · cs.AI · 2026-04-20 · unverdicted · none · ref 137
MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation · cs.SD · 2026-04-12 · unverdicted · none · ref 21
Code as Agent Harness · cs.CL · 2026-05-18 · accept · none · ref 181
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution · cs.CL · 2026-05-18 · unverdicted · none · ref 78
Harness Engineering as Categorical Architecture · cs.PL · 2026-05-12 · unverdicted · none · ref 9
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning · cs.LG · 2026-05-11 · unverdicted · none · ref 77 · 2 links
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications · cs.IR · 2026-05-08 · unverdicted · none · ref 117 · 2 links
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents · cs.AI · 2026-05-11 · unverdicted · none · ref 51
Memory as Metabolism: A Design for Companion Knowledge Systems · cs.AI · 2026-04-13 · unverdicted · none · ref 53
