Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.
super hub Canonical reference
Voyager: An Open-Ended Embodied Agent with Large Language Models
Canonical reference. 94% of citing Pith papers cite this work as background.
abstract
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at https://voyager.minedojo.org/.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox querie
authors
co-cited works
representative citing papers
Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Multi-agent LLM system Agora under Sealed Joint Search conditions produces +1.87 holdout Sharpe on CSI 1000 over a 91-day sealed period, exceeding the best baseline at +1.334 under favorable seed.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.
ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.
CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.
SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level methods collapse due to action-grammar destruction.
Introduces Sentinel Challenge benchmark and CoSaR framework for cooperative spatial reasoning and planning among 3-5 decentralized embodied agents across 14 city-scale scenes.
PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
citing papers explorer
No citing papers match the current filters.