super hub Canonical reference

Voyager: An Open-Ended Embodied Agent with Large Language Models

Ajay Mandlekar, Chaowei Xiao, Guanzhi Wang, Yuke Zhu, Yunfan Jiang, Yuqi Xie · 2023 · cs.AI · arXiv 2305.16291

Canonical reference. 94% of citing Pith papers cite this work as background.

255 Pith papers citing it

Background 94% of classified citations

open full Pith review browse 255 citing papers more from Ajay Mandlekar arXiv PDF

abstract

We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at https://voyager.minedojo.org/.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 58 method 2 dataset 1 other 1

citation-polarity summary

background 58 support 1 unclear 1 use dataset 1 use method 1

claims ledger

abstract We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox querie

authors

Ajay Mandlekar Chaowei Xiao Guanzhi Wang Yuke Zhu Yunfan Jiang Yuqi Xie

co-cited works

representative citing papers

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

cs.LG · 2026-05-11 · conditional · novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

SEVerA: Verified Synthesis of Self-Evolving Agents

cs.LG · 2026-03-26 · unverdicted · novelty 8.0

SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

cs.LG · 2026-05-29 · unverdicted · novelty 7.0 · 2 refs

PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

ElasticMem enables LLM agents to learn adaptive latent memory retrieval and elastic budget allocation, improving QA accuracy by 24-26% and ALFWorld success by 27-66% over baselines with lower token cost.

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level methods collapse due to action-grammar destruction.

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Introduces Sentinel Challenge benchmark and CoSaR framework for cooperative spatial reasoning and planning among 3-5 decentralized embodied agents across 14 city-scale scenes.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

EXG: Self-Evolving Agents with Experience Graphs

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.

Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance

nlin.AO · 2026-05-17 · unverdicted · novelty 7.0

LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.

\textsc{MasFACT}: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

cs.AI · 2026-05-16 · unverdicted · novelty 7.0

Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

cs.AI · 2026-05-15 · unverdicted · novelty 7.0 · 2 refs

X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 216 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 89 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 58 · internal anchor
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 158 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap cs.SE · 2024-10-28 · unverdicted · none · ref 103 · internal anchor
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
Agent AI: Surveying the Horizons of Multimodal Interaction cs.AI · 2024-01-07 · unverdicted · none · ref 200 · internal anchor
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
A Survey on the Memory Mechanism of Large Language Model based Agents cs.AI · 2024-04-21 · accept · none · ref 99 · internal anchor
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security cs.HC · 2024-01-10 · unverdicted · none · ref 236 · internal anchor
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Voyager: An Open-Ended Embodied Agent with Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer