HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

· 2025 · cs.AI · arXiv 2508.12782

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Large language models (LLMs) perform well on step-by-step reasoning benchmarks such as mathematics and code generation, yet their ability to carry out robust long-horizon planning under realistic constraints remains insufficiently evaluated. Existing planning benchmarks often rely on abstract domains or interactive feedback, obscuring end-to-end planning failures and feasibility errors. We introduce HeroBench, a benchmark for evaluating long-horizon, hierarchical planning and structured reasoning in a complex RPG-inspired virtual world. Tasks require models to select numerically feasible equipment, reason over multi-level crafting and resource dependencies, and execute hundreds to thousands of actions as a single end-to-end plan. HeroBench integrates symbolic planning, numeric combat simulation, spatial reasoning, and resource management, while supporting scalable difficulty and adversarial distractors. HeroBench evaluates executable plans through simulation, enabling both success-based and fine-grained progress metrics, as well as detailed failure mode analysis. An evaluation of 25 state-of-the-art LLMs reveals large performance disparities rarely observed in conventional reasoning benchmarks. While reasoning models perform substantially better, no model reliably solves the hardest tasks, highlighting persistent challenges in long-horizon autonomous planning.

representative citing papers

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

cs.CL · 2026-02-10 · conditional · novelty 6.0

EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
