StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Bo An; Changjiu Jiang; Jiageng Li; Mingcong Lei; Weihao Tan; Xinrun Wang; Yitian Hong; Yu Duan

arxiv: 2507.07445 · v3 · pith:WM2XUGJ5new · submitted 2025-07-10 · 💻 cs.AI

StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

Weihao Tan , Changjiu Jiang , Yu Duan , Mingcong Lei , Jiageng Li , Yitian Hong , Xinrun Wang , Bo An This is my paper

classification 💻 cs.AI

keywords agentsstardojobenchmarkinteractionsmultimodalopen-endedproduction-livingsocial

0 comments

read the original abstract

Autonomous agents navigating human society must master both production activities and social interactions, yet existing benchmarks rarely evaluate these skills simultaneously. To bridge this gap, we introduce StarDojo, a novel benchmark based on Stardew Valley, designed to assess AI agents in open-ended production-living simulations. In StarDojo, agents are tasked to perform essential livelihood activities such as farming and crafting, while simultaneously engaging in social interactions to establish relationships within a vibrant community. StarDojo features 1,000 meticulously curated tasks across five key domains: farming, crafting, exploration, combat, and social interactions. Additionally, we provide a compact subset of 100 representative tasks for efficient model evaluation. The benchmark offers a unified, user-friendly interface that eliminates the need for keyboard and mouse control, supports all major operating systems, and enables the parallel execution of multiple environment instances, making it particularly well-suited for evaluating the most capable foundation agents, powered by multimodal large language models (MLLMs). Extensive evaluations of state-of-the-art MLLMs agents demonstrate substantial limitations, with the best-performing model, GPT-4.1, achieving only a 12.7% success rate, primarily due to challenges in visual understanding, multimodal reasoning and low-level manipulation. As a user-friendly environment and benchmark, StarDojo aims to facilitate further research towards robust, open-ended agents in complex production-living environments.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
cs.CV 2026-05 unverdicted novelty 6.0

SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 unverdicted novelty 6.0

MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 conditional novelty 6.0

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.