Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Dongmin Park , Minkyu Kim , Beongjun Choi , Junhyuck Kim , Keon Lee , Jonghyun Lee , Inkyu Park , Byeong-Uk Lee

show 8 more authors

Jaeyoung Hwang Jaewoo Ahn Ameya S. Mahabaleshwarkar Bilal Kartal Pritam Biswas Yoshi Suhara Kangwook Lee Jaewoong Cho

Authors on Pith no claims yet

classification 💻 cs.AI

keywords agentsgameorakagenticdatasetsfine-tuninggenresstudies

0 comments

read the original abstract

Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human-preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a united evaluation framework, including game leaderboards, LLM battle arenas, and \fix{ablation studies} of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton-ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI 2026-04 unverdicted novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
cs.AI 2026-04 unverdicted novelty 6.0

STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.
RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
cs.CL 2026-04 unverdicted novelty 6.0

RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game...
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.