pith. machine review for the scientific record.

arxiv: 2506.03610 · v3 · submitted 2025-06-04 · 💻 cs.AI

Recognition: unknown

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Authors on Pith: no claims yet
classification 💻 cs.AI
keywords: agents, game, orak, agentic, datasets, fine-tuning, genres, studies
Abstract

Large Language Model (LLM) agents are reshaping the game industry by enabling more intelligent and human-preferable characters. Yet current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a unified evaluation framework, including game leaderboards, LLM battle arenas, and ablation studies of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton-ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.
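The abstract's "plug-and-play interface" idea can be illustrated with a minimal sketch. Everything below is hypothetical: the actual Orak interface lives in the linked GitHub repository and is built on MCP, whereas the `GameEnv`, `Observation`, and `run_episode` names here are invented solely to show the pattern of swapping environments and agents behind a shared contract.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Observation:
    """Hypothetical game observation: a text state plus the legal actions."""
    state: str
    legal_actions: list[str]

class GameEnv(Protocol):
    """Minimal plug-and-play environment contract (illustrative only)."""
    def reset(self) -> Observation: ...
    def step(self, action: str) -> tuple[Observation, float, bool]: ...

class FirstActionAgent:
    """Stand-in for an LLM agent: always picks the first legal action."""
    def act(self, obs: Observation) -> str:
        return obs.legal_actions[0]

def run_episode(env: GameEnv, agent: FirstActionAgent, max_turns: int = 10) -> float:
    """Generic evaluation loop: any env/agent pair matching the contract plugs in."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_turns):
        obs, reward, done = env.step(agent.act(obs))
        total += reward
        if done:
            break
    return total

class CountdownEnv:
    """Toy environment: the episode ends after three steps, one reward point each."""
    def __init__(self) -> None:
        self.turns_left = 3
    def reset(self) -> Observation:
        self.turns_left = 3
        return Observation("start", ["wait"])
    def step(self, action: str) -> tuple[Observation, float, bool]:
        self.turns_left -= 1
        return Observation(f"{self.turns_left} left", ["wait"]), 1.0, self.turns_left == 0
```

Because `run_episode` depends only on the shared contract, the same loop can score different games and agents side by side, which is the property a cross-genre leaderboard needs.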

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  2. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  3. Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

    cs.AI 2026-04 unverdicted novelty 6.0

    STRATAGEM uses a Reasoning Transferability Coefficient and Reasoning Evolution Reward in game self-play to promote domain-agnostic reasoning in language models, yielding gains on math, general reasoning, and code benchmarks.

  4. RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game...

  5. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
