BALROG : Benchmarking agentic LLM and VLM reasoning on games

· 2025

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.

Scalable Option Learning in High-Throughput Environments

cs.LG · 2025-08-30 · unverdicted · novelty 6.0

SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.

citing papers explorer

Showing 2 of 2 citing papers.

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V cs.AI · 2026-04-09 · unverdicted · none · ref 30
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
Scalable Option Learning in High-Throughput Environments cs.LG · 2025-08-30 · unverdicted · none · ref 45
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.

BALROG : Benchmarking agentic LLM and VLM reasoning on games

fields

years

verdicts

representative citing papers

citing papers explorer