VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.
hub Canonical reference
Dota 2 with Large Scale Deep Reinforcement Learning
Canonical reference. 93% of citing Pith papers cite this work as background.
abstract
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
representative citing papers
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
GPTNT benchmark demonstrates that state-of-the-art multimodal models cannot perform real-time collaborative bomb defusal in Keep Talking and Nobody Explodes, unlike human players.
Concept-based models can use controlled 'benign' information leakage to remain accurate and intervenable under real-world concept incompleteness by reframing their training objective.
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
A projected gradient descent algorithm for noisy inductive matrix completion achieves linear convergence and stable recovery at sample complexity governed by side-information dimension, extending to inexact side-information with optimal error degradation.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entropy, cross-entropy, and predictive scores.
PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalization to 77.1%.
NePPO learns a player-independent potential function via a novel objective whose minimization yields an approximate Nash equilibrium for general-sum multi-agent games.
Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills
Introduces the Dota2-Vis dataset of 288 videos from 144 TI 2025 matches plus 2,477 annotated minimaps and evaluates YOLO11 variants for player-icon detection to produce visibility curves.
EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.
Self-play RL with a vision transformer policy, powered by a 10,000x faster JAX simulator, produces an agent that ranks #1 on the Generals.io leaderboard and wins 199-70 against top humans.
Asymmetric physics (high-fidelity non-diff simulator plus differentiable surrogates) enables end-to-end training of decentralized vision-based policies for up to 512 quadrupeds that transfer zero-shot to real hardware.
TSP reframes secure code generation as a tree-structured self-play process that supplies dense on-policy signals at vulnerability-prone nodes, yielding higher security pass rates and cross-language generalization than SFT or unstructured self-play.
Adversarial co-evolution of LLM constitutions in public goods games reaches near-parity equilibrium only when fitness is coupled across factions and evaluation uses at least five seeds per generation.
pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.
citing papers explorer
- GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes
GPTNT benchmark demonstrates that state-of-the-art multimodal models cannot perform real-time collaborative bomb defusal in Keep Talking and Nobody Explodes, unlike human players.
- Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
- Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
- Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills
- One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.
- CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
- VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
- Two AI Metrics Diverged: Will it Make All the Difference?
Bounded performance metrics always favor convergence of AI capabilities to meek models while unbounded metrics allow frontier models to maintain leads indefinitely, with policy implications for capability concentration.
- An Introduction to Causal Reinforcement Learning
Proposes causal reinforcement learning (CRL) as a framework that decomposes RL environments into structural causal models to unify online, off-policy, and causal learning while defining new tasks including generalized policy learning and counterfactual learning.
- Solipsistic Superintelligence is Unlikely to be Cooperative
Solipsistic superintelligence developed via unilateral optimization is unlikely to cooperate due to endogenous non-stationarity creating an unclosable train-test-deploy gap.
- ASH: Agents that Self-Hone via Embodied Learning
ASH learns long-horizon embodied policies from unlabeled internet video via a self-improvement loop that trains an IDM on its own trajectories and extracts supervision plus key-moment memory from video.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- RAMP: Hybrid DRL for Online Learning of Numeric Action Models
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
- From monoliths to modules: Decomposing transducers for efficient world modelling
A framework for decomposing transducers into sub-transducers on distinct subspaces to enable parallel and interpretable world models.
- BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards
BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.
- Augmenting Game AI with Deep Reinforcement Learning
Proposes a requirements-based framework for RL-augmented game AI, discusses deployment practicalities, and identifies research bottlenecks for industry adoption.
- Plasticity Loss in Deep Reinforcement Learning: A Survey
Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.
- Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent
A survey provides a task-based formalization of meta-learning and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.