Dota 2 with Large Scale Deep Reinforcement Learning

· 2019 · cs.LG · arXiv 1912.06680

Canonical reference. 93% of citing Pith papers cite this work as background.

78 Pith papers citing it

Background 93% of classified citations

abstract

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

citation-role summary

background: 13 · other: 1

citation-polarity summary

background: 13 · unclear: 1

representative citing papers

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Generative Language Modeling for Automated Theorem Proving

cs.LG · 2020-09-07 · unverdicted · novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.

GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

GPTNT benchmark demonstrates that state-of-the-art multimodal models cannot perform real-time collaborative bomb defusal in Keep Talking and Nobody Explodes, unlike human players.

In Defense of Information Leakage in Concept-based Models

cs.LG · 2026-06-09 · conditional · novelty 7.0

Concept-based models can use controlled 'benign' information leakage to remain accurate and intervenable under real-world concept incompleteness by reframing their training objective.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Sample-efficient inductive matrix completion with noise and inexact side-information

stat.ML · 2026-05-16 · unverdicted · novelty 7.0

A projected gradient descent algorithm for noisy inductive matrix completion achieves linear convergence and stable recovery at sample complexity governed by side-information dimension, extending to inexact side-information with optimal error degradation.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Controllability in preference-conditioned multi-objective reinforcement learning

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

cs.LG · 2026-05-07 · accept · novelty 7.0

Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control

cs.MA · 2026-04-15 · unverdicted · novelty 7.0

InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entropy, cross-entropy, and predictive scores.

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

cs.LG · 2026-04-04 · conditional · novelty 7.0

PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalization to 77.1%.

NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning

cs.LG · 2026-03-07 · unverdicted · novelty 7.0

NePPO learns a player-independent potential function via a novel objective whose minimization yields an approximate Nash equilibrium for general-sum multi-agent games.

An Information-Geometric Approach to Artificial Curiosity

cs.LG · 2025-04-08 · unverdicted · novelty 7.0

Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.

Voyager: An Open-Ended Embodied Agent with Large Language Models

cs.AI · 2023-05-25 · unverdicted · novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills.

Computer Vision for MOBA Analytics: A Dataset and Baseline for Visibility Analysis in Dota 2

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

Introduces the Dota2-Vis dataset of 288 videos from 144 TI 2025 matches plus 2,477 annotated minimaps and evaluates YOLO11 variants for player-icon detection to produce visibility curves.

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.

Superhuman AI for Generals.io Using Self-Play Reinforcement Learning

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Self-play RL with a vision transformer policy, powered by a 10,000x faster JAX simulator, produces an agent that ranks #1 on the Generals.io leaderboard and wins 199-70 against top humans.

Asymmetric physics enables efficient learning in quadrupedal robot swarms

cs.RO · 2026-06-22 · unverdicted · novelty 6.0

Asymmetric physics (high-fidelity non-diff simulator plus differentiable surrogates) enables end-to-end training of decentralized vision-based policies for up to 512 quadrupeds that transfer zero-shot to real hardware.

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

TSP reframes secure code generation as a tree-structured self-play process that supplies dense on-policy signals at vulnerability-prone nodes, yielding higher security pass rates and cross-language generalization than SFT or unstructured self-play.

Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure

cs.MA · 2026-05-26 · unverdicted · novelty 6.0

Adversarial co-evolution of LLM constitutions in public goods games reaches near-parity equilibrium only when fitness is coupled across factions and evaluation uses at least five seeds per generation.

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

pcsp is a shared RL policy using LLM persona embeddings, low-rank projection, and PPO+InfoNCE+KL training that delivers 17x above-chance zero-shot persona identification and 22x faster inference on a 300-persona benchmark.

citing papers explorer

Showing 23 of 23 citing papers after filters.

In Defense of Information Leakage in Concept-based Models
cs.LG · 2026-06-09 · conditional · none · ref 123 · internal anchor
Concept-based models can use controlled 'benign' information leakage to remain accurate and intervenable under real-world concept incompleteness by reframing their training objective.

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
cs.LG · 2026-05-18 · unverdicted · none · ref 133 · internal anchor
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Controllability in preference-conditioned multi-objective reinforcement learning
cs.LG · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
cs.LG · 2026-05-07 · accept · none · ref 265 · internal anchor
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO
cs.LG · 2026-04-04 · conditional · none · ref 1 · internal anchor
PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalization to 77.1%.

NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning
cs.LG · 2026-03-07 · unverdicted · none · ref 2 · internal anchor
NePPO learns a player-independent potential function via a novel objective whose minimization yields an approximate Nash equilibrium for general-sum multi-agent games.

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games
cs.LG · 2026-06-22 · unverdicted · none · ref 1 · internal anchor
EMAgnet replaces uniform-magnet regularization in PPO self-play with an EMA of last-iterate policy parameters and reports lower exploitability on most tested zero-sum benchmarks, especially those with dominated strategies.

Superhuman AI for Generals.io Using Self-Play Reinforcement Learning
cs.LG · 2026-06-22 · unverdicted · none · ref 7 · internal anchor
Self-play RL with a vision transformer policy, powered by a 10,000x faster JAX simulator, produces an agent that ranks #1 on the Generals.io leaderboard and wins 199-70 against top humans.

GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
cs.LG · 2026-05-19 · unverdicted · none · ref 33 · internal anchor
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG · 2026-05-07 · conditional · none · ref 2 · 2 links · internal anchor
SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG · 2026-05-03 · unverdicted · none · ref 22 · internal anchor
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG · 2026-05-01 · unverdicted · none · ref 20 · internal anchor
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
cs.LG · 2026-04-28 · unverdicted · none · ref 7 · internal anchor
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
cs.LG · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.

Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter
cs.LG · 2026-06-26 · unverdicted · none · ref 55 · internal anchor
CA-NKCF is a hybrid neural-Kalman consensus filter for distributed state estimation that operates without noise covariance knowledge and shows robustness to model misspecification in linear, chaotic, and wireless scenarios.

Uncertainty-aware reinforcement learning for chemical language models
cs.LG · 2026-06-23 · unverdicted · none · ref 12 · internal anchor
Uncertainty-aware RL for chemical language models raises true hit rate from 0.5 to 0.75 by favoring low-uncertainty regions during optimization.

Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning
cs.LG · 2026-06-18 · unverdicted · none · ref 45 · internal anchor
Extends DAE theory to POMDPs with minimal changes and introduces discrete latent dynamics to cut computational cost, with ALE experiments showing scalability and retained sample efficiency.

An Agency-Transferring Model-Free Policy Enhancement Technique
cs.LG · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
A model-free RL method arbitrates between a functional baseline policy and a learning policy, transferring agency over time to yield a standalone policy with high goal-reaching rates and competitive returns on continuous-control tasks.

Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
cs.LG · 2026-06-02 · unverdicted · none · ref 52 · internal anchor
GTR introduces a bounded non-monotonic Gaussian trust region and Mixture Gaussian Anchor to enable effective behavior transitions in non-stationary RL where standard PPO fails.

Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
cs.LG · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark environments.

Position: Deployed Reinforcement Learning should be Continual
cs.LG · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
Deployed RL agents receiving evaluative rewards face inherent non-stationarity and should engage in continual learning rather than following a train-then-fix approach.

ARROW: Augmented Replay for RObust World models
cs.LG · 2026-03-12 · unreviewed · ref 24 · internal anchor
