hub Canonical reference

Dota 2 with Large Scale Deep Reinforcement Learning

Canonical reference. 93% of citing Pith papers cite this work as background.

78 Pith papers citing it
Background 93% of classified citations
abstract

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

citation-role summary

background 13 · other 1

representative citing papers

In Defense of Information Leakage in Concept-based Models

cs.LG · 2026-06-09 · conditional · novelty 7.0

Concept-based models can use controlled 'benign' information leakage to remain accurate and intervenable under real-world concept incompleteness by reframing their training objective.

An Information-Geometric Approach to Artificial Curiosity

cs.LG · 2025-04-08 · unverdicted · novelty 7.0

Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.

Voyager: An Open-Ended Embodied Agent with Large Language Models

cs.AI · 2023-05-25 · unverdicted · novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

TSP reframes secure code generation as a tree-structured self-play process that supplies dense on-policy signals at vulnerability-prone nodes, yielding higher security pass rates and cross-language generalization than SFT or unstructured self-play.

citing papers explorer

Showing 6 of 6 citing papers after filters.

  • In Defense of Information Leakage in Concept-based Models cs.LG · 2026-06-09 · conditional · none · ref 123 · internal anchor

    Concept-based models can use controlled 'benign' information leakage to remain accurate and intervenable under real-world concept incompleteness by reframing their training objective.

  • Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO cs.LG · 2026-04-04 · conditional · none · ref 1 · internal anchor

    PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalization to 77.1%.

  • SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data cs.LG · 2026-05-07 · conditional · none · ref 2 · 2 links · internal anchor

    SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.

  • AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning cs.LG · 2025-05-30 · conditional · none · ref 2 · internal anchor

    AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.

  • Proximal Policy Distillation cs.LG · 2024-07-21 · conditional · none · ref 4 · internal anchor

    PPD integrates PPO into policy distillation so the student collects and uses its own rewards, yielding better sample efficiency and robustness than standard student-distill or teacher-distill on Atari, MuJoCo, and Procgen tasks.

  • TD-MPC2: Scalable, Robust World Models for Continuous Control cs.LG · 2023-10-25 · conditional · none · ref 133 · internal anchor

    TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.