A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al · 2018

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

browse 5 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Query-Conditioned Test-Time Self-Training for Large Language Models

cs.CL · 2026-05-13 · conditional · novelty 7.0 · 2 refs

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

Learning to Discover at Test Time

cs.LG · 2026-01-22 · unverdicted · novelty 7.0

TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

Shaping Zero-Shot Coordination via State Blocking

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 5 refs

LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.

ANO: A Principled Approach to Robust Policy Optimization

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.

citing papers explorer

Showing 5 of 5 citing papers.

Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 27 · 2 links
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
Learning to Discover at Test Time cs.LG · 2026-01-22 · unverdicted · none · ref 64
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
Shaping Zero-Shot Coordination via State Blocking cs.LG · 2026-05-12 · unverdicted · none · ref 8
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning cs.AI · 2026-05-07 · unverdicted · none · ref 28 · 5 links
LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.
ANO: A Principled Approach to Robust Policy Optimization cs.AI · 2026-05-04 · unverdicted · none · ref 28
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer