pith. sign in

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

citation-role summary

background 1

citation-polarity summary

years

2026 5

roles

background 1

polarities

background 1

representative citing papers

Learning to Discover at Test Time

cs.LG · 2026-01-22 · unverdicted · novelty 7.0

TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

Shaping Zero-Shot Coordination via State Blocking

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

ANO: A Principled Approach to Robust Policy Optimization

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.

citing papers explorer

Showing 5 of 5 citing papers.

  • Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 27 · 2 links

    QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

  • Learning to Discover at Test Time cs.LG · 2026-01-22 · unverdicted · none · ref 64

    TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

  • Shaping Zero-Shot Coordination via State Blocking cs.LG · 2026-05-12 · unverdicted · none · ref 8

    SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

  • Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning cs.AI · 2026-05-07 · unverdicted · none · ref 28 · 5 links

    LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.

  • ANO: A Principled Approach to Robust Policy Optimization cs.AI · 2026-05-04 · unverdicted · none · ref 28

    ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.