PMCTS is a new parallel MCTS variant that preserves formal policy improvement guarantees and scales with parallel compute, outperforming heuristic baselines in tested domains.
hub
A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Inverse-RPO derives two variance-aware prior-based UCT policies from UCB-V that outperform PUCT on benchmarks with no extra cost.
ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.
A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, including direct electronics programming to reduce latency.
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
Coarse-to-fine 1D token sequences in autoregressive models enable stronger test-time search and even training-free text-to-image generation guided by verifiers, outperforming traditional 2D grid tokenization.
NashPG is a policy-gradient method with iteratively refined regularization that guarantees monotonic convergence to Nash equilibria in two-player zero-sum extensive-form games and scales to large benchmarks.
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.
An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
Advocates developing high-quality open-source scheduling software and linking observation planning with data analysis for future astronomical surveys.
citing papers explorer
-
PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling
PMCTS is a new parallel MCTS variant that preserves formal policy improvement guarantees and scales with parallel compute, outperforming heuristic baselines in tested domains.
-
Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search
Inverse-RPO derives two variance-aware prior-based UCT policies from UCB-V that outperform PUCT on benchmarks with no extra cost.
-
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.
-
Towards Real-time Control of a CartPole System on a Quantum Computer
A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, including direct electronics programming to reduce latency.
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
-
(1D) Ordered Tokens Enable Efficient Test-Time Search
Coarse-to-fine 1D token sequences in autoregressive models enable stronger test-time search and even training-free text-to-image generation guided by verifiers, outperforming traditional 2D grid tokenization.
-
NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria
NashPG is a policy-gradient method with iteratively refined regularization that guarantees monotonic convergence to Nash equilibria in two-player zero-sum extensive-form games and scales to large benchmarks.
-
Aristotle: IMO-level Automated Theorem Proving
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
-
GIFT: Global stabilisation via Intrinsic Fine Tuning
GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.
-
What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
-
Scheduling Discovery in the 2020s
Advocates developing high-quality open-source scheduling software and linking observation planning with data analysis for future astronomical surveys.