hub Canonical reference

Exploration by Random Network Distillation

Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov · 2018 · cs.LG · arXiv 1810.12894

Canonical reference. 75% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 method 1 other 1

citation-polarity summary

background 6 unclear 1 use method 1

representative citing papers

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.

Detecting Metastable Basins in High Dimensions via Marginal Trajectory Distribution Discrimination

stat.ML · 2026-05-22 · unverdicted · novelty 7.0

A neural algorithm identifies metastable basins in high-dimensional Markov processes by iteratively merging candidate representatives based on estimated classification risk between their marginal trajectory distributions.

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

cs.AI · 2026-05-16 · unverdicted · novelty 7.0

Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.

Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

cs.LG · 2025-09-29 · unverdicted · novelty 7.0

LPM uses a dual-network design to compute intrinsic rewards from the change in prediction error across iterations, providing a noise-robust signal that is theoretically linked to information gain.

An Information-Geometric Approach to Artificial Curiosity

cs.LG · 2025-04-08 · unverdicted · novelty 7.0

Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.

Solving Rubik's Cube with a Robot Hand

cs.LG · 2019-10-16 · accept · novelty 7.0

Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.

Learning to Theorize the World from Observation

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.

Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

cs.MA · 2026-05-03 · unverdicted · novelty 7.0

A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.

Dota 2 with Large Scale Deep Reinforcement Learning

cs.LG · 2019-12-13 · accept · novelty 7.0

OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty

cs.AI · 2026-06-23 · unverdicted · novelty 6.0

Heuresis evaluates six search strategies for autonomous ML research agents and finds that novel ideas are rare, none rated original, and only one reaches top-10 quality while strategies steer axes but do not expand the quality-novelty frontier.

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

cs.AI · 2026-06-07 · unverdicted · novelty 6.0

ISPO densifies GRPO rewards with sequence-level informativeness and token-level directional signals from policy probabilities to reduce zero-advantage collapse and hallucinated certainty on math benchmarks.

On Advantage Estimates for Max@K Policy Gradients

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.

Goal-Conditioned Agents that Learn Everything All at Once

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.

SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

cs.RO · 2025-06-17 · unverdicted · novelty 6.0

SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

RoboNet: Large-Scale Multi-Robot Learning

cs.RO · 2019-10-24 · conditional · novelty 6.0

RoboNet is a multi-robot video dataset that enables pre-training of vision-based manipulation models which, after fine-tuning on a new robot, outperform robot-specific training that uses 4-20 times more data.

Benchmarking Batch Deep Reinforcement Learning Algorithms

cs.LG · 2019-10-03 · unverdicted · novelty 6.0

Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.

Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

QOED selects identifiable parameter directions via Fisher matrix eigenspace analysis and modifies exploration objectives to approximate ideal information gain under bounded nuisance assumptions, yielding 21-35% performance gains in robotic tasks.

Shaping Zero-Shot Coordination via State Blocking

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

citing papers explorer

Showing 40 of 40 citing papers.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 56 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models cs.LG · 2026-06-09 · unverdicted · none · ref 45 · internal anchor
VLM-Safe-RL adds frozen VLM signals as anticipatory costs to the CMDP Lagrangian update via dual-path CLIP, VLM-Lagrange, and confidence gating, outperforming baselines on Safety-Gymnasium FormulaOne while showing partial generalization.
OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation cs.LG · 2026-06-04 · unverdicted · none · ref 14 · internal anchor
OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
Detecting Metastable Basins in High Dimensions via Marginal Trajectory Distribution Discrimination stat.ML · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
A neural algorithm identifies metastable basins in high-dimensional Markov processes by iteratively merging candidate representatives based on estimated classification risk between their marginal trajectory distributions.
Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models cs.AI · 2026-05-16 · unverdicted · none · ref 2 · internal anchor
Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.
Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring cs.LG · 2025-09-29 · unverdicted · none · ref 4 · internal anchor
LPM uses a dual-network design to compute intrinsic rewards from the change in prediction error across iterations, providing a noise-robust signal that is theoretically linked to information gain.
An Information-Geometric Approach to Artificial Curiosity cs.LG · 2025-04-08 · unverdicted · none · ref 2 · internal anchor
Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.
Solving Rubik's Cube with a Robot Hand cs.LG · 2019-10-16 · accept · none · ref 12 · internal anchor
Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.
Learning to Theorize the World from Observation cs.LG · 2026-05-05 · unverdicted · none · ref 60
NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.
Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning cs.MA · 2026-05-03 · unverdicted · none · ref 13
A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.
Dota 2 with Large Scale Deep Reinforcement Learning cs.LG · 2019-12-13 · accept · none · ref 44
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Heuresis: Search Strategies for Autonomous AI Research Agents Across Quality, Diversity and Novelty cs.AI · 2026-06-23 · unverdicted · none · ref 7 · internal anchor
Heuresis evaluates six search strategies for autonomous ML research agents and finds that novel ideas are rare, none rated original, and only one reaches top-10 quality while strategies steer axes but do not expand the quality-novelty frontier.
Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization cs.AI · 2026-06-07 · unverdicted · none · ref 37 · internal anchor
ISPO densifies GRPO rewards with sequence-level informativeness and token-level directional signals from policy probabilities to reduce zero-advantage collapse and hallucinated certainty on math benchmarks.
On Advantage Estimates for Max@K Policy Gradients cs.LG · 2026-06-04 · unverdicted · none · ref 5 · internal anchor
Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
Goal-Conditioned Agents that Learn Everything All at Once cs.LG · 2026-05-22 · unverdicted · none · ref 82 · internal anchor
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents cs.CL · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning cs.RO · 2025-06-17 · unverdicted · none · ref 33 · internal anchor
SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 38 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
RoboNet: Large-Scale Multi-Robot Learning cs.RO · 2019-10-24 · conditional · none · ref 56 · internal anchor
RoboNet is a multi-robot video dataset that enables pre-training of vision-based manipulation models which, after fine-tuning on a new robot, outperform robot-specific training that uses 4-20 times more data.
Benchmarking Batch Deep Reinforcement Learning Algorithms cs.LG · 2019-10-03 · unverdicted · none · ref 2 · internal anchor
Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.
Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration cs.RO · 2026-05-12 · unverdicted · none · ref 23
QOED selects identifiable parameter directions via Fisher matrix eigenspace analysis and modifies exploration objectives to approximate ideal information gain under bounded nuisance assumptions, yielding 21-35% performance gains in robotic tasks.
Shaping Zero-Shot Coordination via State Blocking cs.LG · 2026-05-12 · unverdicted · none · ref 37
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 19
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs cs.LG · 2026-05-02 · unverdicted · none · ref 60
An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields cs.AI · 2026-04-28 · unverdicted · none · ref 9
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
Dual-Timescale Memory in a Spiking Neuron-Astrocyte Network for Efficient Navigation q-bio.QM · 2026-04-16 · unverdicted · none · ref 51
A neuron-astrocyte network with dual-timescale memory reduces median path lengths up to sixfold in partially observable grid-world navigation tasks.
Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms cs.RO · 2026-04-15 · unverdicted · none · ref 25
A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.
The FIL Hypothesis: Inductive Biases Help with Kernel Engineering cs.AI · 2026-06-29 · unverdicted · none · ref 8 · internal anchor
The FIL Hypothesis claims that inductive biases outperform purely data-driven methods on GPU programming tasks with non-trivial feedback loops.
Signed Compression Progress on a Sealed Audit is Goodhart-Resistant cs.LG · 2026-06-09 · unverdicted · partial · ref 3 · internal anchor
Signed compression progress on a sealed audit as intrinsic reward equals true audit improvement plus at most 2 Delta_n deviation, making it Goodhart-resistant.
Variational Proximal Policy Optimization stat.ML · 2026-06-06 · unverdicted · none · ref 199 · internal anchor
VP2O maps PPO to SVGD in a MoE architecture using functional kernels and expert orthogonalization, claiming +179 ELO on Codeforces and 32% token reduction on AIME for a 33B/4B model.
Reinforcement Learning from Cross-domain Videos with Video Prediction Model cs.CV · 2026-06-02 · unverdicted · none · ref 2 · internal anchor
XIPER creates a reward signal for cross-domain video imitation learning by training a video prediction model that maps agent views to the expert domain and scoring prediction likelihood.
Randomized Least Squares Value Iteration itself is Joint Differentially Private cs.LG · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
RLSVI is (ε(δ),δ)-joint differentially private in tabular episodic MDPs with ε(δ) = 2AK/(H² log(2HSA)) + 2√(2AK log(1/δ)/(H² log(2HSA))).
Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion cs.RO · 2026-05-24 · unverdicted · none · ref 2 · internal anchor
Targeted changes to policy initialization, critic targets, and return estimation let SAC match PPO performance across legged locomotion tasks in massively parallel simulation.
When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited cs.LG · 2026-05-16 · unverdicted · none · ref 15 · internal anchor
Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.
OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence cs.RO · 2026-05-12 · unverdicted · none · ref 6 · internal anchor
OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.
Test-Time Alignment via Hypothesis Reweighting cs.LG · 2024-12-11 · unverdicted · none · ref 8 · internal anchor
HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.
Improving Zero-Shot Offline RL via Behavioral Task Sampling cs.AI · 2026-04-28 · unverdicted · none · ref 3
Extracting task vectors from offline data to define training task distributions improves zero-shot offline RL performance by an average of 20%.
TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution cs.AI · 2026-06-07 · unverdicted · none · ref 42 · internal anchor
TT-DAC-PS, an enhanced version of TD3, achieves lower mean implementation shortfall than PPO, SAC, A2C, TWAP, VWAP, and AC on LOB data from ten U.S. stocks.
Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning cs.LG · 2026-04-16 · unreviewed · ref 2

Exploration by Random Network Distillation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer