{"total":28,"items":[{"citing_arxiv_id":"2606.30442","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The FIL Hypothesis: Inductive Biases Help with Kernel Engineering","primary_cat":"cs.AI","submitted_at":"2026-06-29T15:16:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The FIL Hypothesis claims that inductive biases outperform purely data-driven methods on GPU programming tasks with non-trivial feedback loops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23551","ref_index":82,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Goal-Conditioned Agents that Learn Everything All at Once","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20061","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents","primary_cat":"cs.CL","submitted_at":"2026-05-19T16:19:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficiency than GRPO on ALFWorld and 
WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19461","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Mode Collapse: Distribution Matching for Diverse Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T07:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17017","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited","primary_cat":"cs.LG","submitted_at":"2026-05-16T14:33:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16725","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models","primary_cat":"cs.AI","submitted_at":"2026-05-16T00:18:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable 
world-model learning under prior misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16395","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:43:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12084","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:07:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QOED selects identifiable parameter directions via Fisher matrix eigenspace analysis and modifies exploration objectives to approximate ideal information gain under bounded nuisance assumptions, yielding 21-35% performance gains in robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"age agents to seek experiences that are either novel or hard to predict. For novelty, classic count-based methods in tabular RL [19] have been extended to continuous spaces using density estimation [20, 21]. Go-Explore [22] further biases exploration by returning to previously discovered states and expanding from them. For prediction-based curiosity, Random Network Distillation (RND) [23] uses the prediction error of a fixed random target network as an intrinsic reward. 
Related methods use forward dynamics prediction [24, 25] or inverse dynamics to emphasize controllable novelty [26]. Physics-based curiosity can also be defined through parameter estimation error [27], but it typically assumes a small and specified set of parameters."},{"citing_arxiv_id":"2605.11688","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Shaping Zero-Shot Coordination via State Blocking","primary_cat":"cs.LG","submitted_at":"2026-05-12T07:46:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[35] Georg Ostrovski, Marc G Bellemare, Aäron Oord, and Rémi Munos. Count-based exploration with neural density models. In International conference on machine learning, pages 2721-2730. PMLR, 2017. [36] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778-2787. PMLR, 2017. [37] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018. [38] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018. 
[39] Eitan Altman. Constrained Markov decision processes."},{"citing_arxiv_id":"2605.03413","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Theorize the World from Observation","primary_cat":"cs.LG","submitted_at":"2026-05-05T06:39:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01865","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning","primary_cat":"cs.MA","submitted_at":"2026-05-03T13:20:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01862","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL","primary_cat":"cs.LG","submitted_at":"2026-05-03T13:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"where in deterministic environments, this reduces to the discounted indicator of whether the trajectory reaches 
goalg. In-Distribution Optimal Q-Value: Q⋆(s, a, g, h) := max τ∈T β:(sh,ah)=(s,a) Qβ(τ, g, h).(18) 15 Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL Optimal Stitched Policy: π⋆ β(a|s, g, h) :=P β(a|s, g, h, Q ⋆(s, a, g, h)).(19) Performance Metric: J(π) :=E s1∼ρ,g∼p(g)[V π 1 (s1, g)],(20) whereV π h (s, g) :=E π[PH t=h r(st, at, g)|s h =s]andr(s, a, g) =1[ϕ(s ′) =g]. B.2. Assumptions Assumption B.1(Deterministic Environment).The transition dynamics P:S × A → S is deterministic, i.e., given (s, a), the next state s′ =P(s, a) is unique. This is standard in goal-conditioned RL theory (Park et al."},{"citing_arxiv_id":"2605.01242","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs","primary_cat":"cs.LG","submitted_at":"2026-05-02T04:46:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26095","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields","primary_cat":"cs.AI","submitted_at":"2026-04-28T20:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward 
hacking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25496","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving Zero-Shot Offline RL via Behavioral Task Sampling","primary_cat":"cs.AI","submitted_at":"2026-04-28T10:56:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Improving Zero-Shot Offline RL via Behavioral Task Sampling 3.1. Problem Setup and Notations We define a Markov Decision Process (MDP) by the tuple M=⟨S,A,P, µ, γ⟩ , where S is the state space, A is the action space, P:S × A × S →[0,1] is the transition probability distribution such that P(s ′|s, a) =P(s t+1 = s′|st =s, a t =a) , µ:S →[0,1] is the initial state distribution s0 ∼µ(s 0), and γ∈[0,1) is the discount factor. We operate in the offline setting, where the agent learns from a fixed datasetD={τ i}N i=1 consisting of trajectories collected by a single or multiple unknown behavior policies. Let ϕ:S →R d denote a bounded learned state embedding. We consider linear reward functions of the form Rz(s) = ϕ(s)⊤z, parameterized by a task vector z∈S d−1 ⊂R d where Sd−1 is the unit d-sphere. Let Π be the set (popula- tion) of all policies π:S →∆(A) , with ∆(X) denoting the set of all probability distributions over a setX. 3.2. Successor Features Given a state embedding ϕ(s)∈R d learned via some cri- teria (e.g., autoencoders or low-rank approximations), SF methods learn the successor features of a family of policies πz for all task vectorsz∈R d:    ψ(s0, a0, z) =E hP t≥0 γtϕ(st) s0, a0, πz i , πz(s) := arg maxa ψ(s, a, z)⊤z (1) The successor features ψπ satisfy the Rd-valued Bellman equation ψπ =ϕ+γP πψπ with Pπ the policy-induced transition matrix. 
Therefore, we can train ψ by minimizing the Bellman residual E(st,at,st+1) ∼D ψ(st, at, z)−ϕ(st)−γ ¯ψ(st+1, πz(st+1), z) 2 (2) where ¯ψ is a non-trainable target version of ψ, as in Deep Q- learning (Mnih et al., 2013). This objective can be improved since we do not use the full vectorψ(s, a, z); only the scalar ψ(s, a, z)⊤z is required for defining the policies. Instead, we can minimize E(st,at,st+1)∼D \u0012 ψ(st, at, z)⊤z−ϕ(s t)⊤z −γ ¯ψ(st+1, πz(st+1), z)⊤z \u00132 . (3) This trains ψ(s, a, z)⊤z as the Q-function Q(s, a, z) corre- sponding to the rewardR z(s) =ϕ(s) ⊤z. Once a test reward Rtest(·) is revealed, we use a"},{"citing_arxiv_id":"2604.15414","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-16T17:06:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"062 0.20±0.04 pinj = 0.3 0.598±0.09 4.84±2.37−0.256±0.076 0.42±0.09−0.597±0.0740.25±0.05 INTR.+p inj = 0.2 0.673±0.09 4.25±2.50−0.175±0.0670.29±0.07−0.717±0.1450.26±0.08 Table 9:TeLAPA Cross-Archive Elite Injection Ablation.We report mean±95% CI across 20 runs. We introduce the Elite Injection mechanism with various elite-injection probabilitiesp inj ∈[0.1,0.2,0.3]following Eq. 59. We use the same hyperparameter configuration for all methods as in the main paper results. Additionally we combine the best pinj value with Episodic Intrinsic Reward (Intr.) to evaluate the impact of both ablation studies used together.pinj = 0 indicates main paper baseline as seen in Tab. 1. 
We interpret elite injection as an archive-shaping intervention to isolate whether externally introduced diversity helps"},{"citing_arxiv_id":"2604.15391","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dual-Timescale Memory in a Spiking Neuron-Astrocyte Network for Efficient Navigation","primary_cat":"q-bio.QM","submitted_at":"2026-04-16T07:13:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A neuron-astrocyte network with dual-timescale memory reduces median path lengths up to sixfold in partially observable grid-world navigation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and scale poorly to multi-step tasks with large state spaces [12, 11]. Modern approaches have substantially advanced exploration capabilities in high-dimensional environments, but this progress has come at significant computational cost. Intrinsic motivation methods, including count-based exploration with pseudo-counts [48, 49] and prediction-based approaches such as ICM [50] and RND [51], enable effective exploration in complex domains like Atari. However, as Ostrovski et al. [49] note, pseudo-count methods require expensive density model updates at every step. Prediction-based methods similarly require training two networks and computing errors continuously, imposing substantial computational burden [52]. 
Hierarchical methods employing temporal abstractions [53, 54] require multi-level policy"},{"citing_arxiv_id":"2604.16509","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms","primary_cat":"cs.RO","submitted_at":"2026-04-15T03:39:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 96% and yielding the most consistent exploration rates across environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25438","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring","primary_cat":"cs.LG","submitted_at":"2025-09-29T19:43:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LPM uses a dual-network design to compute intrinsic rewards from the change in prediction error across iterations, providing a noise-robust signal that is theoretically linked to information gain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.14648","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-06-17T15:42:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation 
with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.06355","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Information-Geometric Approach to Artificial Curiosity","primary_cat":"cs.LG","submitted_at":"2025-04-08T18:04:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Information geometry constrains intrinsic rewards to strictly concave functions of reciprocal occupancy, with geodesic interpolation on the occupancy manifold yielding a scalar-parameter family that includes count-based and max-entropy exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.08812","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Test-Time Alignment via Hypothesis Reweighting","primary_cat":"cs.LG","submitted_at":"2024-12-11T23:02:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.00724","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models","primary_cat":"cs.AI","submitted_at":"2024-08-01T17:16:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis shows 
scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16797","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution","primary_cat":"cs.CL","submitted_at":"2023-09-28T19:01:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1912.06680","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dota 2 with Large Scale Deep Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-12-13T19:56:40+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"variety of efforts have pushed performance on the remaining Atari games [16], reduced the sample complexity, and introduced new challenges by focusing on intrinsic rewards [41-43]. As more computational resources have become available, a body of work has developed addressing the use of distributed systems in training. Larger batch sizes were found to accelerate training of image models [44-46]. 
Proximal Policy Optimization [14] and A3C [47] improve the ability to asynchronously collect rollout data. Recent work has demonstrated the benefit of distributed learning on a wide array of problems including single-player video games [48] and robotics [5]. The motivation for our surgery method is similar to prior work on Net2Net style function preserv-"},{"citing_arxiv_id":"1910.11215","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboNet: Large-Scale Multi-Robot Learning","primary_cat":"cs.RO","submitted_at":"2019-10-24T15:20:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboNet is a multi-robot video dataset that enables pre-training of vision-based manipulation models which, after fine-tuning on a new robot, outperform robot-specific training that uses 4-20 times more data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.07113","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Solving Rubik's Cube with a Robot Hand","primary_cat":"cs.LG","submitted_at":"2019-10-16T00:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reinforcement learning models trained only in simulation using automatic domain randomization solve Rubik's cube with a real robot hand.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sorrentino. Dexterous manipulation through rolling. In Proceedings of the 1995 International Conference on Robotics and Automation, Nagoya, Aichi, Japan, May 21-27, 1995, pages 452-457, 1995. [11] M. Botvinick, S. Ritter, J. X. Wang, Z. Kurth-Nelson, C. Blundell, and D. Hassabis. Reinforcement learning, fast and slow. Trends in cognitive sciences, 2019. [12] Y. Burda, H. Edwards, A. Storkey, and O. Klimov. 
Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018. [13] S. Carter, D. Ha, I. Johnson, and C. Olah. Experiments in handwriting with a neural network. Distill, 2016. [14] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox. Closing the Sim-to-Real"},{"citing_arxiv_id":"1910.01708","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Batch Deep Reinforcement Learning Algorithms","primary_cat":"cs.LG","submitted_at":"2019-10-03T20:15:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}