Title resolution pending

Reinforcement learning: An introduction , author= · 2018

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.

Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.

Online Resource Allocation With General Constraints

cs.GT · 2026-05-11 · unverdicted · novelty 7.0

An algorithm for online resource allocation with budget and general constraints achieves O(sqrt(T)) regret in stochastic and alpha-regret in adversarial regimes with bounded constraint violations.

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

math.OC · 2026-05-09 · unverdicted · novelty 7.0

Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.

Actor-Critic with Active Importance Sampling

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

AISAC extends actor-critic methods by actively optimizing the data-collection policy via importance sampling and cross-entropy minimization to lower gradient variance and improve learning in continuous action spaces.

Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is minimax optimal up to logs.

Automated Design of Agentic Systems

cs.AI · 2024-08-15 · conditional · novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

Holder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Response Time Enhances Alignment with Heterogeneous Preferences

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

LIMO: Less is More for Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

Privacy Preserving Reinforcement Learning with One-Sided Feedback

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

POOL is a new RL algorithm that adds privacy protection in continuous spaces with one-sided feedback and achieves sample complexity matching known non-private lower bounds.

UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models

cs.LG · 2026-05-17 · unverdicted · novelty 5.0

UNR-Explainer applies MCTS to find subgraphs that change k-NN relations in unsupervised node embeddings, claiming superior performance on GraphSAGE and DGI across datasets.

Evaluating causal indirect effects when mediators are left-censored by assay limit of quantification

stat.ME · 2026-05-20

Learning to Theorize the World from Observation

cs.LG · 2026-05-05

To Use AI as Dice of Possibilities with Timing Computation

cs.AI · 2026-05-01

citing papers explorer

Showing 22 of 22 citing papers.

Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time cs.LG · 2026-05-08 · unverdicted · none · ref 24
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling cs.LG · 2026-05-14 · unverdicted · none · ref 47
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients cs.LG · 2026-05-14 · unverdicted · none · ref 28
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning cs.LG · 2026-05-12 · unverdicted · none · ref 18
ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.
Online Resource Allocation With General Constraints cs.GT · 2026-05-11 · unverdicted · none · ref 23
An algorithm for online resource allocation with budget and general constraints achieves O(sqrt(T)) regret in stochastic and alpha-regret in adversarial regimes with bounded constraint violations.
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability cs.LG · 2026-05-09 · unverdicted · none · ref 54
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets math.OC · 2026-05-09 · unverdicted · none · ref 26
Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.
Actor-Critic with Active Importance Sampling cs.LG · 2026-05-08 · unverdicted · none · ref 1
AISAC extends actor-critic methods by actively optimizing the data-collection policy via importance sampling and cross-entropy minimization to lower gradient variance and improve learning in continuous action spaces.
Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions stat.ML · 2026-05-07 · unverdicted · none · ref 151
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is minimax optimal up to logs.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 48
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
Holder Policy Optimisation cs.LG · 2026-05-12 · unverdicted · none · ref 54 · 2 links
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 68
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 31
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 162
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 27
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 89
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 41
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Privacy Preserving Reinforcement Learning with One-Sided Feedback cs.LG · 2026-05-18 · unverdicted · none · ref 63
POOL is a new RL algorithm that adds privacy protection in continuous spaces with one-sided feedback and achieves sample complexity matching known non-private lower bounds.
UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models cs.LG · 2026-05-17 · unverdicted · none · ref 24
UNR-Explainer applies MCTS to find subgraphs that change k-NN relations in unsupervised node embeddings, claiming superior performance on GraphSAGE and DGI across datasets.
Evaluating causal indirect effects when mediators are left-censored by assay limit of quantification stat.ME · 2026-05-20 · unreviewed · ref 250
Learning to Theorize the World from Observation cs.LG · 2026-05-05 · unreviewed · ref 66
To Use AI as Dice of Possibilities with Timing Computation cs.AI · 2026-05-01 · unreviewed · ref 82

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer