A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
Title resolution pending
22 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.
An algorithm for online resource allocation with budget and general constraints achieves O(sqrt(T)) regret in stochastic and alpha-regret in adversarial regimes with bounded constraint violations.
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.
AISAC extends actor-critic methods by actively optimizing the data-collection policy via importance sampling and cross-entropy minimization to lower gradient variance and improve learning in continuous action spaces.
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is minimax optimal up to logs.
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
POOL is a new RL algorithm that adds privacy protection in continuous spaces with one-sided feedback and achieves sample complexity matching known non-private lower bounds.
UNR-Explainer applies MCTS to find subgraphs that change k-NN relations in unsupervised node embeddings, claiming superior performance on GraphSAGE and DGI across datasets.
citing papers explorer
-
Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
HPO enables unbiased policy optimization in hybrid action spaces by mixing differentiable simulation gradients with score-function estimates, outperforming PPO as continuous dimensions increase.
-
Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning
ATD(λ) adapts TD(λ) in MARL via a density ratio estimator on past/current replay buffers to assign λ per state-action pair, yielding competitive or better results than fixed-λ QMIX and MAPPO on SMAC and Gfootball.
-
Online Resource Allocation With General Constraints
An algorithm for online resource allocation with budget and general constraints achieves O(sqrt(T)) regret in stochastic and alpha-regret in adversarial regimes with bounded constraint violations.
-
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.
-
Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets
Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.
-
Actor-Critic with Active Importance Sampling
AISAC extends actor-critic methods by actively optimizing the data-collection policy via importance sampling and cross-entropy minimization to lower gradient variance and improve learning in continuous action spaces.
-
Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is minimax optimal up to logs.
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
-
Holder Policy Optimisation
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
AdamO: A Collapse-Suppressed Optimizer for Offline RL
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
Privacy Preserving Reinforcement Learning with One-Sided Feedback
POOL is a new RL algorithm that adds privacy protection in continuous spaces with one-sided feedback and achieves sample complexity matching known non-private lower bounds.
-
UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models
UNR-Explainer applies MCTS to find subgraphs that change k-NN relations in unsupervised node embeddings, claiming superior performance on GraphSAGE and DGI across datasets.
- Evaluating causal indirect effects when mediators are left-censored by assay limit of quantification
- Learning to Theorize the World from Observation
- To Use AI as Dice of Possibilities with Timing Computation