Corrected heavy-ball Q-learning with convergence and acceleration guarantees is derived via switched linear system and joint spectral radius analysis, and is extended to linear function approximation.
Mixed citations
Wiley Series in Probability and Statistics, Wiley (1994)
Mixed citation behavior. Most common role is background (60%).
representative citing papers
Feasible value functions in POMDPs under memoryless policies form a semi-algebraic set defined by polynomial inequalities from the model parameters.
Finite-horizon optimal e-value designs for adaptive single-arm binary trials are constructed via dynamic programming and shown to have competitive operating characteristics with automatic futility indication.
A new efficient algorithm computes optimal conditional reachability probabilities in MDPs without creating hard cyclic reductions, achieving linear time on acyclic cases and substantial speedups on benchmarks from Bayesian networks, probabilistic programs, and runtime monitoring.
The optimal value and policy computation problem for finite-horizon objectives in multi-environment POMDPs is PSPACE-complete, and a new algorithm solves it more efficiently than previous methods on classical benchmarks.
A life-cycle optimization framework for deteriorating infrastructure under hazards is formulated as an MDP with a Kronecker-factored tensor method that reduces computational complexity from exponential to linear while preserving exact dynamic programming solutions.
UMB is a new binary file format for probabilistic systems that provides a unified, efficient alternative to tool-specific textual representations.
A new cooperative localization algorithm based on overlapping covariance intersection is fully distributed, provably recursively consistent, and scalable to ultra large-scale multi-agent systems without performance loss from ignored cross-correlations.
A POMDP decomposition method scales solving of the Sensor Selection Problem and Positional Observability Problem by 3 and 5 orders of magnitude in instance size and runtime.
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Resampling clinical time series into uniform bins for offline RL reduces performance by up to 60% and causes retrospective evaluations to overestimate returns by 1.5-3x versus unprocessed data.
AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.
Introduces a constrained max-min MORL algorithm with convergence analysis, validated in tabular settings and three simulated control domains.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
Presents a simple discrete primer on hierarchical causality that requires causation classes, aggregation operators, and discrete event-time maps to connect actor and agent levels.
A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.
citing papers explorer
- The hidden risks of temporal resampling in clinical reinforcement learning
  Resampling clinical time series into uniform bins for offline RL reduces performance by up to 60% and causes retrospective evaluations to overestimate returns by 1.5-3x versus unprocessed data.
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
  AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.