PG-DPO is a new variational framework that replaces Bellman recursion with a Pontryagin-guided adjoint-MC projection for RL under non-exponential discounting and shows gains on hyperbolic and survival benchmarks.
Title resolution pending
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 12
roles: background 1
polarities: background 1
representative citing papers
citing papers explorer
- Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting
  PG-DPO is a new variational framework that replaces Bellman recursion with a Pontryagin-guided adjoint-MC projection for RL under non-exponential discounting and shows gains on hyperbolic and survival benchmarks.
- Seeking the Unfamiliar but Memorable: Conceptual Creativity as Meta-Learning
  Creativity is defined as meta-learning where a frozen diffusion creator optimizes candidates for rapid improvement by an adapting appraiser such as an autoencoder or CLIP adapter.
- Tight Sample Complexity Bounds for Entropic Best Policy Identification
  New concentration bounds and stopping rule close the exponential gap to match the lower bound for entropic best policy identification.
- Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
  Aggregation distorts parametric behavioral curve peaks by factors of 3-5x via Simpson's paradox and survival bias, shown by individual vs. aggregate comparisons on Goodreads and Amazon datasets with a negative control.
- Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
  Derives contraction-based Q-value extensions for exponential utility and proves almost-sure convergence of two-timescale and one-timescale model-free algorithms in discounted MDPs.
- Discovering Reinforcement Learning Interfaces with Large Language Models
  LIMEN discovers effective RL interfaces by using LLMs to evolve observation and reward programs together from raw state, guided by policy training success, outperforming single-component optimization.
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- AEL: Agent Evolving Learning for Open-Ended Environments
  AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechanisms degrade performance.
- Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
  Infra-Bayesian RL agents are shown via implementation to have lower worst-case regret than classical RL under model misspecification and to solve Newcomb's problem optimally.
- Learning Minimally Rigid Graphs with High Realization Counts
  Reinforcement learning with graph neural networks finds minimally rigid graphs that match known planar realization optima and set new records for spherical realization counts.
- Learning to Cut: Reinforcement Learning for Benders Decomposition
  RLBD trains a neural policy with REINFORCE to select cuts adaptively in Benders decomposition, yielding faster convergence and better generalization than standard BD or SVM-based LearnBD on an EV charging problem.
- Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization
  Develops greedy optimization algorithms for directly learning optimal integer-weighted clinical risk scores, applied to predict post-discharge mortality in a large EHR cohort with a supporting simulation study.