Optimal control of the future via prospective learning with control
Pith reviewed 2026-05-17 23:00 UTC · model grok-4.3
The pith
In non-stationary reset-free environments, empirical risk minimization asymptotically reaches the Bayes optimal control policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Prospective Learning with Control (PLuC), a framework that applies empirical risk minimization to learn control policies in non-stationary, reset-free environments. Under certain fairly general assumptions, we prove that this method asymptotically achieves the Bayes optimal policy. In the specific case of foraging, prospective agents converge orders of magnitude faster than modern reinforcement learning algorithms.
What carries the argument
Prospective Learning with Control (PLuC), which uses supervised learning techniques to optimize policies for future control in changing environments without resets.
If this is right
- ERM asymptotically achieves the Bayes optimal policy in the PLuC framework.
- Prospective foraging agents outperform RL algorithms in non-stationary reset-free settings.
- The method applies to both natural and artificial agents in canonical tasks like foraging.
- Time-aware modifications to RL still converge slower than prospective methods.
Where Pith is reading between the lines
- This framework may allow supervised learning successes to transfer directly to sequential decision making in realistic settings.
- Future work could test the approach in higher-dimensional or more complex non-stationary tasks.
- It suggests a path to more efficient learning in environments where resets are impossible, such as real-world robotics.
Load-bearing premise
The claim relies on certain fairly general but unspecified assumptions holding in the non-stationary reset-free environment.
What would settle it
Demonstrating a non-stationary reset-free environment where empirical risk minimization fails to converge to the Bayes optimal policy would falsify the asymptotic achievement result.
Figures
read the original abstract
Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called ''Prospective Learning with Control'' (PLuC), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task relevant to both natural and artificial agents. We illustrate that modern RL algorithms, which assume stationarity, struggle in these non-stationary reset-free environments. Even with time-aware modifications, they converge orders of magnitude slower than our prospective foraging agents on a simple 1-D foraging benchmark. Code is available at: https://github.com/neurodata/procontrol.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Prospective Learning with Control (PLuC), a framework extending supervised learning via empirical risk minimization (ERM) to optimal control in non-stationary, reset-free environments. It claims to prove that under certain fairly general assumptions, ERM asymptotically recovers the Bayes optimal policy. The framework is illustrated on a foraging task, where prospective agents are shown to converge orders of magnitude faster than standard and time-aware RL methods on a 1-D benchmark. Code is provided.
Significance. If the asymptotic result holds under well-specified assumptions that accommodate arbitrary non-stationarity without implicit access to future statistics, the work could offer a theoretically grounded supervised-learning route to control problems where RL's stationarity assumptions fail. The reproducibility via public code is a clear strength.
major comments (2)
- [Abstract] Abstract: The central claim that 'we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy' provides neither the assumptions nor any derivation outline or error analysis. Standard ERM convergence arguments require i.i.d. or stationary data; the non-stationary reset-free setting therefore needs explicit conditions (e.g., on total variation of the environment measure or existence of a limiting distribution) to remain valid. Without these, it is impossible to verify whether the result applies to the motivating class of problems or reduces to a fitted quantity by construction.
- [Foraging benchmark] Foraging benchmark section: The reported comparison states that RL algorithms 'converge orders of magnitude slower' than prospective agents, yet no variance across runs, confidence intervals, or statistical tests are provided. This weakens the empirical support for the claim that PLuC is practically superior in non-stationary reset-free settings.
minor comments (2)
- [Methods] The prospective loss function and its relation to the standard supervised loss could be stated more explicitly with a short example in the main text rather than deferred to the appendix.
- [Introduction] A brief discussion of how the framework reduces to standard supervised learning when the environment is stationary would help readers situate the contribution.
Simulated Author's Rebuttal
Thank you for the detailed and constructive feedback on our manuscript. We have carefully considered each of the major comments and provide point-by-point responses below. We believe these revisions will strengthen the presentation of our results on Prospective Learning with Control.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy' provides neither the assumptions nor any derivation outline or error analysis. Standard ERM convergence arguments require i.i.d. or stationary data; the non-stationary reset-free setting therefore needs explicit conditions (e.g., on total variation of the environment measure or existence of a limiting distribution) to remain valid. Without these, it is impossible to verify whether the result applies to the motivating class of problems or reduces to a fitted quantity by construction.
Authors: We thank the referee for highlighting the need for greater clarity regarding the theoretical result. The assumptions—including conditions on the total variation of the environment measure and existence of limiting distributions that accommodate arbitrary non-stationarity without implicit access to future statistics—are explicitly stated in the theorem and proof in Section 3 of the manuscript, along with a derivation outline and error analysis that extends standard ERM arguments to the reset-free case. To address this comment directly, we will revise the abstract to include a concise statement of the key assumptions and a high-level sketch of the convergence argument. This change will make the scope of the result immediately verifiable from the abstract while preserving the full details in the main text. revision: yes
-
Referee: [Foraging benchmark] Foraging benchmark section: The reported comparison states that RL algorithms 'converge orders of magnitude slower' than prospective agents, yet no variance across runs, confidence intervals, or statistical tests are provided. This weakens the empirical support for the claim that PLuC is practically superior in non-stationary reset-free settings.
Authors: We agree that the empirical section would benefit from additional statistical rigor. In the revised manuscript, we will report results averaged over multiple independent runs, include confidence intervals or standard error bars, and add appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) to quantify the significance of the observed differences in convergence rates. These updates will provide stronger quantitative support for the practical superiority of prospective agents over time-aware RL baselines in the 1-D foraging benchmark. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper claims an asymptotic proof that ERM recovers the Bayes optimal policy under certain fairly general assumptions within the PLuC framework for non-stationary reset-free control. No load-bearing steps are exhibited that reduce by the paper's own equations or self-citations to fitted inputs, self-definitions, or ansatzes imported from prior author work. The result is presented as independent content resting on the stated assumptions and framework extension rather than tautological renaming or construction. This is the expected honest outcome when the derivation chain does not collapse to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Certain fairly general assumptions allow ERM to asymptotically achieve the Bayes optimal policy in non-stationary reset-free control settings.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Sulla determinazione empirica delle leggi di probabilita.Gion
V Glivenko. Sulla determinazione empirica delle leggi di probabilita.Gion. Ist. Ital. Attauri., 4:92–99, 1933. URLhttps://ci.nii.ac.jp/naid/10026792179/. 1
-
[2]
Sulla determinazione empirica delle leggi di probabilita.Giorn
Francesco Paolo Cantelli. Sulla determinazione empirica delle leggi di probabilita.Giorn. Ist. Ital. Attuari, 4,
-
[3]
On the uniform convergence of relative frequencies of events to their probabilities,
V Vapnik and A Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.Theory of Probability and its Applications, 16:264–280, 1971. ISSN 0040-585X. doi:10.1137/ 1116025. URLhttps://doi.org/10.1137/1116025. doi: 10.1137/1116025
-
[4]
A Theory of the Learnable.Communications of the ACM, 27:1134–1142, 1984
L G Valiant. A Theory of the Learnable.Communications of the ACM, 27:1134–1142, 1984. ISSN 0001-
work page 1984
-
[5]
doi:10.1145/1968.1972. URLhttp://doi.acm.org/10.1145/1968.1972. 1, 2
-
[6]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, volume 30, 2017. 1
work page 2017
-
[7]
A Bayesian approach to filtering junk E-mail.AAAI Con- ference on Artificial Intelligence, 1998
M Sahami, S Dumais, D Heckerman, and E Horvitz. A Bayesian approach to filtering junk E-mail.AAAI Con- ference on Artificial Intelligence, 1998. URLhttps://cdn.aaai.org/Workshops/1998/WS-98-05/ WS98-05-009.pdf. 1
work page 1998
-
[8]
30 Leland McInnes, John Healy, and Steve Astels
Abraham Wald. Statistical Decision Functions.Annals of Mathematical Statistics, 20:165–205, 1949. ISSN 0003-4851,2168-8990. doi:10.1214/aoms/1177730030. URLhttps://projecteuclid.org/euclid. aoms/1177730030. 1
-
[9]
The Annals of Mathematical Statistics , author =
Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The annals of mathematical statistics, 37:1554–1563, 1966. ISSN 0003-4851,2168-8990. doi:10.1214/ aoms/1177699147. URLhttp://dx.doi.org/10.1214/aoms/1177699147. 1
-
[10]
Dynamic programming and stochastic control processes.Information and control, 1:228– 239, 1958
Richard Bellman. Dynamic programming and stochastic control processes.Information and control, 1:228– 239, 1958. ISSN 0019-9958,1878-2981. doi:10.1016/s0019-9958(58)80003-0. URLhttp://dx.doi. org/10.1016/S0019-9958(58)80003-0. 1
-
[11]
R E Kalman. A new approach to linear filtering and prediction problems.International Jour- nal of Engineering, Transactions A: Basics, 82:35–45, 1960. ISSN 0021-9223. doi:10.1115/ 1.3662552. URLhttp://fluidsengineering.asmedigitalcollection.asme.org/article. aspx?articleid=1430402. 1
work page 1960
-
[12]
Y oan D Landau. Adaptive control: The model reference approach.IEEE transactions on systems, man, and cybernetics, SMC-14:169–170, 1984. ISSN 0018-9472,2168-2909. doi:10.1109/tsmc.1984.6313284. URL http://dx.doi.org/10.1109/TSMC.1984.6313284. 1
-
[13]
Richard S Sutton and Andrew G Barto.Reinforcement Learning: An Introduction. MIT Press, 2018. 1, 4
work page 2018
-
[14]
David Silver, Aja Huang, Chris J. Maddison, et al. Mastering the game of go with deep neural networks and tree search.Nature, 529(7587):484–489, 2016. 1, 4
work page 2016
-
[15]
Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards Continual Reinforcement Learn- ing: A Review and Perspectives.Journal of Artificial Intelligence Research, 75:1401–1476, 2022. ISSN 1076-9757,1076-9757. doi:10.1613/jair.1.13673. URLhttps://www.jair.org/index.php/jair/ article/view/13673. 1
-
[16]
arXiv preprint arXiv:2307.11046 , title =
David Abel, André Barreto, Benjamin Van Roy, Doina Precup, H V Hasselt, and Satinder Singh. A definition of continual reinforcement learning.Neural Information Processing Systems, abs/2307.11046, 2023. doi: 10.48550/arXiv.2307.11046. URLhttps://openreview.net/pdf?id=ZZS9WEWYbD. 8
-
[17]
Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Yueyang Liu, and Benjamin Van Roy. Continual learning as computationally constrained reinforcement learning.Foundations and Trends® in Machine Learning, 18:913–1053, 2025. ISSN 1935-8237,1935-8245. doi:10.1561/2200000116. URLhttp://dx.doi.org/10.1561/2200000116. 1
-
[18]
Annie S Chen, Archit Sharma, S Levine, and Chelsea Finn. Y ou only live once: Single-life reinforcement learning.Advances in Neural Information Processing Systems, abs/2210.08863, 2022. ISSN 1049-5258. doi:10.48550/arXiv.2210.08863. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2022/file/5ec4e93f2cec19d47ef852a0e1fb2c48-Paper-Conference.pdf. 1
-
[19]
Reset-free lifelong learning with skill-space planning.arXiv [cs.LG], 2020
Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning.arXiv [cs.LG], 2020. URLhttps://openreview.net/pdf?id=HIGSa_3kOx3. 1
work page 2020
-
[20]
Leslie Valiant.Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Com- plex World. Basic Books, 2013. ISBN 9780465032716. 2
work page 2013
-
[21]
The MIT Press, kindle edition, 2012
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of Machine Learning (Adaptive Computation and Machine Learning series). The MIT Press, kindle edition, 2012. 2
work page 2012
-
[22]
Ian Goodfellow, Y oshua Bengio, Aaron Courville, and Y oshua Bengio.Deep Learning, volume 1 ofAdaptive Computation and Machine Learning series. MIT Press, 2016. ISBN 9780262337434. URLhttps://www. amazon.com/dp/B01MRVFGX4/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1. 2
work page 2016
-
[23]
Joshua T Vogelstein, Jayanta Dey, Hayden S Helm, Will LeVine, Ronak D Mehta, Tyler M Tomita, Haoyin Xu, Ali Geisa, Qingyang Wang, Gido M van de Ven, Chenyu Gao, Weiwei Y ang, Bryan Tower, Jonathan Larson, Christopher M White, and Carey E Priebe. Simple lifelong learning machines.IEEE transactions on pattern analysis and machine intelligence, PP:1–15, 2025...
-
[24]
Prospective Learning: Principled Extrapolation to the Future
Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, Archana Venkataraman, Brian Caffo, Javier J How, Justus M Kebschull, John W Krakauer, Maxim Bichuch, Kaleab Alemayehu Kinfu, Eva Y ezerets, Dinesh Jayaraman, Jong M Shin, Soledad Villar, Ian Phillips, Carey E P...
work page 2023
-
[25]
Prospective learning: Learning for a dynamic future
Ashwin De Silva, Rahul Ramesh, Rubing Y ang, Siyu Yu, Joshua T Vogelstein, and Pratik Chaudhari. Prospective learning: Learning for a dynamic future. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 4, 11, 12
work page 2024
-
[26]
Lecture notes in computer science
Yuxin Bai, Cecelia Shuai, Ashwin De Silva, Siyu Yu, Pratik Chaudhari, and Joshua T Vogelstein.Prospective learning in retrospect, pages 17–29. Lecture notes in computer science. Springer Nature Switzerland, 2026. 2, 5, 11
work page 2026
-
[27]
Dimitri Bertsekas.A course in Reinforcement Learning. Athena Scientific, 2023. 2, 4
work page 2023
- [28]
-
[29]
Bandit based monte-carlo planning
Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors,Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18–22, 2006, Proceedings, volume 4212 ofLecture Notes in Computer Science, pages 282–293. Springer, 2006. 9
work page 2006
-
[30]
Efficient selectivity and backup operators in monte-carlo tree search
Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In H. Jaap van den Herik, Paolo Ciancarini, and H. H. L. M. Donkers, editors,Computers and Games, CG 2006, Turin, Italy, May 29–31, 2006, Revised Papers, Lecture Notes in Computer Science, pages 72–83. Springer, 2007
work page 2006
-
[31]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. URLhttps://arxiv.org/abs/1712.01815. 4
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[32]
What is foraging?Biology & Philosophy, 39:3, 2024
David L Barack. What is foraging?Biology & Philosophy, 39:3, 2024. 4
work page 2024
-
[33]
Psychology Press, 1 edition, 2014
James J Gibson.The Ecological Approach to Visual Perception: Classic Edition (Psychology Press & Routledge Classic Editions). Psychology Press, 1 edition, 2014. 4
work page 2014
-
[34]
Tree-based batch mode reinforcement learning.Journal of Machine Learning Research, 6, 2005
Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning.Journal of Machine Learning Research, 6, 2005. 5, 15
work page 2005
-
[35]
Finite-time bounds for fitted value iteration.Journal of Machine Learn- ing Research, 9(5), 2008
Remi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration.Journal of Machine Learn- ing Research, 9(5), 2008. 5, 15
work page 2008
-
[36]
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 5, 16
work page 2018
-
[37]
Soft Actor-Critic Algorithms and Applications
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications.arXiv preprint arXiv:1812.05905, 2018. 5, 16
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[38]
Cambridge University Press, 2014
Shai Shalev-Shwartz and Shai Ben-David.Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. 11
work page 2014
-
[39]
J Neyman. On the application of probability theory to agricultural experiments: Essay on principles, section 9.(translated in 1990).Statistical Science, 5:465–480, 1923. 12
work page 1990
-
[40]
D Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies.Journal of educational Psychology, 66:688–701, 1974. 12 10 A Prospective Learning without control (PL-C) Here we briefly review the prior work on this topic, which is called "prospective learning" [23–25] (PL), modifying notation slightly for convenience. In retrospec...
work page 1974
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.