Calibrated Model-Based Deep Reinforcement Learning

Ali Malik; Danny Nemer; Harlan Seymour; Jiaming Song; Stefano Ermon; Volodymyr Kuleshov

arxiv: 1906.08312 · v1 · pith:6H4U5YG5new · submitted 2019-06-19 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 20:11 UTC · model grok-4.3

keywords model-based reinforcement learning · uncertainty calibration · predictive uncertainty · sample efficiency · HalfCheetah · MuJoCo · deep reinforcement learning

The pith

Augmenting any model-based RL agent with a calibrated model improves planning, sample complexity, and exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that predictive uncertainties in model-based reinforcement learning must be calibrated, meaning their probabilities match the actual frequencies of events, to avoid bottlenecking performance. It introduces a simple augmentation that adds this calibration to any existing model-based RL agent. The approach yields consistent gains in planning quality, reduced data requirements, and better exploration. On the HalfCheetah MuJoCo benchmark it reaches state-of-the-art results with half the samples of the prior best method. The finding indicates calibration can be added at low cost to lift model-based RL results.

Core claim

The central claim is that calibrated uncertainties—where predicted probabilities match empirical frequencies—are required for accurate model-based planning and reinforcement learning. A simple procedure augments any model-based RL agent with such a calibrated model, and the resulting system improves planning, sample complexity, and exploration. On the HalfCheetah MuJoCo task the calibrated system reaches state-of-the-art performance while using 50 percent fewer samples than the leading prior approach.
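The notion of calibration used here, predicted probabilities matching empirical frequencies, can be checked numerically. The following is a minimal sketch, not the paper's evaluation code: it assumes a Gaussian forecaster and synthetic data, and the names `calibration_curve` and `calibration_error` are hypothetical helpers introduced for illustration.

```python
import random
from statistics import NormalDist


def calibration_curve(means, stds, targets, levels):
    """For each level p, return the fraction of targets at or below the predicted p-quantile."""
    # Evaluating each predicted CDF at its observed target gives a value in [0, 1];
    # for a calibrated forecaster these values are uniformly distributed.
    cdf_at_target = [NormalDist(m, s).cdf(y) for m, s, y in zip(means, stds, targets)]
    return [sum(u <= p for u in cdf_at_target) / len(cdf_at_target) for p in levels]


def calibration_error(levels, observed):
    """Mean squared gap between nominal levels and observed frequencies."""
    return sum((p - o) ** 2 for p, o in zip(levels, observed)) / len(levels)


# Synthetic data (illustrative only): targets are the mean plus unit-variance noise.
rng = random.Random(0)
n = 5000
means = [rng.uniform(-1.0, 1.0) for _ in range(n)]
targets = [rng.gauss(m, 1.0) for m in means]
levels = [i / 10 for i in range(1, 10)]

matched = calibration_curve(means, [1.0] * n, targets, levels)        # correct std
overconfident = calibration_curve(means, [0.5] * n, targets, levels)  # std too small

err_matched = calibration_error(levels, matched)
err_overconfident = calibration_error(levels, overconfident)
print(f"matched: {err_matched:.4f}  overconfident: {err_overconfident:.4f}")
```

A forecaster whose standard deviation is too small yields observed frequencies that drift away from the nominal levels, which is the kind of mismatch the core claim is concerned with.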

What carries the argument

A calibrated predictive model whose output probabilities match observed event frequencies, inserted into the model-based planning and exploration loop.
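One plausible way to obtain such a model is a post-hoc recalibration map fitted on held-out data, in the spirit of the calibrated-regression work listed in the references. The sketch below is an assumption about how this could look, not the paper's implementation: it uses an empirical CDF where an isotonic regressor might be used instead, and `Recalibrator`, `model_cdf`, and the toy data are hypothetical.

```python
import bisect
import random
from statistics import NormalDist


class Recalibrator:
    """Maps a predicted CDF value p to the frequency observed on held-out data."""

    def __init__(self, heldout_cdf_values):
        self._sorted = sorted(heldout_cdf_values)

    def __call__(self, p):
        # Empirical CDF of the held-out values, evaluated at p.
        return bisect.bisect_right(self._sorted, p) / len(self._sorted)


def model_cdf(mean, y, std=0.5):
    """Hypothetical overconfident model: Gaussian with a too-small std."""
    return NormalDist(mean, std).cdf(y)


rng = random.Random(1)


def sample(n):
    """Toy data: targets are the mean plus unit-variance noise."""
    means = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return means, [rng.gauss(m, 1.0) for m in means]


# Fit the recalibration map on one held-out set.
cal_means, cal_ys = sample(4000)
recal = Recalibrator([model_cdf(m, y) for m, y in zip(cal_means, cal_ys)])

# Evaluate on a fresh set.
test_means, test_ys = sample(4000)
raw = [model_cdf(m, y) for m, y in zip(test_means, test_ys)]
fixed = [recal(p) for p in raw]


def coverage(cdf_values, level):
    """Fraction of targets at or below the nominal `level` quantile."""
    return sum(v <= level for v in cdf_values) / len(cdf_values)


raw_cov = coverage(raw, 0.1)
fixed_cov = coverage(fixed, 0.1)
print(f"nominal 0.10 -> raw {raw_cov:.3f}, recalibrated {fixed_cov:.3f}")
```

The base model is left untouched; only its output probabilities are remapped, which is consistent with the page's description of calibration as a lightweight module added on top of an existing agent.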

If this is right

Planning accuracy rises because the model supplies better-calibrated probabilities for decision making.
Sample complexity drops, so agents reach target performance with less interaction data.
Exploration improves because calibrated uncertainties guide more effective information-seeking behavior.
The augmentation applies to any existing model-based RL agent with only minimal added computation.
State-of-the-art results become reachable on continuous-control tasks such as MuJoCo environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Calibration could be applied to model-based planning outside RL, such as in robotics trajectory optimization.
The same augmentation might lift performance on other continuous-control benchmarks beyond HalfCheetah.
If calibration is the key missing ingredient, many existing model-based methods may improve simply by adding it.

Load-bearing premise

The calibration step produces uncertainty estimates that remain useful for the downstream planning algorithm instead of merely matching frequencies on held-out data.

What would settle it

Running the same HalfCheetah experiment and finding that the calibrated model version requires the same number of samples or more to reach the reported state-of-the-art score.
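The comparison described above reduces to a samples-to-threshold calculation on two learning curves. A minimal sketch follows, assuming performance is tracked as the best return obtained so far; the curves and the target score are made-up illustrative numbers, not results from the paper, and `samples_to_reach` is a hypothetical helper.

```python
def running_max(returns):
    """Best return obtained so far at each evaluation point."""
    best, out = float("-inf"), []
    for r in returns:
        best = max(best, r)
        out.append(best)
    return out


def samples_to_reach(samples, returns, threshold):
    """First sample count at which the best-so-far return reaches `threshold`, else None."""
    for n, best in zip(samples, running_max(returns)):
        if best >= threshold:
            return n
    return None


# Hypothetical learning curves (illustrative numbers only).
samples = [10_000, 20_000, 30_000, 40_000, 50_000]
baseline_returns = [100.0, 900.0, 2500.0, 3800.0, 5200.0]
calibrated_returns = [150.0, 2600.0, 5300.0, 5100.0, 5400.0]
target = 5000.0

base_n = samples_to_reach(samples, baseline_returns, target)
cal_n = samples_to_reach(samples, calibrated_returns, target)
reduction = 1 - cal_n / base_n
print(f"baseline: {base_n}, calibrated: {cal_n}, reduction: {reduction:.0%}")
```

If the calibrated agent's count were equal to or larger than the baseline's, the reduction would be zero or negative, which is the outcome that would count against the claim.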

Figures

Figures reproduced from arXiv: 1906.08312 by Ali Malik, Danny Nemer, Harlan Seymour, Jiaming Song, Stefano Ermon, Volodymyr Kuleshov.

**Figure 1.** Modern model-based planning algorithms with probabilistic models can over-estimate their confidence (purple distribution), and overlook dangerous outcomes (e.g., a collision). We show how to endow agents with a calibrated world model that accurately captures true uncertainty (green distribution) and improves planning in high-stakes scenarios like autonomous driving or industrial optimisation.

**Figure 2.** Top: Performance of CalLinUCB and LinUCB on the UCI covertype dataset. Bottom: Calibration curves of the LinUCB algorithms on the covertype dataset. Results. We expect the LinUCB algorithm to already be calibrated on the synthetic linear data since the model is well-specified, implying no difference in performance between CalLinUCB and LinUCB. On the real UCI datasets …

**Figure 3.** Performance on different control tasks. The calibrated algorithm performs at least as well as, and often much better than, the uncalibrated models. Plots show maximum reward obtained so far, averaged over 10 trials. Standard error is displayed as the shaded areas. Extensions to Safety. Calibration also plays an important role in the domain of RL safety (Berkenkamp et al., 2017). In situations where the agent is pl…

**Figure 4.** Predicted expected reward for both LinUCB and CalLinUCB algorithms on the covertype dataset. Figures show predictions at random timesteps where CalLinUCB chose the optimal action but LinUCB did not. Top: Predicted reward of both algorithms for the optimal action. Bottom: Predicted reward of both algorithms for the action which the algorithm chose to pick instead of the optimal action at that timestep. We c…

**Figure 5.** Cartpole future state predictions. The calibrated algorithm has much tighter uncertainties around the true next state in early training iterations. Later into training, their uncertainties are almost equivalent.

Abstract

Estimates of predictive uncertainty are important for accurate model-based planning and reinforcement learning. However, predictive uncertainties, especially ones derived from modern deep learning systems, can be inaccurate and impose a bottleneck on performance. This paper explores which uncertainties are needed for model-based reinforcement learning and argues that good uncertainties must be calibrated, i.e. their probabilities should match empirical frequencies of predicted events. We describe a simple way to augment any model-based reinforcement learning agent with a calibrated model and show that doing so consistently improves planning, sample complexity, and exploration. On the HalfCheetah MuJoCo task, our system achieves state-of-the-art performance using 50% fewer samples than the current leading approach. Our findings suggest that calibration can improve the performance of model-based reinforcement learning with minimal computational and implementation overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Calibration added to MBRL agents cuts the samples needed on HalfCheetah, but the abstract leaves open whether the calibration step itself drives the gains and whether single-step calibration stays coherent over multi-step rollouts.

The punchline is that this paper shows a simple calibration step on top of any model-based RL agent improves planning, sample complexity, and exploration, with a 50% reduction in samples needed for SOTA on HalfCheetah. The new element is the empirical demonstration that this augmentation works on standard continuous control benchmarks. The core idea of calibration comes from earlier uncertainty work, but applying it here and measuring the downstream effect on RL performance is the contribution.

It does well by keeping the method lightweight and claiming consistent benefits across tasks. The argument that good uncertainties must match empirical frequencies is reasonable and directly tied to the planning use case.

The main soft spot is the lack of detail in the abstract on whether the calibration actually produces uncertainties that are useful inside the planning loop. Calibration is done on single-step predictions from held-out data, but model-based planning involves multi-step rollouts where conditional coherence matters. If the calibration only fixes average frequencies without fixing the sequential properties, the gains could come from something else in the procedure. The paper would need to show ablations that isolate the calibration effect and perhaps some analysis of uncertainty propagation.

This is for RL practitioners and researchers focused on model-based approaches and uncertainty. Anyone looking for ways to improve sample efficiency without heavy changes would find it relevant. It deserves peer review because the results are strong enough on the surface to warrant checking the methods and whether the claims hold under scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper argues that predictive uncertainties in model-based deep RL must be calibrated (i.e., predicted probabilities must match empirical frequencies) to avoid bottlenecks in planning. It presents a simple augmentation procedure that can be added to any MBRL agent and claims this consistently improves planning, sample complexity, and exploration; on HalfCheetah it reaches state-of-the-art performance with 50% fewer samples than prior leading methods.

Significance. If the reported gains are causally attributable to the calibration step and the uncertainties remain useful inside multi-step planning, the result would supply a low-overhead, broadly applicable improvement to existing MBRL pipelines. The emphasis on calibration as a distinct requirement beyond raw predictive accuracy would also sharpen the community's understanding of what makes model-based methods sample-efficient.

major comments (2)

[§4 and §5] §4 (method) and §5 (experiments): the calibration procedure is validated only on single-step marginal frequencies from a held-out set; no analysis or diagnostic is given showing that the resulting uncertainties remain conditionally calibrated or coherent when propagated over the multi-step rollouts actually used by the planner. This leaves open the possibility that observed gains arise from other changes in the augmentation rather than from calibration itself.
[Table 1] Table 1 / HalfCheetah results: the 50% sample reduction and SOTA claim are presented without an ablation that isolates the calibration module (i.e., the same base agent with and without the calibration step, all other factors fixed). Without this control it is impossible to attribute the performance difference to the calibrated uncertainties rather than ancillary implementation details.

minor comments (2)

[Abstract] The abstract refers to 'our system' without naming the underlying MBRL algorithm (e.g., PETS, MBPO) that receives the calibrated model; this should be stated explicitly for reproducibility.
[§3] Notation for the calibrated predictive distribution is introduced without an explicit comparison to the uncalibrated baseline distribution used in the same experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in the current evidence for attributing gains specifically to calibration under multi-step planning. We address each point below and will revise the manuscript to incorporate the requested analyses and ablations.

Referee: [§4 and §5] §4 (method) and §5 (experiments): the calibration procedure is validated only on single-step marginal frequencies from a held-out set; no analysis or diagnostic is given showing that the resulting uncertainties remain conditionally calibrated or coherent when propagated over the multi-step rollouts actually used by the planner. This leaves open the possibility that observed gains arise from other changes in the augmentation rather than from calibration itself.

Authors: We agree that single-step marginal calibration alone does not fully demonstrate suitability for multi-step planning. The manuscript validates calibration on held-out single-step predictions as a necessary first step, but does not provide diagnostics for conditional calibration or coherence after propagation through the planner's rollouts. In revision we will add such diagnostics, for example by measuring empirical frequencies of events over trajectories actually sampled by the planner and checking whether the propagated uncertainties remain calibrated.

Revision: yes

Referee: [Table 1] Table 1 / HalfCheetah results: the 50% sample reduction and SOTA claim are presented without an ablation that isolates the calibration module (i.e., the same base agent with and without the calibration step, all other factors fixed). Without this control it is impossible to attribute the performance difference to the calibrated uncertainties rather than ancillary implementation details.

Authors: We acknowledge that the current comparisons are to external prior methods rather than an internal control that holds the base agent fixed. An explicit ablation isolating the calibration step would strengthen causal attribution. We will add this ablation to the revised manuscript, reporting performance of the identical base agent with and without the calibration augmentation on HalfCheetah (and other tasks) while keeping all other implementation details unchanged.

Revision: yes
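The multi-step diagnostic raised in the first exchange, measuring empirical frequencies over rolled-out trajectories, can be sketched on a toy problem. Everything below is an assumption made for illustration: the one-dimensional linear dynamics, the noise levels, and the helper names `rollout` and `multistep_coverage` are hypothetical and do not come from the paper.

```python
import random


def rollout(x0, horizon, noise_std, rng, decay=0.9):
    """Simulate x' = decay * x + Gaussian noise for `horizon` steps."""
    xs, x = [], x0
    for _ in range(horizon):
        x = decay * x + rng.gauss(0.0, noise_std)
        xs.append(x)
    return xs


def multistep_coverage(model_std, horizon, level, n_starts=400, n_model_samples=200, seed=0):
    """Per-step fraction of true states at or below the model's empirical `level`-quantile."""
    rng = random.Random(seed)
    hits = [0] * horizon
    for _ in range(n_starts):
        x0 = rng.uniform(-1.0, 1.0)
        truth = rollout(x0, horizon, 1.0, rng)  # true dynamics: unit noise
        model = [rollout(x0, horizon, model_std, rng) for _ in range(n_model_samples)]
        for h in range(horizon):
            # Rank of the true state among the model's sampled states at step h.
            rank = sum(traj[h] <= truth[h] for traj in model) / n_model_samples
            if rank <= level:
                hits[h] += 1
    return [c / n_starts for c in hits]


matched = multistep_coverage(model_std=1.0, horizon=5, level=0.1)
overconfident = multistep_coverage(model_std=0.5, horizon=5, level=0.1)
print("matched      :", [round(c, 3) for c in matched])
print("overconfident:", [round(c, 3) for c in overconfident])
```

A model that is calibrated along the rollout should show per-step frequencies near the nominal level at every horizon; a drift with horizon would indicate that single-step calibration is not carrying over to the trajectories the planner actually uses.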

Circularity Check

0 steps flagged

No circularity; empirical augmentation evaluated independently of calibration fitting

full rationale

The paper's core claim is an empirical one: augmenting any MBRL agent with a calibrated model (via a post-processing step that matches probabilities to held-out frequencies) yields measurable gains in planning, sample efficiency, and exploration on tasks such as HalfCheetah. No derivation chain, equation, or self-citation reduces the reported performance improvements to quantities fitted on the same evaluation data or to a tautological redefinition. Calibration is described as a simple, independent module whose effect is measured separately from its construction; the central result therefore remains falsifiable on external benchmarks and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that a calibrated model can be obtained by a lightweight post-processing step whose parameters do not require task-specific retuning.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
cs.LG 2026-04 unverdicted novelty 8.0

RHC-UCRL is the first algorithm for safety-constrained RL under explicit adversarial dynamics, providing sub-linear regret and constraint violation guarantees by maintaining optimism over both agent and adversary policies.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[2] Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235-256, May 2002. ISSN 0885-6125. doi:10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352
[3] Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 908-918. Curran Associates, Inc., 2017.
[4] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234-8244, 2018.
[6] Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4759-4770. Curran Associates, Inc., 2018.
[7] Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018.
[8] Dawid, A. P. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 147:278-292, 1984.
[9] Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465-472, 2011.
[10] Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
[11] Ermon, S., Conrad, J., Gomes, C. P., and Selman, B. Playing games against nature: optimal policies for renewable resource allocation. 2012.
[12] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016a.
[13] Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1019-1027, 2016b.
[14] Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems, pp. 3581-3590, 2017.
[15] Gneiting, T. and Raftery, A. E. Weather forecasting with ensemble methods. Science, 310(5746):248-249, 2005.
[16] Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, 2007.
[17] Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243-268, 2007.
[19] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017b.
[20] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[21] Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.
[22] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269. IEEE, 2017.
[23] Kuleshov, V. and Ermon, S. Estimating uncertainty online against an adversary. In AAAI, pp. 2110-2116, 2017.
[24] Kuleshov, V. and Liang, P. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NIPS), 2015.
[25] Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2796-2804, Stockholmsmässan, Stockholm Sweden, 10-15 Jul 2018. PMLR. URL http://pr...
[26] Kull, M. and Flach, P. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 68-85. Springer, 2015.
[27] Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.
[28] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2017a.
[29] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402-6413, 2017b.
[30] Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 661-670, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi:10.1145/1772690.1772758. URL http://doi.acm.org/10.1145/1772690.1772758
[31] McIntire, M., Ratner, D., and Ermon, S. Sparse Gaussian processes for Bayesian optimization. In UAI, 2016.
[32] Murphy, A. H. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595-600, 1973.
[33] Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 625-632, 2005.
[34] Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.
[35] Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5):1155-1174, 2005.
[36] Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. EPOpt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
[37] Rollinson, J. and Brunskill, E. From predictive models to instructional policies. International Educational Data Mining Society, 2015.
[38] Saria, S. Individualized sepsis treatment using reinforcement learning. Nature Medicine, 24(11):1641-1642, November 2018. ISSN 1078-8956. doi:10.1038/s41591-018-0253-x
[39] Singh, S. P., Kearns, M. J., Litman, D. J., and Walker, M. A. Reinforcement learning for spoken dialogue systems. In Advances in Neural Information Processing Systems, pp. 956-962, 2000.
[40] Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.
[41] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026-5033. IEEE, 2012.
[42] Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N. A neuro-dynamic programming approach to retailer inventory management. In Decision and Control, 1997., Proceedings of the 36th IEEE Conference on, volume 4, pp. 4052-4057. IEEE, 1997.
[43] Vermeulen, S. J., Campbell, B. M., and Ingram, J. S. Climate change and food systems. Annual Review of Environment and Resources, 37, 2012.
[44] Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 694-699, 2002.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Finite-time analysis of the multiarmed bandit problem

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47 0 (2-3): 0 235--256, May 2002. ISSN 0885-6125. doi:10.1023/A:1013689704352. URL https://doi.org/10.1023/A:1013689704352

work page doi:10.1023/a:1013689704352 2002

[3] [3]

Safe model-based reinforcement learning with stability guarantees

Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. Safe model-based reinforcement learning with stability guarantees. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp.\ 908--918. Curran Associates, Inc., 2017

work page 2017

[4] [4]

OpenAI Gym

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Sample-efficient reinforcement learning with stochastic ensemble value expansion

Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp.\ 8234--8244, 2018

work page 2018

[6] [6]

Deep reinforcement learning in a handful of trials using probabilistic dynamics models

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp.\ 4759--4770. Curran Associates, Inc., 2018

work page 2018

[7] [7]

Model-Based Reinforcement Learning via Meta-Policy Optimization

Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Dawid, A. P. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 147: 0 278--292, 1984

work page 1984

[9] [9]

and Rasmussen, C

Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp.\ 465--472, 2011

work page 2011

[10] [10]

Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks

Depeweg, S., Hern \'a ndez-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Learning and policy search in stochastic dynamical systems with bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

P., and Selman, B

Ermon, S., Conrad, J., Gomes, C. P., and Selman, B. Playing games against nature: optimal policies for renewable resource allocation. 2012

work page 2012

[12] [12]

and Ghahramani, Z

Gal, Y. and Ghahramani, Z. Dropout as a B ayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016 a

work page 2016

[13] [13]

and Ghahramani, Z

Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pp.\ 1019--1027, 2016 b

work page 2016

[14] [14]

Concrete dropout

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems, pp.\ 3581--3590, 2017

work page 2017

[15] [15]

and Raftery, A

Gneiting, T. and Raftery, A. E. Weather forecasting with ensemble methods. Science, 310 0 (5746): 0 248--249, 2005

work page 2005

[16] [16]

and Raftery, A

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102 0 (477): 0 359--378, 2007

work page 2007

[17] [17]

Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69 0 (2): 0 243--268, 2007

work page 2007

[18] [19]

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017 b

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [20]

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [21]

Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018

[21] [22]

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. IEEE, 2017

[22] [23]

Kuleshov, V. and Ermon, S. Estimating uncertainty online against an adversary. In AAAI, pp. 2110–2116, 2017

[23] [24]

Kuleshov, V. and Liang, P. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NIPS), 2015

[24] [25]

Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2796–2804, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://pr...

[25] [26]

Kull, M. and Flach, P. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 68–85. Springer, 2015

[26] [27]

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018

[27] [28]

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2017a

[28] [29]

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017b

[29] [30]

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 661–670, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-799-8. doi:10.1145/1772690.1772758. URL http://doi.acm.org/10.1145/1772690.1772758

[30] [31]

McIntire, M., Ratner, D., and Ermon, S. Sparse Gaussian processes for Bayesian optimization. In UAI, 2016

[31] [32]

Murphy, A. H. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600, 1973

[32] [33]

Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 625–632, 2005

[33] [34]

Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999

[34] [35]

Raftery, A. E., Gneiting, T., Balabdaoui, F., and Polakowski, M. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5):1155–1174, 2005

[35] [36]

Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. EPOpt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016

[36] [37]

Rollinson, J. and Brunskill, E. From predictive models to instructional policies. International Educational Data Mining Society, 2015

[37] [38]

Saria, S. Individualized sepsis treatment using reinforcement learning. Nature Medicine, 24(11):1641–1642, Nov 2018. ISSN 1078-8956. doi:10.1038/s41591-018-0253-x

[38] [39]

Singh, S. P., Kearns, M. J., Litman, D. J., and Walker, M. A. Reinforcement learning for spoken dialogue systems. In Advances in Neural Information Processing Systems, pp. 956–962, 2000

[39] [40]

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018

[40] [41]

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012

[41] [42]

Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N. A neuro-dynamic programming approach to retailer inventory management. In Decision and Control, 1997., Proceedings of the 36th IEEE Conference on, volume 4, pp. 4052–4057. IEEE, 1997

[42] [43]

Vermeulen, S. J., Campbell, B. M., and Ingram, J. S. Climate change and food systems. Annual Review of Environment and Resources, 37, 2012

[43] [44]

Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 694–699, 2002