pith. machine review for the scientific record.

arXiv: 2512.23694 · v2 · submitted 2025-12-29 · stat.ML · cs.LG · econ.EM

Bellman Calibration for V-Learning in Offline Reinforcement Learning

keywords: bellman, calibration, value, learning, prediction, approximation, completeness, criterion

Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Bellman completeness or realizability. We introduce Bellman calibration, a weak reliability criterion requiring that states assigned similar predicted values have average Bellman targets that agree with those predictions. This criterion yields a scalar calibration error for diagnosing systematic numerical miscalibration, which we estimate from off-policy data using doubly robust Bellman target estimates. We then propose Iterated Bellman Calibration, a model-agnostic post-hoc procedure that recalibrates any learned value predictor by fitting a one-dimensional map of its original prediction, with histogram and isotonic variants. We prove finite-sample guarantees showing that Bellman calibration error is controlled at one-dimensional nonparametric rates without Bellman completeness or value-function realizability. Our value-error bounds separate statistical estimation, finite-iteration, and approximation errors, clarifying when calibration improves value prediction and when its gains are limited by the information in the original predictor or insufficient coverage.
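
The histogram variant described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes on-policy transitions and uses the plain target r + gamma * v(s') in place of the doubly robust Bellman target estimates the paper relies on, and every function name, the quantile binning, and the weighted root-mean-square form of the error are hypothetical choices made for illustration. The sketch bins states by their original predicted value, then repeatedly replaces each bin's value with the average Bellman target of the states in that bin.

```python
import numpy as np


def fit_bins(v_pred, n_bins):
    """Interior quantile edges computed from the base predictions (illustrative choice)."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(v_pred, qs)


def assign_bins(v_pred, edges):
    """Map each prediction to a bin index in {0, ..., len(edges)}."""
    return np.searchsorted(edges, v_pred, side="right")


def iterated_histogram_calibration(v_s, v_next, rewards, gamma, n_bins=10, n_iters=50):
    """Fit a piecewise-constant map of the base prediction by iterating bin-wise target means.

    v_s, v_next: base predictions at states s and next states s'.
    Assumption: on-policy data, so no importance weights or doubly robust correction.
    """
    edges = fit_bins(v_s, n_bins)
    b_s = assign_bins(v_s, edges)
    b_next = assign_bins(v_next, edges)
    counts = np.bincount(b_s, minlength=n_bins)
    safe_counts = np.maximum(counts, 1)
    # Start from the average base prediction in each bin.
    theta = np.bincount(b_s, weights=v_s, minlength=n_bins) / safe_counts
    for _ in range(n_iters):
        targets = rewards + gamma * theta[b_next]
        sums = np.bincount(b_s, weights=targets, minlength=n_bins)
        # Empty bins keep their previous value.
        theta = np.where(counts > 0, sums / safe_counts, theta)
    return edges, theta


def binned_bellman_calibration_error(pred_s, pred_next, rewards, gamma, n_bins=10):
    """Count-weighted RMS gap between mean prediction and mean Bellman target per bin."""
    edges = fit_bins(pred_s, n_bins)
    b = assign_bins(pred_s, edges)
    targets = rewards + gamma * pred_next
    counts = np.bincount(b, minlength=n_bins)
    safe_counts = np.maximum(counts, 1)
    mean_pred = np.bincount(b, weights=pred_s, minlength=n_bins) / safe_counts
    mean_tgt = np.bincount(b, weights=targets, minlength=n_bins) / safe_counts
    return float(np.sqrt(np.sum(counts * (mean_tgt - mean_pred) ** 2) / counts.sum()))


# Toy usage: four self-looping states whose base predictions are badly scaled.
v = np.array([0.1, 0.2, 0.8, 0.9])
r = np.array([1.0, 1.0, 3.0, 3.0])
edges, theta = iterated_histogram_calibration(v, v, r, gamma=0.5, n_bins=2)
calibrated = theta[assign_bins(v, edges)]
```

In this toy case the recalibrated values depend on a state only through its original prediction, which is the sense in which the abstract says gains are limited by the information in the original predictor. An isotonic variant would replace the bin-wise mean with a monotone one-dimensional regression on the same targets.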

This paper has not been read by Pith yet.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Calibeating Prediction-Powered Inference

    stat.ML · 2026-04 · unverdicted · novelty 7.0

    Post-hoc calibration of miscalibrated black-box predictions on a labeled sample improves efficiency of prediction-powered inference for semisupervised mean estimation.

  2. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

    cs.RO · 2026-04 · unverdicted · novelty 6.0

    Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.