pith. sign in

arxiv: 2606.17515 · v1 · pith:STY4O33Bnew · submitted 2026-06-16 · 📊 stat.ME · stat.ML

Anytime-valid Optimal Policy Identification

Pith reviewed 2026-06-26 23:46 UTC · model grok-4.3

classification 📊 stat.ME stat.ML
keywords optimal policy identificationanytime-valid inferencecontextual banditslogged datastopping timesample complexitypolicy evaluationuniform coverage
0
0 comments X

The pith

A time-indexed set retains the true optimal policy with high probability at every time step, allowing data-dependent stopping for identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an anytime-valid procedure for selecting the best policy from a finite class using data generated by an external logging policy in contextual bandits. It constructs a sequence of sets that contain the optimal policy set uniformly over time with high probability, so the analyst can eliminate inferior policies and stop sampling once the optimum is isolated without invalidating the guarantees. When a unique optimum exists, the method supplies an explicit stopping time whose expected sample size scales as O((log |Π| + log log(1/Δ_min))/Δ_min²). Simulations indicate the approach can stop earlier than fixed-sample designs, and the framework is illustrated on data from an adaptive experiment measuring policy effects on misinformation.

Core claim

When the optimal policy is unique, we define a stopping time for its identification and derive a sample-complexity bound scaling as O((log |Π| + log log(1/Δ_min))/Δ_min²), where Δ_min is the gap between the best and second-best policy values. The construction rests on a time-indexed set S_t that retains the true optimal policy set uniformly over time with high probability under a fixed but uncontrolled logging policy.

What carries the argument

The time-indexed set S_t that retains the true optimal policy set uniformly over time with high probability under a fixed logging policy.

If this is right

  • The analyst can monitor policy values continuously and eliminate clearly suboptimal policies while preserving valid inference.
  • When the optimum is unique the procedure stops at a data-dependent time whose expected size is bounded by O((log |Π| + log log(1/Δ_min))/Δ_min²).
  • Simulations show the anytime-valid rule can produce substantial sample savings relative to any fixed-N design.
  • The same construction supplies a dynamic view of accumulating evidence in an applied adaptive experiment on misinformation reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform-over-time guarantee may extend to slowly changing logging policies if the construction of S_t is adapted to track the instantaneous logging distribution.
  • The sample-complexity bound suggests the method could be competitive with fixed-sample optimal-policy identification when the number of candidate policies is moderate and gaps are not tiny.
  • Because the procedure eliminates policies sequentially, it may reduce computational cost in large policy classes by avoiding evaluation of already-eliminated candidates at later times.

Load-bearing premise

The time-indexed set S_t can be constructed to retain the true optimal policy set uniformly over time with high probability under a fixed but uncontrolled logging policy.

What would settle it

In repeated simulations with a known unique optimum and known gap Δ_min, the fraction of runs in which S_t fails to contain the optimum at the claimed stopping time exceeds the nominal error probability.

Figures

Figures reproduced from arXiv: 2606.17515 by Daniel Molitor.

Figure 1
Figure 1. Figure 1: Sequential elimination procedure. Construct policy value CSs from logged contex￾tual bandit data. Use these CSs to construct the candidate set St . Policies are eliminated when their upper confidence bounds fall below the best lower confidence bound; once St becomes a singleton, the remaining policy is selected. This formulation of an optimal policy candidate set extends naturally to related identifica￾tio… view at source ↗
Figure 2
Figure 2. Figure 2: demonstrates how our framework facilitates simultaneous inference on both the optimal policy π ⋆ and the underlying policy values ν(π). The top panel displays the progressive evolution of St as data accumulate: clearly suboptimal policies are eliminated early, while the optimal policy π ⋆ remains in St throughout and is ultimately uniquely identified. The bottom panel shows the corresponding CSs for the po… view at source ↗
Figure 3
Figure 3. Figure 3: Sample savings under target gap misspecification. The x-axis shows the ratio of the true minimum suboptimality gap to the target gap used to determine the oracle fixed sample size; the y-axis shows mean sample savings 1 − E[τ]/N90, averaged across target gap values, with ±1 standard error ribbon. Even when the true gap exceeds the target gap by only 25%, the anytime-valid procedure 12 [PITH_FULL_IMAGE:fig… view at source ↗
Figure 4
Figure 4. Figure 4: Anytime-valid optimal policy identification in the infodemic study. The heatmap displays St , the candidate set of optimal policies, over N = 15,292 observations. Policies sur￾viving to the end of the study (never eliminated) are shown at the top; remaining policies are ordered by their elimination time, latest to earliest [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

We develop an anytime-valid framework for optimal policy identification from logged contextual bandit data. In many applied settings, the analyst wants to select the optimal policy from a candidate policy class $\Pi$, but data are generated by an externally determined logging policy that they do not control. The analyst may also wish to monitor evidence continuously, stopping once the optimal policy is clear rather than committing to a fixed sample size in advance. This paper addresses these challenges by constructing a time-indexed set $S_t$ that retains the true optimal policy set uniformly over time with high probability. The resulting procedure allows the analyst to monitor policy values, eliminate clearly suboptimal policies, and stop at data-dependent times without invalidating inference. When the optimal policy is unique, we define a stopping time for its identification and derive a sample-complexity bound scaling as $O\!\left(\frac{\log |\Pi|+\log\log(1/\Delta_{\min})}{\Delta_{\min}^2}\right)$, where $\Delta_{\min}$ is the gap between the best and second-best policy values. Simulations demonstrate that the anytime-valid approach can yield substantial sample savings relative to fixed-$N$ designs. An application to a large adaptive experiment on reducing misinformation online illustrates how the method provides a dynamic view as evidence on the optimal policy accumulates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops an anytime-valid framework for optimal policy identification from logged contextual bandit data generated by an uncontrolled logging policy. It constructs a time-indexed set S_t that retains the true optimal policy set uniformly over time with high probability, enabling continuous monitoring, elimination of suboptimal policies, and data-dependent stopping without invalidating inference. When the optimal policy is unique, a stopping time is defined with sample-complexity bound O((log |Π| + log log(1/Δ_min))/Δ_min²), where Δ_min is the gap to the second-best policy. The claims are supported by simulations showing sample savings over fixed-N designs and an application to an adaptive experiment on reducing online misinformation.

Significance. If the uniform retention property of S_t and the derived bound hold under the stated conditions, the work would be significant for sequential policy learning in settings where sample sizes cannot be fixed in advance. The bound recovers the standard dependence on 1/Δ_min² up to logarithmic factors, with the extra log log term arising naturally from anytime validity; this is a reasonable tradeoff. The empirical components (simulations and real-data application) provide concrete evidence of practical utility in adaptive experiments.

major comments (2)
  1. [Abstract] The central construction of the time-indexed set S_t (abstract, paragraph 2) is asserted to retain the true optimal policy uniformly over t with high probability under a fixed logging policy, but the abstract provides no derivation, assumptions on reward variance or context distribution, or proof sketch. This property is load-bearing for both the stopping time and the sample-complexity bound; the full manuscript must supply the explicit construction and the uniform concentration argument.
  2. [Abstract] The sample-complexity bound O((log |Π| + log log(1/Δ_min))/Δ_min²) is stated for the unique-optimum case (abstract, final paragraph), but it is unclear whether the log log term arises from a standard union-bound or from a more refined anytime-valid martingale argument; the manuscript should clarify the precise inequality used and confirm that the bound remains valid when the logging policy is fixed and uncontrolled.
minor comments (2)
  1. [Abstract] Notation for the policy class Π and the gap Δ_min should be introduced with explicit definitions before the bound is stated.
  2. The simulations section would benefit from reporting the exact logging policy used and the number of Monte Carlo replications to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation of minor revision. We address the two major comments below with clarifications and targeted revisions to improve readability of the abstract and main text.

read point-by-point responses
  1. Referee: [Abstract] The central construction of the time-indexed set S_t (abstract, paragraph 2) is asserted to retain the true optimal policy uniformly over t with high probability under a fixed logging policy, but the abstract provides no derivation, assumptions on reward variance or context distribution, or proof sketch. This property is load-bearing for both the stopping time and the sample-complexity bound; the full manuscript must supply the explicit construction and the uniform concentration argument.

    Authors: The full manuscript supplies the explicit construction of S_t in Section 3 (Definition 1 and Algorithm 1), the uniform concentration argument via a time-uniform Freedman-type martingale inequality (Theorem 1), and the required assumptions (bounded rewards with variance proxy 1/4, fixed uncontrolled logging policy, and context distribution in Assumption 1). The abstract is intentionally concise per journal norms, but we have revised it to include a parenthetical pointer to Section 3 and the key assumption of bounded rewards. No derivation is added to the abstract itself due to length constraints, but the load-bearing property is now cross-referenced. revision: yes

  2. Referee: [Abstract] The sample-complexity bound O((log |Π| + log log(1/Δ_min))/Δ_min²) is stated for the unique-optimum case (abstract, final paragraph), but it is unclear whether the log log term arises from a standard union-bound or from a more refined anytime-valid martingale argument; the manuscript should clarify the precise inequality used and confirm that the bound remains valid when the logging policy is fixed and uncontrolled.

    Authors: The log log(1/Δ_min) term arises from the refined anytime-valid martingale argument (a peeling argument over dyadic time scales combined with a Freedman-type inequality, see the proof of Theorem 3), not a naive union bound. We have revised the manuscript to state the precise inequality (the time-uniform bound in Equation (4) of Section 3) and explicitly confirm validity under the fixed uncontrolled logging policy, as all concentration holds conditionally on the observed contexts and actions generated by that policy. The bound derivation is unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The abstract presents the construction of the time-indexed set S_t as a methodological step that yields a uniform high-probability guarantee, from which the stopping time and the O((log |Π| + log log(1/Δ_min))/Δ_min²) sample-complexity bound are then derived when the optimum is unique. This bound takes the standard form arising from concentration inequalities under a fixed gap Δ_min and policy class size, without any indication that it reduces by construction to a fitted parameter or to a self-referential definition of S_t itself. No load-bearing self-citation, ansatz smuggling, or renaming of known results is visible in the provided description. The central claim therefore retains independent content from the uniform guarantee and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit constants, lemmas, or new objects are named.

pith-pipeline@v0.9.1-grok · 5746 in / 1143 out tokens · 44315 ms · 2026-06-26T23:46:29.875743+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 35 canonical work pages · 8 internal anchors

  1. [1]

    Anytime-valid off-policy inference for contextual bandits , url =

    Waudby-Smith, Ian and Wu, Lili and Ramdas, Aaditya and Karampatziakis, Nikos and Mineiro, Paul , month = aug, year =. Anytime-valid off-policy inference for contextual bandits , url =. doi:10.48550/arXiv.2210.10768 , abstract =

  2. [2]

    Off-policy

    Karampatziakis, Nikos and Mineiro, Paul and Ramdas, Aaditya , month = feb, year =. Off-policy. doi:10.48550/arXiv.2102.09540 , abstract =

  3. [3]

    Off-policy estimation with adaptively collected data: the power of online learning , isbn =

    Lee, Jeonghwan and Ma, Cong , year =. Off-policy estimation with adaptively collected data: the power of online learning , isbn =. Advances in. doi:10.52202/079017-4255 , urldate =

  4. [4]

    Instance-optimal

    Li, Zhaoqi and Ratliff, Lillian and Nassif, Houssam and Jamieson, Kevin and Jain, Lalit , month = jul, year =. Instance-optimal. doi:10.48550/arXiv.2207.02357 , abstract =

  5. [5]

    Garivier, Aurélien and Kaufmann, Emilie , month = jun, year =. Optimal. doi:10.48550/arXiv.1602.04589 , abstract =

  6. [6]

    Tirinzoni, Andrea and Degenne, Rémy , month = oct, year =. On. doi:10.48550/arXiv.2205.10936 , abstract =

  7. [7]

    Anytime-

    Molitor, Daniel and Gold, Samantha , month = jun, year =. Anytime-. doi:10.48550/arXiv.2506.20523 , abstract =

  8. [8]

    Jamieson, Kevin and Malloy, Matthew and Nowak, Robert and Bubeck, Sébastien , month = dec, year =. lil'. doi:10.48550/arXiv.1312.7308 , abstract =

  9. [9]

    Confident

    Kuzborskij, Ilja and Vernade, Claire and György, András and Szepesvári, Csaba , month = mar, year =. Confident. doi:10.48550/arXiv.2006.10460 , abstract =

  10. [10]

    Uehara, Masatoshi and Shi, Chengchun and Kallus, Nathan , month = dec, year =. A. doi:10.48550/arXiv.2212.06355 , abstract =

  11. [11]

    McDiarmid, C

    Time-uniform, nonparametric, nonasymptotic confidence sequences , volume =. The Annals of Statistics , author =. 2021 , note =. doi:10.1214/20-AOS1991 , abstract =

  12. [12]

    Instance-

    Wagenmaker, Andrew and Jamieson, Kevin , month = jul, year =. Instance-. doi:10.48550/arXiv.2207.02575 , abstract =

  13. [13]

    Al-Marjani, Aymen and Tirinzoni, Andrea and Kaufmann, Emilie , month = oct, year =. Towards. doi:10.48550/arXiv.2311.05638 , abstract =

  14. [14]

    Anytime-

    Lindon, Michael and Ham, Dae Woong and Tingley, Martin and Bojinov, Iavor , month = jul, year =. Anytime-. doi:10.48550/arXiv.2210.08589 , abstract =

  15. [15]

    Mason, Blake and Jain, Lalit and Tripathy, Ardhendu and Nowak, Robert , month = sep, year =. Finding. doi:10.48550/arXiv.2006.08850 , abstract =

  16. [16]

    Bibaut, Aurélien and Chambaz, Antoine and Dimakopoulou, Maria and Kallus, Nathan and Laan, Mark van der , month = jun, year =. Post-. doi:10.48550/arXiv.2106.00418 , abstract =

  17. [17]

    and Zhan, Ruohan and Wager, Stefan and Athey, Susan , month = feb, year =

    Hadad, Vitor and Hirshberg, David A. and Zhan, Ruohan and Wager, Stefan and Athey, Susan , month = feb, year =. Confidence. doi:10.48550/arXiv.1911.02768 , abstract =

  18. [18]

    and Janson, Lucas and Murphy, Susan A

    Zhang, Kelly W. and Janson, Lucas and Murphy, Susan A. , month = nov, year =. Statistical. doi:10.48550/arXiv.2104.14074 , abstract =

  19. [19]

    Design of

    Zanette, Andrea and Dong, Kefan and Lee, Jonathan and Brunskill, Emma , month = jul, year =. Design of. doi:10.48550/arXiv.2107.09912 , abstract =

  20. [20]

    Jedra, Yassir and Proutiere, Alexandre , month = jun, year =. Optimal. doi:10.48550/arXiv.2006.16073 , abstract =

  21. [21]

    Dudik, Miroslav and Langford, John and Li, Lihong , month = may, year =. Doubly. doi:10.48550/arXiv.1103.4601 , abstract =

  22. [22]

    Sequential Experimental Design for Transductive Linear Bandits

    Fiez, Tanner and Jain, Lalit and Jamieson, Kevin and Ratliff, Lillian , month = jun, year =. Sequential. doi:10.48550/arXiv.1906.08399 , abstract =

  23. [23]

    Nature Human Behaviour , author =

    Battling the coronavirus ‘infodemic’ among social media users in. Nature Human Behaviour , author =. 2024 , pages =. doi:10.1038/s41562-023-01810-7 , language =

  24. [24]

    Journal of Human Resources , author =

    More. Journal of Human Resources , author =. 2019 , pages =. doi:10.3368/jhr.54.3.0317-8637R , language =

  25. [25]

    Schapire

    Li, Lihong and Chu, Wei and Langford, John and Schapire, Robert E. , month = apr, year =. A. Proceedings of the 19th international conference on. doi:10.1145/1772690.1772758 , abstract =

  26. [26]

    Soare, Marta and Lazaric, Alessandro and Munos, Rémi , month = nov, year =. Best-. doi:10.48550/arXiv.1409.6110 , abstract =

  27. [27]

    Gamification of

    Degenne, Rémy and Ménard, Pierre and Shang, Xuedong and Valko, Michal , month = jul, year =. Gamification of. doi:10.48550/arXiv.2007.00953 , abstract =

  28. [28]

    Athey, Susan and Wager, Stefan , month = sep, year =. Policy. doi:10.48550/arXiv.1702.02896 , abstract =

  29. [29]

    Operations Research , author =

    Always. Operations Research , author =. 2022 , pages =. doi:10.1287/opre.2021.2135 , abstract =

  30. [30]

    Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

    Wang, Yu-Xiang and Agarwal, Alekh and Dudik, Miroslav , month = nov, year =. Optimal and. doi:10.48550/arXiv.1612.01205 , abstract =

  31. [31]

    Adaptive

    Leiner, James and Dunn, Robin and Ramdas, Aaditya , month = feb, year =. Adaptive. doi:10.48550/arXiv.2509.14218 , abstract =

  32. [32]

    Proceedings of the AAAI Conference on Artificial Intelligence , author =

    High-. Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2015 , file =. doi:10.1609/aaai.v29i1.9541 , abstract =

  33. [33]

    Multiple Identifications in Multi-Armed Bandits

    Bubeck, Sébastien and Wang, Tengyao and Viswanathan, Nitin , month = may, year =. Multiple. doi:10.48550/arXiv.1205.3181 , abstract =

  34. [34]

    Top-m identification for linear bandits , url =

    Réda, Clémence and Kaufmann, Emilie and Delahaye-Duriez, Andrée , month = mar, year =. Top-m identification for linear bandits , url =. doi:10.48550/arXiv.2103.10070 , abstract =

  35. [35]

    Jun, Kwang-Sung and Nowak, Robert , editor =. Anytime. Proceedings of. 2016 , pages =

  36. [36]

    An optimal algorithm for the Thresholding Bandit Problem

    Locatelli, Andrea and Gutzeit, Maurilio and Carpentier, Alexandra , month = may, year =. An optimal algorithm for the. doi:10.48550/arXiv.1605.08671 , abstract =