Anytime-valid Optimal Policy Identification

Daniel Molitor

arxiv: 2606.17515 · v1 · pith:STY4O33Bnew · submitted 2026-06-16 · 📊 stat.ME · stat.ML

Anytime-valid Optimal Policy Identification

Daniel Molitor This is my paper

Pith reviewed 2026-06-26 23:46 UTC · model grok-4.3

classification 📊 stat.ME stat.ML

keywords optimal policy identificationanytime-valid inferencecontextual banditslogged datastopping timesample complexitypolicy evaluationuniform coverage

0 comments

The pith

A time-indexed set retains the true optimal policy with high probability at every time step, allowing data-dependent stopping for identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an anytime-valid procedure for selecting the best policy from a finite class using data generated by an external logging policy in contextual bandits. It constructs a sequence of sets that contain the optimal policy set uniformly over time with high probability, so the analyst can eliminate inferior policies and stop sampling once the optimum is isolated without invalidating the guarantees. When a unique optimum exists, the method supplies an explicit stopping time whose expected sample size scales as O((log |Π| + log log(1/Δ_min))/Δ_min²). Simulations indicate the approach can stop earlier than fixed-sample designs, and the framework is illustrated on data from an adaptive experiment measuring policy effects on misinformation.

Core claim

When the optimal policy is unique, we define a stopping time for its identification and derive a sample-complexity bound scaling as O((log |Π| + log log(1/Δ_min))/Δ_min²), where Δ_min is the gap between the best and second-best policy values. The construction rests on a time-indexed set S_t that retains the true optimal policy set uniformly over time with high probability under a fixed but uncontrolled logging policy.

What carries the argument

The time-indexed set S_t that retains the true optimal policy set uniformly over time with high probability under a fixed logging policy.

If this is right

The analyst can monitor policy values continuously and eliminate clearly suboptimal policies while preserving valid inference.
When the optimum is unique the procedure stops at a data-dependent time whose expected size is bounded by O((log |Π| + log log(1/Δ_min))/Δ_min²).
Simulations show the anytime-valid rule can produce substantial sample savings relative to any fixed-N design.
The same construction supplies a dynamic view of accumulating evidence in an applied adaptive experiment on misinformation reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform-over-time guarantee may extend to slowly changing logging policies if the construction of S_t is adapted to track the instantaneous logging distribution.
The sample-complexity bound suggests the method could be competitive with fixed-sample optimal-policy identification when the number of candidate policies is moderate and gaps are not tiny.
Because the procedure eliminates policies sequentially, it may reduce computational cost in large policy classes by avoiding evaluation of already-eliminated candidates at later times.

Load-bearing premise

The time-indexed set S_t can be constructed to retain the true optimal policy set uniformly over time with high probability under a fixed but uncontrolled logging policy.

What would settle it

In repeated simulations with a known unique optimum and known gap Δ_min, the fraction of runs in which S_t fails to contain the optimum at the claimed stopping time exceeds the nominal error probability.

Figures

Figures reproduced from arXiv: 2606.17515 by Daniel Molitor.

**Figure 1.** Figure 1: Sequential elimination procedure. Construct policy value CSs from logged contextual bandit data. Use these CSs to construct the candidate set St . Policies are eliminated when their upper confidence bounds fall below the best lower confidence bound; once St becomes a singleton, the remaining policy is selected. This formulation of an optimal policy candidate set extends naturally to related identificatio… view at source ↗

**Figure 2.** Figure 2: demonstrates how our framework facilitates simultaneous inference on both the optimal policy π ⋆ and the underlying policy values ν(π). The top panel displays the progressive evolution of St as data accumulate: clearly suboptimal policies are eliminated early, while the optimal policy π ⋆ remains in St throughout and is ultimately uniquely identified. The bottom panel shows the corresponding CSs for the po… view at source ↗

**Figure 3.** Figure 3: Sample savings under target gap misspecification. The x-axis shows the ratio of the true minimum suboptimality gap to the target gap used to determine the oracle fixed sample size; the y-axis shows mean sample savings 1 − E[τ]/N90, averaged across target gap values, with ±1 standard error ribbon. Even when the true gap exceeds the target gap by only 25%, the anytime-valid procedure 12 [PITH_FULL_IMAGE:fig… view at source ↗

**Figure 4.** Figure 4: Anytime-valid optimal policy identification in the infodemic study. The heatmap displays St , the candidate set of optimal policies, over N = 15,292 observations. Policies surviving to the end of the study (never eliminated) are shown at the top; remaining policies are ordered by their elimination time, latest to earliest [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

read the original abstract

We develop an anytime-valid framework for optimal policy identification from logged contextual bandit data. In many applied settings, the analyst wants to select the optimal policy from a candidate policy class $\Pi$, but data are generated by an externally determined logging policy that they do not control. The analyst may also wish to monitor evidence continuously, stopping once the optimal policy is clear rather than committing to a fixed sample size in advance. This paper addresses these challenges by constructing a time-indexed set $S_t$ that retains the true optimal policy set uniformly over time with high probability. The resulting procedure allows the analyst to monitor policy values, eliminate clearly suboptimal policies, and stop at data-dependent times without invalidating inference. When the optimal policy is unique, we define a stopping time for its identification and derive a sample-complexity bound scaling as $O\!\left(\frac{\log |\Pi|+\log\log(1/\Delta_{\min})}{\Delta_{\min}^2}\right)$, where $\Delta_{\min}$ is the gap between the best and second-best policy values. Simulations demonstrate that the anytime-valid approach can yield substantial sample savings relative to fixed-$N$ designs. An application to a large adaptive experiment on reducing misinformation online illustrates how the method provides a dynamic view as evidence on the optimal policy accumulates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds a time-indexed set S_t that keeps the optimal policy inside it uniformly over t, yielding a stopping time with the stated gap-dependent sample complexity when the optimum is unique.

read the letter

The main contribution is a construction for a time-indexed set S_t that retains the true optimal policy uniformly over time with high probability under a fixed logging policy. This lets the analyst monitor evidence continuously and stop at a data-dependent time without invalidating the inference on which policy is best.

The approach fits a real need. Many contextual bandit experiments run with an externally fixed logging policy, and analysts often want to look at accumulating data rather than commit to a preset N. The resulting stopping time has the bound O((log |Π| + log log(1/Δ_min))/Δ_min²) when the optimum is unique, which adds the extra log-log term for the anytime guarantee but otherwise tracks standard gap-dependent rates. Simulations are said to show sample savings over fixed-N designs, and the misinformation experiment gives a concrete illustration.

The uniform-over-t property and the explicit stopping-time bound appear to be the new pieces relative to fixed-sample policy identification. The abstract presents the bound as derived rather than fitted, with no obvious circularity.

The soft spot is that the abstract states the existence of S_t and the bound without a derivation or assumption list. The full paper needs to show whether the uniform concentration holds under only the stated conditions on the logging policy and policy class, or whether extra restrictions are required. If the proof is a direct application of standard tools, the claim is fine; if it needs something stronger, that should be flagged.

This is for researchers working on adaptive experimentation and sequential analysis with logged data. A reader who needs valid early stopping from observational bandit data will find the stopping rule and bound useful. The work is coherent on its own terms and deserves a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper develops an anytime-valid framework for optimal policy identification from logged contextual bandit data generated by an uncontrolled logging policy. It constructs a time-indexed set S_t that retains the true optimal policy set uniformly over time with high probability, enabling continuous monitoring, elimination of suboptimal policies, and data-dependent stopping without invalidating inference. When the optimal policy is unique, a stopping time is defined with sample-complexity bound O((log |Π| + log log(1/Δ_min))/Δ_min²), where Δ_min is the gap to the second-best policy. The claims are supported by simulations showing sample savings over fixed-N designs and an application to an adaptive experiment on reducing online misinformation.

Significance. If the uniform retention property of S_t and the derived bound hold under the stated conditions, the work would be significant for sequential policy learning in settings where sample sizes cannot be fixed in advance. The bound recovers the standard dependence on 1/Δ_min² up to logarithmic factors, with the extra log log term arising naturally from anytime validity; this is a reasonable tradeoff. The empirical components (simulations and real-data application) provide concrete evidence of practical utility in adaptive experiments.

major comments (2)

[Abstract] The central construction of the time-indexed set S_t (abstract, paragraph 2) is asserted to retain the true optimal policy uniformly over t with high probability under a fixed logging policy, but the abstract provides no derivation, assumptions on reward variance or context distribution, or proof sketch. This property is load-bearing for both the stopping time and the sample-complexity bound; the full manuscript must supply the explicit construction and the uniform concentration argument.
[Abstract] The sample-complexity bound O((log |Π| + log log(1/Δ_min))/Δ_min²) is stated for the unique-optimum case (abstract, final paragraph), but it is unclear whether the log log term arises from a standard union-bound or from a more refined anytime-valid martingale argument; the manuscript should clarify the precise inequality used and confirm that the bound remains valid when the logging policy is fixed and uncontrolled.

minor comments (2)

[Abstract] Notation for the policy class Π and the gap Δ_min should be introduced with explicit definitions before the bound is stated.
The simulations section would benefit from reporting the exact logging policy used and the number of Monte Carlo replications to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation of minor revision. We address the two major comments below with clarifications and targeted revisions to improve readability of the abstract and main text.

read point-by-point responses

Referee: [Abstract] The central construction of the time-indexed set S_t (abstract, paragraph 2) is asserted to retain the true optimal policy uniformly over t with high probability under a fixed logging policy, but the abstract provides no derivation, assumptions on reward variance or context distribution, or proof sketch. This property is load-bearing for both the stopping time and the sample-complexity bound; the full manuscript must supply the explicit construction and the uniform concentration argument.

Authors: The full manuscript supplies the explicit construction of S_t in Section 3 (Definition 1 and Algorithm 1), the uniform concentration argument via a time-uniform Freedman-type martingale inequality (Theorem 1), and the required assumptions (bounded rewards with variance proxy 1/4, fixed uncontrolled logging policy, and context distribution in Assumption 1). The abstract is intentionally concise per journal norms, but we have revised it to include a parenthetical pointer to Section 3 and the key assumption of bounded rewards. No derivation is added to the abstract itself due to length constraints, but the load-bearing property is now cross-referenced. revision: yes
Referee: [Abstract] The sample-complexity bound O((log |Π| + log log(1/Δ_min))/Δ_min²) is stated for the unique-optimum case (abstract, final paragraph), but it is unclear whether the log log term arises from a standard union-bound or from a more refined anytime-valid martingale argument; the manuscript should clarify the precise inequality used and confirm that the bound remains valid when the logging policy is fixed and uncontrolled.

Authors: The log log(1/Δ_min) term arises from the refined anytime-valid martingale argument (a peeling argument over dyadic time scales combined with a Freedman-type inequality, see the proof of Theorem 3), not a naive union bound. We have revised the manuscript to state the precise inequality (the time-uniform bound in Equation (4) of Section 3) and explicitly confirm validity under the fixed uncontrolled logging policy, as all concentration holds conditionally on the observed contexts and actions generated by that policy. The bound derivation is unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The abstract presents the construction of the time-indexed set S_t as a methodological step that yields a uniform high-probability guarantee, from which the stopping time and the O((log |Π| + log log(1/Δ_min))/Δ_min²) sample-complexity bound are then derived when the optimum is unique. This bound takes the standard form arising from concentration inequalities under a fixed gap Δ_min and policy class size, without any indication that it reduces by construction to a fitted parameter or to a self-referential definition of S_t itself. No load-bearing self-citation, ansatz smuggling, or renaming of known results is visible in the provided description. The central claim therefore retains independent content from the uniform guarantee and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit constants, lemmas, or new objects are named.

pith-pipeline@v0.9.1-grok · 5746 in / 1143 out tokens · 44315 ms · 2026-06-26T23:46:29.875743+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 35 canonical work pages · 8 internal anchors

[1]

Anytime-valid off-policy inference for contextual bandits , url =

Waudby-Smith, Ian and Wu, Lili and Ramdas, Aaditya and Karampatziakis, Nikos and Mineiro, Paul , month = aug, year =. Anytime-valid off-policy inference for contextual bandits , url =. doi:10.48550/arXiv.2210.10768 , abstract =

work page doi:10.48550/arxiv.2210.10768
[2]

Off-policy

Karampatziakis, Nikos and Mineiro, Paul and Ramdas, Aaditya , month = feb, year =. Off-policy. doi:10.48550/arXiv.2102.09540 , abstract =

work page doi:10.48550/arxiv.2102.09540
[3]

Off-policy estimation with adaptively collected data: the power of online learning , isbn =

Lee, Jeonghwan and Ma, Cong , year =. Off-policy estimation with adaptively collected data: the power of online learning , isbn =. Advances in. doi:10.52202/079017-4255 , urldate =

work page doi:10.52202/079017-4255
[4]

Instance-optimal

Li, Zhaoqi and Ratliff, Lillian and Nassif, Houssam and Jamieson, Kevin and Jain, Lalit , month = jul, year =. Instance-optimal. doi:10.48550/arXiv.2207.02357 , abstract =

work page doi:10.48550/arxiv.2207.02357
[5]

Garivier, Aurélien and Kaufmann, Emilie , month = jun, year =. Optimal. doi:10.48550/arXiv.1602.04589 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1602.04589
[6]

Tirinzoni, Andrea and Degenne, Rémy , month = oct, year =. On. doi:10.48550/arXiv.2205.10936 , abstract =

work page doi:10.48550/arxiv.2205.10936
[7]

Anytime-

Molitor, Daniel and Gold, Samantha , month = jun, year =. Anytime-. doi:10.48550/arXiv.2506.20523 , abstract =

work page doi:10.48550/arxiv.2506.20523
[8]

Jamieson, Kevin and Malloy, Matthew and Nowak, Robert and Bubeck, Sébastien , month = dec, year =. lil'. doi:10.48550/arXiv.1312.7308 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1312.7308
[9]

Confident

Kuzborskij, Ilja and Vernade, Claire and György, András and Szepesvári, Csaba , month = mar, year =. Confident. doi:10.48550/arXiv.2006.10460 , abstract =

work page doi:10.48550/arxiv.2006.10460 2006
[10]

Uehara, Masatoshi and Shi, Chengchun and Kallus, Nathan , month = dec, year =. A. doi:10.48550/arXiv.2212.06355 , abstract =

work page doi:10.48550/arxiv.2212.06355
[11]

McDiarmid, C

Time-uniform, nonparametric, nonasymptotic confidence sequences , volume =. The Annals of Statistics , author =. 2021 , note =. doi:10.1214/20-AOS1991 , abstract =

work page doi:10.1214/20-aos1991 2021
[12]

Instance-

Wagenmaker, Andrew and Jamieson, Kevin , month = jul, year =. Instance-. doi:10.48550/arXiv.2207.02575 , abstract =

work page doi:10.48550/arxiv.2207.02575
[13]

Al-Marjani, Aymen and Tirinzoni, Andrea and Kaufmann, Emilie , month = oct, year =. Towards. doi:10.48550/arXiv.2311.05638 , abstract =

work page doi:10.48550/arxiv.2311.05638
[14]

Anytime-

Lindon, Michael and Ham, Dae Woong and Tingley, Martin and Bojinov, Iavor , month = jul, year =. Anytime-. doi:10.48550/arXiv.2210.08589 , abstract =

work page doi:10.48550/arxiv.2210.08589
[15]

Mason, Blake and Jain, Lalit and Tripathy, Ardhendu and Nowak, Robert , month = sep, year =. Finding. doi:10.48550/arXiv.2006.08850 , abstract =

work page doi:10.48550/arxiv.2006.08850 2006
[16]

Bibaut, Aurélien and Chambaz, Antoine and Dimakopoulou, Maria and Kallus, Nathan and Laan, Mark van der , month = jun, year =. Post-. doi:10.48550/arXiv.2106.00418 , abstract =

work page doi:10.48550/arxiv.2106.00418
[17]

and Zhan, Ruohan and Wager, Stefan and Athey, Susan , month = feb, year =

Hadad, Vitor and Hirshberg, David A. and Zhan, Ruohan and Wager, Stefan and Athey, Susan , month = feb, year =. Confidence. doi:10.48550/arXiv.1911.02768 , abstract =

work page doi:10.48550/arxiv.1911.02768 1911
[18]

and Janson, Lucas and Murphy, Susan A

Zhang, Kelly W. and Janson, Lucas and Murphy, Susan A. , month = nov, year =. Statistical. doi:10.48550/arXiv.2104.14074 , abstract =

work page doi:10.48550/arxiv.2104.14074
[19]

Design of

Zanette, Andrea and Dong, Kefan and Lee, Jonathan and Brunskill, Emma , month = jul, year =. Design of. doi:10.48550/arXiv.2107.09912 , abstract =

work page doi:10.48550/arxiv.2107.09912
[20]

Jedra, Yassir and Proutiere, Alexandre , month = jun, year =. Optimal. doi:10.48550/arXiv.2006.16073 , abstract =

work page doi:10.48550/arxiv.2006.16073 2006
[21]

Dudik, Miroslav and Langford, John and Li, Lihong , month = may, year =. Doubly. doi:10.48550/arXiv.1103.4601 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1103.4601
[22]

Sequential Experimental Design for Transductive Linear Bandits

Fiez, Tanner and Jain, Lalit and Jamieson, Kevin and Ratliff, Lillian , month = jun, year =. Sequential. doi:10.48550/arXiv.1906.08399 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1906.08399 1906
[23]

Nature Human Behaviour , author =

Battling the coronavirus ‘infodemic’ among social media users in. Nature Human Behaviour , author =. 2024 , pages =. doi:10.1038/s41562-023-01810-7 , language =

work page doi:10.1038/s41562-023-01810-7 2024
[24]

Journal of Human Resources , author =

More. Journal of Human Resources , author =. 2019 , pages =. doi:10.3368/jhr.54.3.0317-8637R , language =

work page doi:10.3368/jhr.54.3.0317-8637r 2019
[25]

Schapire

Li, Lihong and Chu, Wei and Langford, John and Schapire, Robert E. , month = apr, year =. A. Proceedings of the 19th international conference on. doi:10.1145/1772690.1772758 , abstract =

work page doi:10.1145/1772690.1772758
[26]

Soare, Marta and Lazaric, Alessandro and Munos, Rémi , month = nov, year =. Best-. doi:10.48550/arXiv.1409.6110 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1409.6110
[27]

Gamification of

Degenne, Rémy and Ménard, Pierre and Shang, Xuedong and Valko, Michal , month = jul, year =. Gamification of. doi:10.48550/arXiv.2007.00953 , abstract =

work page doi:10.48550/arxiv.2007.00953 2007
[28]

Athey, Susan and Wager, Stefan , month = sep, year =. Policy. doi:10.48550/arXiv.1702.02896 , abstract =

work page doi:10.48550/arxiv.1702.02896
[29]

Operations Research , author =

Always. Operations Research , author =. 2022 , pages =. doi:10.1287/opre.2021.2135 , abstract =

work page doi:10.1287/opre.2021.2135 2022
[30]

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Wang, Yu-Xiang and Agarwal, Alekh and Dudik, Miroslav , month = nov, year =. Optimal and. doi:10.48550/arXiv.1612.01205 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1612.01205
[31]

Adaptive

Leiner, James and Dunn, Robin and Ramdas, Aaditya , month = feb, year =. Adaptive. doi:10.48550/arXiv.2509.14218 , abstract =

work page doi:10.48550/arxiv.2509.14218
[32]

Proceedings of the AAAI Conference on Artificial Intelligence , author =

High-. Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2015 , file =. doi:10.1609/aaai.v29i1.9541 , abstract =

work page doi:10.1609/aaai.v29i1.9541 2015
[33]

Multiple Identifications in Multi-Armed Bandits

Bubeck, Sébastien and Wang, Tengyao and Viswanathan, Nitin , month = may, year =. Multiple. doi:10.48550/arXiv.1205.3181 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1205.3181
[34]

Top-m identification for linear bandits , url =

Réda, Clémence and Kaufmann, Emilie and Delahaye-Duriez, Andrée , month = mar, year =. Top-m identification for linear bandits , url =. doi:10.48550/arXiv.2103.10070 , abstract =

work page doi:10.48550/arxiv.2103.10070
[35]

Jun, Kwang-Sung and Nowak, Robert , editor =. Anytime. Proceedings of. 2016 , pages =

2016
[36]

An optimal algorithm for the Thresholding Bandit Problem

Locatelli, Andrea and Gutzeit, Maurilio and Carpentier, Alexandra , month = may, year =. An optimal algorithm for the. doi:10.48550/arXiv.1605.08671 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1605.08671

[1] [1]

Anytime-valid off-policy inference for contextual bandits , url =

Waudby-Smith, Ian and Wu, Lili and Ramdas, Aaditya and Karampatziakis, Nikos and Mineiro, Paul , month = aug, year =. Anytime-valid off-policy inference for contextual bandits , url =. doi:10.48550/arXiv.2210.10768 , abstract =

work page doi:10.48550/arxiv.2210.10768

[2] [2]

Off-policy

Karampatziakis, Nikos and Mineiro, Paul and Ramdas, Aaditya , month = feb, year =. Off-policy. doi:10.48550/arXiv.2102.09540 , abstract =

work page doi:10.48550/arxiv.2102.09540

[3] [3]

Off-policy estimation with adaptively collected data: the power of online learning , isbn =

Lee, Jeonghwan and Ma, Cong , year =. Off-policy estimation with adaptively collected data: the power of online learning , isbn =. Advances in. doi:10.52202/079017-4255 , urldate =

work page doi:10.52202/079017-4255

[4] [4]

Instance-optimal

Li, Zhaoqi and Ratliff, Lillian and Nassif, Houssam and Jamieson, Kevin and Jain, Lalit , month = jul, year =. Instance-optimal. doi:10.48550/arXiv.2207.02357 , abstract =

work page doi:10.48550/arxiv.2207.02357

[5] [5]

Garivier, Aurélien and Kaufmann, Emilie , month = jun, year =. Optimal. doi:10.48550/arXiv.1602.04589 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1602.04589

[6] [6]

Tirinzoni, Andrea and Degenne, Rémy , month = oct, year =. On. doi:10.48550/arXiv.2205.10936 , abstract =

work page doi:10.48550/arxiv.2205.10936

[7] [7]

Anytime-

Molitor, Daniel and Gold, Samantha , month = jun, year =. Anytime-. doi:10.48550/arXiv.2506.20523 , abstract =

work page doi:10.48550/arxiv.2506.20523

[8] [8]

Jamieson, Kevin and Malloy, Matthew and Nowak, Robert and Bubeck, Sébastien , month = dec, year =. lil'. doi:10.48550/arXiv.1312.7308 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1312.7308

[9] [9]

Confident

Kuzborskij, Ilja and Vernade, Claire and György, András and Szepesvári, Csaba , month = mar, year =. Confident. doi:10.48550/arXiv.2006.10460 , abstract =

work page doi:10.48550/arxiv.2006.10460 2006

[10] [10]

Uehara, Masatoshi and Shi, Chengchun and Kallus, Nathan , month = dec, year =. A. doi:10.48550/arXiv.2212.06355 , abstract =

work page doi:10.48550/arxiv.2212.06355

[11] [11]

McDiarmid, C

Time-uniform, nonparametric, nonasymptotic confidence sequences , volume =. The Annals of Statistics , author =. 2021 , note =. doi:10.1214/20-AOS1991 , abstract =

work page doi:10.1214/20-aos1991 2021

[12] [12]

Instance-

Wagenmaker, Andrew and Jamieson, Kevin , month = jul, year =. Instance-. doi:10.48550/arXiv.2207.02575 , abstract =

work page doi:10.48550/arxiv.2207.02575

[13] [13]

Al-Marjani, Aymen and Tirinzoni, Andrea and Kaufmann, Emilie , month = oct, year =. Towards. doi:10.48550/arXiv.2311.05638 , abstract =

work page doi:10.48550/arxiv.2311.05638

[14] [14]

Anytime-

Lindon, Michael and Ham, Dae Woong and Tingley, Martin and Bojinov, Iavor , month = jul, year =. Anytime-. doi:10.48550/arXiv.2210.08589 , abstract =

work page doi:10.48550/arxiv.2210.08589

[15] [15]

Mason, Blake and Jain, Lalit and Tripathy, Ardhendu and Nowak, Robert , month = sep, year =. Finding. doi:10.48550/arXiv.2006.08850 , abstract =

work page doi:10.48550/arxiv.2006.08850 2006

[16] [16]

Bibaut, Aurélien and Chambaz, Antoine and Dimakopoulou, Maria and Kallus, Nathan and Laan, Mark van der , month = jun, year =. Post-. doi:10.48550/arXiv.2106.00418 , abstract =

work page doi:10.48550/arxiv.2106.00418

[17] [17]

and Zhan, Ruohan and Wager, Stefan and Athey, Susan , month = feb, year =

Hadad, Vitor and Hirshberg, David A. and Zhan, Ruohan and Wager, Stefan and Athey, Susan , month = feb, year =. Confidence. doi:10.48550/arXiv.1911.02768 , abstract =

work page doi:10.48550/arxiv.1911.02768 1911

[18] [18]

and Janson, Lucas and Murphy, Susan A

Zhang, Kelly W. and Janson, Lucas and Murphy, Susan A. , month = nov, year =. Statistical. doi:10.48550/arXiv.2104.14074 , abstract =

work page doi:10.48550/arxiv.2104.14074

[19] [19]

Design of

Zanette, Andrea and Dong, Kefan and Lee, Jonathan and Brunskill, Emma , month = jul, year =. Design of. doi:10.48550/arXiv.2107.09912 , abstract =

work page doi:10.48550/arxiv.2107.09912

[20] [20]

Jedra, Yassir and Proutiere, Alexandre , month = jun, year =. Optimal. doi:10.48550/arXiv.2006.16073 , abstract =

work page doi:10.48550/arxiv.2006.16073 2006

[21] [21]

Dudik, Miroslav and Langford, John and Li, Lihong , month = may, year =. Doubly. doi:10.48550/arXiv.1103.4601 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1103.4601

[22] [22]

Sequential Experimental Design for Transductive Linear Bandits

Fiez, Tanner and Jain, Lalit and Jamieson, Kevin and Ratliff, Lillian , month = jun, year =. Sequential. doi:10.48550/arXiv.1906.08399 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1906.08399 1906

[23] [23]

Nature Human Behaviour , author =

Battling the coronavirus ‘infodemic’ among social media users in. Nature Human Behaviour , author =. 2024 , pages =. doi:10.1038/s41562-023-01810-7 , language =

work page doi:10.1038/s41562-023-01810-7 2024

[24] [24]

Journal of Human Resources , author =

More. Journal of Human Resources , author =. 2019 , pages =. doi:10.3368/jhr.54.3.0317-8637R , language =

work page doi:10.3368/jhr.54.3.0317-8637r 2019

[25] [25]

Schapire

Li, Lihong and Chu, Wei and Langford, John and Schapire, Robert E. , month = apr, year =. A. Proceedings of the 19th international conference on. doi:10.1145/1772690.1772758 , abstract =

work page doi:10.1145/1772690.1772758

[26] [26]

Soare, Marta and Lazaric, Alessandro and Munos, Rémi , month = nov, year =. Best-. doi:10.48550/arXiv.1409.6110 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1409.6110

[27] [27]

Gamification of

Degenne, Rémy and Ménard, Pierre and Shang, Xuedong and Valko, Michal , month = jul, year =. Gamification of. doi:10.48550/arXiv.2007.00953 , abstract =

work page doi:10.48550/arxiv.2007.00953 2007

[28] [28]

Athey, Susan and Wager, Stefan , month = sep, year =. Policy. doi:10.48550/arXiv.1702.02896 , abstract =

work page doi:10.48550/arxiv.1702.02896

[29] [29]

Operations Research , author =

Always. Operations Research , author =. 2022 , pages =. doi:10.1287/opre.2021.2135 , abstract =

work page doi:10.1287/opre.2021.2135 2022

[30] [30]

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Wang, Yu-Xiang and Agarwal, Alekh and Dudik, Miroslav , month = nov, year =. Optimal and. doi:10.48550/arXiv.1612.01205 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1612.01205

[31] [31]

Adaptive

Leiner, James and Dunn, Robin and Ramdas, Aaditya , month = feb, year =. Adaptive. doi:10.48550/arXiv.2509.14218 , abstract =

work page doi:10.48550/arxiv.2509.14218

[32] [32]

Proceedings of the AAAI Conference on Artificial Intelligence , author =

High-. Proceedings of the AAAI Conference on Artificial Intelligence , author =. 2015 , file =. doi:10.1609/aaai.v29i1.9541 , abstract =

work page doi:10.1609/aaai.v29i1.9541 2015

[33] [33]

Multiple Identifications in Multi-Armed Bandits

Bubeck, Sébastien and Wang, Tengyao and Viswanathan, Nitin , month = may, year =. Multiple. doi:10.48550/arXiv.1205.3181 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1205.3181

[34] [34]

Top-m identification for linear bandits , url =

Réda, Clémence and Kaufmann, Emilie and Delahaye-Duriez, Andrée , month = mar, year =. Top-m identification for linear bandits , url =. doi:10.48550/arXiv.2103.10070 , abstract =

work page doi:10.48550/arxiv.2103.10070

[35] [35]

Jun, Kwang-Sung and Nowak, Robert , editor =. Anytime. Proceedings of. 2016 , pages =

2016

[36] [36]

An optimal algorithm for the Thresholding Bandit Problem

Locatelli, Andrea and Gutzeit, Maurilio and Carpentier, Alexandra , month = may, year =. An optimal algorithm for the. doi:10.48550/arXiv.1605.08671 , abstract =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1605.08671