A new approach to Poissonian two-armed bandit problem
Pith reviewed 2026-05-24 21:58 UTC · model grok-4.3
The pith
Bayesian piece-wise constant strategies for the continuous-time two-armed Poisson bandit are determined by recursive equations that depend only on the observed history of arrivals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Bayesian risk and the optimal piece-wise constant strategy for the Poissonian two-armed bandit can be obtained from recursive relations written directly in terms of the observed history of the two counting processes; the limiting case is governed by a partial differential equation.
What carries the argument
Recursive equations for the Bayesian piece-wise constant strategy and risk expressed in terms of the current history of the Poisson processes.
If this is right
- The optimal policy and its risk become computable from the sequence of arrival times alone.
- The method applies to any prior distribution on the two Poisson rates.
- The limiting partial differential equation supplies an analytic description when the time horizon or rate parameters become large.
- Switching decisions depend only on the counts observed so far and the elapsed times.
Where Pith is reading between the lines
- Numerical solution of the recursions could be implemented by discretizing the possible histories.
- The same history-based reduction might apply to other marked point processes beyond homogeneous Poisson.
- If the recursions admit closed-form solutions for certain priors, they would yield explicit threshold rules for switching.
Load-bearing premise
Restricting attention to piece-wise constant strategies incurs no loss of optimality and the Bayesian risk under an arbitrary prior can be expressed solely through the observed history without additional state variables.
What would settle it
A direct computation showing that some non-piece-wise-constant strategy, or a strategy that conditions on extra posterior statistics, achieves strictly lower expected risk than the history-based solution would falsify the claim.
read the original abstract
We consider a continuous time two-armed bandit problem in which incomes are described by Poissonian processes. We develop Bayesian approach with arbitrary prior distribution. We present two versions of recursive equation for determination of Bayesian piece-wise constant strategy and Bayesian risk and partial differential equation in the limiting case. Unlike the previously considered Bayesian settings our description uses current history of the process and not evolution of the posterior distribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a Bayesian approach to the continuous-time two-armed Poisson bandit problem for arbitrary priors. It derives two versions of recursive equations that determine a Bayesian piece-wise constant strategy and the associated Bayesian risk, together with a limiting PDE; the state is taken to be the observed history of the two Poisson processes rather than the evolving posterior measure.
Significance. If the recursions are valid and the history filtration is indeed sufficient, the framework would allow direct computation of optimal switching times and risk for non-conjugate priors without maintaining an auxiliary posterior state. The paper supplies no machine-checked proofs, reproducible code, or explicit verification that the claimed recursions recover the true Bayesian value.
major comments (2)
- [Abstract] Abstract and § on recursive equations: the central claim that the history filtration alone supplies a sufficient statistic for the conditional expected remaining reward (for arbitrary priors) is asserted without proof or counter-example; for non-conjugate priors the posterior is typically infinite-dimensional and the optimal switching boundary may depend on the full measure, not merely the observed counts.
- [Abstract] Abstract, paragraph 3: the restriction to piece-wise constant strategies is introduced without showing that it incurs no loss of optimality relative to the full class of history-adapted strategies; this assumption is load-bearing for the claim that the recursions solve the original problem.
minor comments (2)
- Notation for the two versions of the recursion is introduced without an explicit statement of the filtration or the precise definition of the value function on the history space.
- The limiting PDE is stated but the passage from the discrete recursion to the continuous PDE is not derived in detail.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the constructive comments on the sufficiency of the history filtration and the restriction to piece-wise constant strategies. We address each point below.
read point-by-point responses
-
Referee: [Abstract] Abstract and § on recursive equations: the central claim that the history filtration alone supplies a sufficient statistic for the conditional expected remaining reward (for arbitrary priors) is asserted without proof or counter-example; for non-conjugate priors the posterior is typically infinite-dimensional and the optimal switching boundary may depend on the full measure, not merely the observed counts.
Authors: For any fixed prior, the posterior at time t is obtained by Bayes' rule applied to the likelihood of the observed history (jump times and arm identities). Hence the conditional expectation of remaining reward given the current posterior is a deterministic functional of that history. The recursive equations are written directly on the history filtration, which therefore serves as a sufficient statistic without requiring an auxiliary parametrization of the posterior. This construction is valid for arbitrary (including non-conjugate) priors; the resulting value function is simply a map from histories to reals. We will insert a short clarifying paragraph stating the equivalence explicitly. revision: yes
-
Referee: [Abstract] Abstract, paragraph 3: the restriction to piece-wise constant strategies is introduced without showing that it incurs no loss of optimality relative to the full class of history-adapted strategies; this assumption is load-bearing for the claim that the recursions solve the original problem.
Authors: Piece-wise constant strategies decide the arm only at the event times of the merged Poisson process. Because of the memoryless property of the Poisson processes, the future evolution depends on the past only through the current history; therefore the dynamic-programming recursion on the history filtration yields the optimum within this class. We agree that an explicit argument that the global optimum over all adapted strategies coincides with the optimum inside the piece-wise constant class would strengthen the presentation. We will add a brief justification or reference to the relevant continuous-time bandit literature in the revision. revision: yes
Circularity Check
No circularity; derivation presented as self-contained without quoted reductions
full rationale
The provided abstract and description introduce recursive equations and a limiting PDE for piece-wise constant Bayesian strategies that depend on process history rather than posterior evolution. No equations, self-citations, or derivations are quoted that reduce a claimed prediction or result to a fitted input or prior assumption by construction. The restriction to piece-wise constant strategies and sufficiency of history are stated as modeling choices without evidence of self-definitional closure or load-bearing self-citation chains. The approach is therefore treated as self-contained against external benchmarks for the purpose of this analysis.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Incomes are described by independent Poisson processes with unknown rates
- domain assumption A Bayesian piece-wise constant strategy exists and its risk satisfies a recursive equation
Reference graph
Works this paper leans on
-
[1]
A new approach to Poissonian two-armed bandit problem
Introduction. We consider a continuous time two-armed bandit problem. This setting results either in Poissonian or in a diffusion two-armed bandit. Quite general Poissonian two-armed bandit was considered in [1, 2]. In [3] consideration of Poissonian and diffusion bandit problems is restricted to the case of independent arms and discounted rewards. An inter...
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[2]
Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ
because we recalculate Bayesian risk with respect to current statistics (X1,t 1,X 2,t 2) and in [1], [2] recalculations are implemented with respect to current posterior distribution and t = t1 +t2. Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ. The approach presented in [1], [2] is applied to finite sets of parameters and g...
-
[3]
Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}
Recursive equation. Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}. To this end, we assume that control horizon is parti- tioned into a number of intervals of the length ∆ on which the chosen action does not change. Hence, T =N∆ and for any n1 +n2 =n<N , t1 =n1∆, t2 =n2∆ we have Pr(y(t′) = ℓ) = σℓ(X1,t 1,X 2,t 2) where σℓ(X1,t 1,X 2,t ...
-
[4]
In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary
ifR(ℓ)(X1,t 1,X 2,t 2) has smaller value. In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary
-
[5]
In this section, we ob- tain another version of recursive Bellman-type equation
Another version of recursive equation. In this section, we ob- tain another version of recursive Bellman-type equation. Let’s denote ˜R(X1,t 1,X 2,t 2) =R(X1,t 1,X 2,t 2)×µ(X1,t 1,X 2,t 2), where{R(X1,t 1,X 2,t 2)} are Bayesian risks calculated with respect to the posterior distribution (2.1) and{µ(X1,t 1,X 2,t 2)} are defined in (2.2). Then the following ...
-
[6]
In this section, we consider the case when ∆ has a small value
A limiting description. In this section, we consider the case when ∆ has a small value. In this case (3.3) takes the form ˜R(1)(X1,t 1,X 2,t 2) =g(1)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1 + ∆,X 2,t 2)− ˜R(X1,t 1 + ∆,X 2,t 2)X1t−1 1 ∆ + ˜R(X1 + 1,t 1 + ∆,X 2,t 2)(X1 + 1)t−1 1 ∆ +o(∆), ˜R(2)(X1,t 1,X 2,t 2) =g(2)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1,X 2,t 2 + ∆)− ˜R(X1,t 1,...
-
[7]
Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Infor- mation: Bayesian Approach, Academic Press, New York
work page 1990
-
[8]
Presman, E. L. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting. Theory Probab. Appl. 35 307–317
work page 1990
-
[9]
Continuous Multi-Armed Bandits and Multiparameter Pro- cesses
Mandelbaum, A (1987). Continuous Multi-Armed Bandits and Multiparameter Pro- cesses. Ann. Probab. 15 1527–1556
work page 1987
-
[10]
Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman & Hall, London
work page 1985
-
[11]
Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control , World Sci., Singapore
work page 2006
-
[12]
Cesa-Bianchi, N. and Lugosi. G. (2006) Prediction, Learning, and Games , Cam- bridge Univ. Press, Cambridge
work page 2006
-
[13]
Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100. 41 B.Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003 Applied Mathematics and Information Science Department E-mail: Alexander.Kolnogorov@novsu.ru
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.