A new approach to Poissonian two-armed bandit problem

Alexander Kolnogorov

arxiv: 1907.06074 · v1 · pith:KIW5ES7Hnew · submitted 2019-07-13 · 🧮 math.ST · stat.TH

A new approach to Poissonian two-armed bandit problem

Alexander Kolnogorov This is my paper

Pith reviewed 2026-05-24 21:58 UTC · model grok-4.3

classification 🧮 math.ST stat.TH

keywords two-armed banditPoisson processBayesian strategyrecursive equationpiece-wise constant strategycontinuous timestochastic control

0 comments

The pith

Bayesian piece-wise constant strategies for the continuous-time two-armed Poisson bandit are determined by recursive equations that depend only on the observed history of arrivals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a Bayesian treatment of the two-armed bandit in continuous time where each arm generates rewards according to a Poisson process. It supplies two versions of a recursive equation that compute the optimal piece-wise constant strategy and the associated Bayesian risk for an arbitrary prior. The same framework yields a partial differential equation in an appropriate limiting regime. The description tracks the current history of the counting processes rather than the evolving posterior on the unknown rates. A reader would care because this history-only formulation removes the need to maintain a separate belief state when calculating the value of continuing or switching.

Core claim

The Bayesian risk and the optimal piece-wise constant strategy for the Poissonian two-armed bandit can be obtained from recursive relations written directly in terms of the observed history of the two counting processes; the limiting case is governed by a partial differential equation.

What carries the argument

Recursive equations for the Bayesian piece-wise constant strategy and risk expressed in terms of the current history of the Poisson processes.

If this is right

The optimal policy and its risk become computable from the sequence of arrival times alone.
The method applies to any prior distribution on the two Poisson rates.
The limiting partial differential equation supplies an analytic description when the time horizon or rate parameters become large.
Switching decisions depend only on the counts observed so far and the elapsed times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Numerical solution of the recursions could be implemented by discretizing the possible histories.
The same history-based reduction might apply to other marked point processes beyond homogeneous Poisson.
If the recursions admit closed-form solutions for certain priors, they would yield explicit threshold rules for switching.

Load-bearing premise

Restricting attention to piece-wise constant strategies incurs no loss of optimality and the Bayesian risk under an arbitrary prior can be expressed solely through the observed history without additional state variables.

What would settle it

A direct computation showing that some non-piece-wise-constant strategy, or a strategy that conditions on extra posterior statistics, achieves strictly lower expected risk than the history-based solution would falsify the claim.

read the original abstract

We consider a continuous time two-armed bandit problem in which incomes are described by Poissonian processes. We develop Bayesian approach with arbitrary prior distribution. We present two versions of recursive equation for determination of Bayesian piece-wise constant strategy and Bayesian risk and partial differential equation in the limiting case. Unlike the previously considered Bayesian settings our description uses current history of the process and not evolution of the posterior distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Recursive equations for piece-wise constant strategies in Poisson bandits using process history directly, but the sufficiency of that history for arbitrary priors remains unshown.

read the letter

The paper sets up recursive equations for the Bayesian value and risk of piece-wise constant strategies in a continuous-time two-armed Poisson bandit, plus a limiting PDE, all written in terms of the observed history of the two counting processes rather than posterior evolution. It claims this works for arbitrary priors. That is the concrete contribution: two versions of the recursion that avoid maintaining the full posterior measure as state. If the derivations close properly, it could reduce the state space for some calculations. The abstract and stress-test note give no indication that the full text supplies machine-checked proofs or external verification, so the equations stand or fall on the algebra alone. The main soft spot is the unexamined claim that restricting to piece-wise constant strategies loses nothing and that the history filtration alone is sufficient for the conditional expected remaining reward under any prior. For non-conjugate priors the posterior shape is not a function of the counts, so the risk functional may not close without auxiliary variables; the abstract offers no argument or counter-example to address this. The citation pattern is not visible here, but the claim of difference from prior Bayesian settings needs explicit comparison to existing continuous-time bandit recursions. This is narrow work aimed at specialists already working on Poisson or event-driven bandits. A reader who needs exact Bayesian recursions for counting-process rewards might extract the equations and test them on small cases, but the optimality justification has to be checked first. The paper deserves a serious referee to verify whether the recursions are correctly derived and whether the history-only state actually suffices.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a Bayesian approach to the continuous-time two-armed Poisson bandit problem for arbitrary priors. It derives two versions of recursive equations that determine a Bayesian piece-wise constant strategy and the associated Bayesian risk, together with a limiting PDE; the state is taken to be the observed history of the two Poisson processes rather than the evolving posterior measure.

Significance. If the recursions are valid and the history filtration is indeed sufficient, the framework would allow direct computation of optimal switching times and risk for non-conjugate priors without maintaining an auxiliary posterior state. The paper supplies no machine-checked proofs, reproducible code, or explicit verification that the claimed recursions recover the true Bayesian value.

major comments (2)

[Abstract] Abstract and § on recursive equations: the central claim that the history filtration alone supplies a sufficient statistic for the conditional expected remaining reward (for arbitrary priors) is asserted without proof or counter-example; for non-conjugate priors the posterior is typically infinite-dimensional and the optimal switching boundary may depend on the full measure, not merely the observed counts.
[Abstract] Abstract, paragraph 3: the restriction to piece-wise constant strategies is introduced without showing that it incurs no loss of optimality relative to the full class of history-adapted strategies; this assumption is load-bearing for the claim that the recursions solve the original problem.

minor comments (2)

Notation for the two versions of the recursion is introduced without an explicit statement of the filtration or the precise definition of the value function on the history space.
The limiting PDE is stated but the passage from the discrete recursion to the continuous PDE is not derived in detail.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the constructive comments on the sufficiency of the history filtration and the restriction to piece-wise constant strategies. We address each point below.

read point-by-point responses

Referee: [Abstract] Abstract and § on recursive equations: the central claim that the history filtration alone supplies a sufficient statistic for the conditional expected remaining reward (for arbitrary priors) is asserted without proof or counter-example; for non-conjugate priors the posterior is typically infinite-dimensional and the optimal switching boundary may depend on the full measure, not merely the observed counts.

Authors: For any fixed prior, the posterior at time t is obtained by Bayes' rule applied to the likelihood of the observed history (jump times and arm identities). Hence the conditional expectation of remaining reward given the current posterior is a deterministic functional of that history. The recursive equations are written directly on the history filtration, which therefore serves as a sufficient statistic without requiring an auxiliary parametrization of the posterior. This construction is valid for arbitrary (including non-conjugate) priors; the resulting value function is simply a map from histories to reals. We will insert a short clarifying paragraph stating the equivalence explicitly. revision: yes
Referee: [Abstract] Abstract, paragraph 3: the restriction to piece-wise constant strategies is introduced without showing that it incurs no loss of optimality relative to the full class of history-adapted strategies; this assumption is load-bearing for the claim that the recursions solve the original problem.

Authors: Piece-wise constant strategies decide the arm only at the event times of the merged Poisson process. Because of the memoryless property of the Poisson processes, the future evolution depends on the past only through the current history; therefore the dynamic-programming recursion on the history filtration yields the optimum within this class. We agree that an explicit argument that the global optimum over all adapted strategies coincides with the optimum inside the piece-wise constant class would strengthen the presentation. We will add a brief justification or reference to the relevant continuous-time bandit literature in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation presented as self-contained without quoted reductions

full rationale

The provided abstract and description introduce recursive equations and a limiting PDE for piece-wise constant Bayesian strategies that depend on process history rather than posterior evolution. No equations, self-citations, or derivations are quoted that reduce a claimed prediction or result to a fitted input or prior assumption by construction. The restriction to piece-wise constant strategies and sufficiency of history are stated as modeling choices without evidence of self-definitional closure or load-bearing self-citation chains. The approach is therefore treated as self-contained against external benchmarks for the purpose of this analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard properties of Poisson processes and the existence of an optimal Bayesian strategy within the class of piece-wise constant policies; no free parameters or invented entities are visible from the abstract.

axioms (2)

domain assumption Incomes are described by independent Poisson processes with unknown rates
Stated in the first sentence of the abstract as the problem setup.
domain assumption A Bayesian piece-wise constant strategy exists and its risk satisfies a recursive equation
Central modeling choice announced in the abstract without derivation.

pith-pipeline@v0.9.0 · 5570 in / 1245 out tokens · 24116 ms · 2026-05-24T21:58:34.596495+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

[1]

A new approach to Poissonian two-armed bandit problem

Introduction. We consider a continuous time two-armed bandit problem. This setting results either in Poissonian or in a diﬀusion two-armed bandit. Quite general Poissonian two-armed bandit was considered in [1, 2]. In [3] consideration of Poissonian and diﬀusion bandit problems is restricted to the case of independent arms and discounted rewards. An inter...

work page internal anchor Pith review Pith/arXiv arXiv 2010
[2]

Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ

because we recalculate Bayesian risk with respect to current statistics (X1,t 1,X 2,t 2) and in [1], [2] recalculations are implemented with respect to current posterior distribution and t = t1 +t2. Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ. The approach presented in [1], [2] is applied to ﬁnite sets of parameters and g...

work page
[3]

Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}

Recursive equation. Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}. To this end, we assume that control horizon is parti- tioned into a number of intervals of the length ∆ on which the chosen action does not change. Hence, T =N∆ and for any n1 +n2 =n<N , t1 =n1∆, t2 =n2∆ we have Pr(y(t′) = ℓ) = σℓ(X1,t 1,X 2,t 2) where σℓ(X1,t 1,X 2,t ...

work page
[4]

In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary

ifR(ℓ)(X1,t 1,X 2,t 2) has smaller value. In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary

work page
[5]

In this section, we ob- tain another version of recursive Bellman-type equation

Another version of recursive equation. In this section, we ob- tain another version of recursive Bellman-type equation. Let’s denote ˜R(X1,t 1,X 2,t 2) =R(X1,t 1,X 2,t 2)×µ(X1,t 1,X 2,t 2), where{R(X1,t 1,X 2,t 2)} are Bayesian risks calculated with respect to the posterior distribution (2.1) and{µ(X1,t 1,X 2,t 2)} are deﬁned in (2.2). Then the following ...

work page
[6]

In this section, we consider the case when ∆ has a small value

A limiting description. In this section, we consider the case when ∆ has a small value. In this case (3.3) takes the form ˜R(1)(X1,t 1,X 2,t 2) =g(1)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1 + ∆,X 2,t 2)− ˜R(X1,t 1 + ∆,X 2,t 2)X1t−1 1 ∆ + ˜R(X1 + 1,t 1 + ∆,X 2,t 2)(X1 + 1)t−1 1 ∆ +o(∆), ˜R(2)(X1,t 1,X 2,t 2) =g(2)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1,X 2,t 2 + ∆)− ˜R(X1,t 1,...

work page
[7]

Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Infor- mation: Bayesian Approach, Academic Press, New York

work page 1990
[8]

Presman, E. L. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting. Theory Probab. Appl. 35 307–317

work page 1990
[9]

Continuous Multi-Armed Bandits and Multiparameter Pro- cesses

Mandelbaum, A (1987). Continuous Multi-Armed Bandits and Multiparameter Pro- cesses. Ann. Probab. 15 1527–1556

work page 1987
[10]

Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman & Hall, London

work page 1985
[11]

Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control , World Sci., Singapore

work page 2006
[12]

and Lugosi

Cesa-Bianchi, N. and Lugosi. G. (2006) Prediction, Learning, and Games , Cam- bridge Univ. Press, Cambridge

work page 2006
[13]

Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100. 41 B.Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003 Applied Mathematics and Information Science Department E-mail: Alexander.Kolnogorov@novsu.ru

work page 2018

[1] [1]

A new approach to Poissonian two-armed bandit problem

Introduction. We consider a continuous time two-armed bandit problem. This setting results either in Poissonian or in a diﬀusion two-armed bandit. Quite general Poissonian two-armed bandit was considered in [1, 2]. In [3] consideration of Poissonian and diﬀusion bandit problems is restricted to the case of independent arms and discounted rewards. An inter...

work page internal anchor Pith review Pith/arXiv arXiv 2010

[2] [2]

Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ

because we recalculate Bayesian risk with respect to current statistics (X1,t 1,X 2,t 2) and in [1], [2] recalculations are implemented with respect to current posterior distribution and t = t1 +t2. Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ. The approach presented in [1], [2] is applied to ﬁnite sets of parameters and g...

work page

[3] [3]

Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}

Recursive equation. Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}. To this end, we assume that control horizon is parti- tioned into a number of intervals of the length ∆ on which the chosen action does not change. Hence, T =N∆ and for any n1 +n2 =n<N , t1 =n1∆, t2 =n2∆ we have Pr(y(t′) = ℓ) = σℓ(X1,t 1,X 2,t 2) where σℓ(X1,t 1,X 2,t ...

work page

[4] [4]

In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary

ifR(ℓ)(X1,t 1,X 2,t 2) has smaller value. In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary

work page

[5] [5]

In this section, we ob- tain another version of recursive Bellman-type equation

Another version of recursive equation. In this section, we ob- tain another version of recursive Bellman-type equation. Let’s denote ˜R(X1,t 1,X 2,t 2) =R(X1,t 1,X 2,t 2)×µ(X1,t 1,X 2,t 2), where{R(X1,t 1,X 2,t 2)} are Bayesian risks calculated with respect to the posterior distribution (2.1) and{µ(X1,t 1,X 2,t 2)} are deﬁned in (2.2). Then the following ...

work page

[6] [6]

In this section, we consider the case when ∆ has a small value

A limiting description. In this section, we consider the case when ∆ has a small value. In this case (3.3) takes the form ˜R(1)(X1,t 1,X 2,t 2) =g(1)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1 + ∆,X 2,t 2)− ˜R(X1,t 1 + ∆,X 2,t 2)X1t−1 1 ∆ + ˜R(X1 + 1,t 1 + ∆,X 2,t 2)(X1 + 1)t−1 1 ∆ +o(∆), ˜R(2)(X1,t 1,X 2,t 2) =g(2)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1,X 2,t 2 + ∆)− ˜R(X1,t 1,...

work page

[7] [7]

Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Infor- mation: Bayesian Approach, Academic Press, New York

work page 1990

[8] [8]

Presman, E. L. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting. Theory Probab. Appl. 35 307–317

work page 1990

[9] [9]

Continuous Multi-Armed Bandits and Multiparameter Pro- cesses

Mandelbaum, A (1987). Continuous Multi-Armed Bandits and Multiparameter Pro- cesses. Ann. Probab. 15 1527–1556

work page 1987

[10] [10]

Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman & Hall, London

work page 1985

[11] [11]

Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control , World Sci., Singapore

work page 2006

[12] [12]

and Lugosi

Cesa-Bianchi, N. and Lugosi. G. (2006) Prediction, Learning, and Games , Cam- bridge Univ. Press, Cambridge

work page 2006

[13] [13]

Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100. 41 B.Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003 Applied Mathematics and Information Science Department E-mail: Alexander.Kolnogorov@novsu.ru

work page 2018