pith. sign in

arxiv: 1907.06074 · v1 · pith:KIW5ES7Hnew · submitted 2019-07-13 · 🧮 math.ST · stat.TH

A new approach to Poissonian two-armed bandit problem

Pith reviewed 2026-05-24 21:58 UTC · model grok-4.3

classification 🧮 math.ST stat.TH
keywords two-armed banditPoisson processBayesian strategyrecursive equationpiece-wise constant strategycontinuous timestochastic control
0
0 comments X

The pith

Bayesian piece-wise constant strategies for the continuous-time two-armed Poisson bandit are determined by recursive equations that depend only on the observed history of arrivals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a Bayesian treatment of the two-armed bandit in continuous time where each arm generates rewards according to a Poisson process. It supplies two versions of a recursive equation that compute the optimal piece-wise constant strategy and the associated Bayesian risk for an arbitrary prior. The same framework yields a partial differential equation in an appropriate limiting regime. The description tracks the current history of the counting processes rather than the evolving posterior on the unknown rates. A reader would care because this history-only formulation removes the need to maintain a separate belief state when calculating the value of continuing or switching.

Core claim

The Bayesian risk and the optimal piece-wise constant strategy for the Poissonian two-armed bandit can be obtained from recursive relations written directly in terms of the observed history of the two counting processes; the limiting case is governed by a partial differential equation.

What carries the argument

Recursive equations for the Bayesian piece-wise constant strategy and risk expressed in terms of the current history of the Poisson processes.

If this is right

  • The optimal policy and its risk become computable from the sequence of arrival times alone.
  • The method applies to any prior distribution on the two Poisson rates.
  • The limiting partial differential equation supplies an analytic description when the time horizon or rate parameters become large.
  • Switching decisions depend only on the counts observed so far and the elapsed times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Numerical solution of the recursions could be implemented by discretizing the possible histories.
  • The same history-based reduction might apply to other marked point processes beyond homogeneous Poisson.
  • If the recursions admit closed-form solutions for certain priors, they would yield explicit threshold rules for switching.

Load-bearing premise

Restricting attention to piece-wise constant strategies incurs no loss of optimality and the Bayesian risk under an arbitrary prior can be expressed solely through the observed history without additional state variables.

What would settle it

A direct computation showing that some non-piece-wise-constant strategy, or a strategy that conditions on extra posterior statistics, achieves strictly lower expected risk than the history-based solution would falsify the claim.

read the original abstract

We consider a continuous time two-armed bandit problem in which incomes are described by Poissonian processes. We develop Bayesian approach with arbitrary prior distribution. We present two versions of recursive equation for determination of Bayesian piece-wise constant strategy and Bayesian risk and partial differential equation in the limiting case. Unlike the previously considered Bayesian settings our description uses current history of the process and not evolution of the posterior distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a Bayesian approach to the continuous-time two-armed Poisson bandit problem for arbitrary priors. It derives two versions of recursive equations that determine a Bayesian piece-wise constant strategy and the associated Bayesian risk, together with a limiting PDE; the state is taken to be the observed history of the two Poisson processes rather than the evolving posterior measure.

Significance. If the recursions are valid and the history filtration is indeed sufficient, the framework would allow direct computation of optimal switching times and risk for non-conjugate priors without maintaining an auxiliary posterior state. The paper supplies no machine-checked proofs, reproducible code, or explicit verification that the claimed recursions recover the true Bayesian value.

major comments (2)
  1. [Abstract] Abstract and § on recursive equations: the central claim that the history filtration alone supplies a sufficient statistic for the conditional expected remaining reward (for arbitrary priors) is asserted without proof or counter-example; for non-conjugate priors the posterior is typically infinite-dimensional and the optimal switching boundary may depend on the full measure, not merely the observed counts.
  2. [Abstract] Abstract, paragraph 3: the restriction to piece-wise constant strategies is introduced without showing that it incurs no loss of optimality relative to the full class of history-adapted strategies; this assumption is load-bearing for the claim that the recursions solve the original problem.
minor comments (2)
  1. Notation for the two versions of the recursion is introduced without an explicit statement of the filtration or the precise definition of the value function on the history space.
  2. The limiting PDE is stated but the passage from the discrete recursion to the continuous PDE is not derived in detail.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the constructive comments on the sufficiency of the history filtration and the restriction to piece-wise constant strategies. We address each point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and § on recursive equations: the central claim that the history filtration alone supplies a sufficient statistic for the conditional expected remaining reward (for arbitrary priors) is asserted without proof or counter-example; for non-conjugate priors the posterior is typically infinite-dimensional and the optimal switching boundary may depend on the full measure, not merely the observed counts.

    Authors: For any fixed prior, the posterior at time t is obtained by Bayes' rule applied to the likelihood of the observed history (jump times and arm identities). Hence the conditional expectation of remaining reward given the current posterior is a deterministic functional of that history. The recursive equations are written directly on the history filtration, which therefore serves as a sufficient statistic without requiring an auxiliary parametrization of the posterior. This construction is valid for arbitrary (including non-conjugate) priors; the resulting value function is simply a map from histories to reals. We will insert a short clarifying paragraph stating the equivalence explicitly. revision: yes

  2. Referee: [Abstract] Abstract, paragraph 3: the restriction to piece-wise constant strategies is introduced without showing that it incurs no loss of optimality relative to the full class of history-adapted strategies; this assumption is load-bearing for the claim that the recursions solve the original problem.

    Authors: Piece-wise constant strategies decide the arm only at the event times of the merged Poisson process. Because of the memoryless property of the Poisson processes, the future evolution depends on the past only through the current history; therefore the dynamic-programming recursion on the history filtration yields the optimum within this class. We agree that an explicit argument that the global optimum over all adapted strategies coincides with the optimum inside the piece-wise constant class would strengthen the presentation. We will add a brief justification or reference to the relevant continuous-time bandit literature in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation presented as self-contained without quoted reductions

full rationale

The provided abstract and description introduce recursive equations and a limiting PDE for piece-wise constant Bayesian strategies that depend on process history rather than posterior evolution. No equations, self-citations, or derivations are quoted that reduce a claimed prediction or result to a fitted input or prior assumption by construction. The restriction to piece-wise constant strategies and sufficiency of history are stated as modeling choices without evidence of self-definitional closure or load-bearing self-citation chains. The approach is therefore treated as self-contained against external benchmarks for the purpose of this analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard properties of Poisson processes and the existence of an optimal Bayesian strategy within the class of piece-wise constant policies; no free parameters or invented entities are visible from the abstract.

axioms (2)
  • domain assumption Incomes are described by independent Poisson processes with unknown rates
    Stated in the first sentence of the abstract as the problem setup.
  • domain assumption A Bayesian piece-wise constant strategy exists and its risk satisfies a recursive equation
    Central modeling choice announced in the abstract without derivation.

pith-pipeline@v0.9.0 · 5570 in / 1245 out tokens · 24116 ms · 2026-05-24T21:58:34.596495+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    A new approach to Poissonian two-armed bandit problem

    Introduction. We consider a continuous time two-armed bandit problem. This setting results either in Poissonian or in a diffusion two-armed bandit. Quite general Poissonian two-armed bandit was considered in [1, 2]. In [3] consideration of Poissonian and diffusion bandit problems is restricted to the case of independent arms and discounted rewards. An inter...

  2. [2]

    Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ

    because we recalculate Bayesian risk with respect to current statistics (X1,t 1,X 2,t 2) and in [1], [2] recalculations are implemented with respect to current posterior distribution and t = t1 +t2. Our approach is applied POISSONIAN TWO-ARMED BANDIT 3 to quite general sets Θ. The approach presented in [1], [2] is applied to finite sets of parameters and g...

  3. [3]

    Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}

    Recursive equation. Let’s consider piece-wise constant strategies {σℓ(X1,t 1,X 2,t 2)}. To this end, we assume that control horizon is parti- tioned into a number of intervals of the length ∆ on which the chosen action does not change. Hence, T =N∆ and for any n1 +n2 =n<N , t1 =n1∆, t2 =n2∆ we have Pr(y(t′) = ℓ) = σℓ(X1,t 1,X 2,t 2) where σℓ(X1,t 1,X 2,t ...

  4. [4]

    In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary

    ifR(ℓ)(X1,t 1,X 2,t 2) has smaller value. In case of a drawR(1)(X1,t 1,X 2,t 2) = R(2)(X1,t 1,X 2,t 2) the choice is arbitrary

  5. [5]

    In this section, we ob- tain another version of recursive Bellman-type equation

    Another version of recursive equation. In this section, we ob- tain another version of recursive Bellman-type equation. Let’s denote ˜R(X1,t 1,X 2,t 2) =R(X1,t 1,X 2,t 2)×µ(X1,t 1,X 2,t 2), where{R(X1,t 1,X 2,t 2)} are Bayesian risks calculated with respect to the posterior distribution (2.1) and{µ(X1,t 1,X 2,t 2)} are defined in (2.2). Then the following ...

  6. [6]

    In this section, we consider the case when ∆ has a small value

    A limiting description. In this section, we consider the case when ∆ has a small value. In this case (3.3) takes the form ˜R(1)(X1,t 1,X 2,t 2) =g(1)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1 + ∆,X 2,t 2)− ˜R(X1,t 1 + ∆,X 2,t 2)X1t−1 1 ∆ + ˜R(X1 + 1,t 1 + ∆,X 2,t 2)(X1 + 1)t−1 1 ∆ +o(∆), ˜R(2)(X1,t 1,X 2,t 2) =g(2)(X1,t 1,X 2,t 2)∆ + ˜R(X1,t 1,X 2,t 2 + ∆)− ˜R(X1,t 1,...

  7. [7]

    Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Infor- mation: Bayesian Approach, Academic Press, New York

  8. [8]

    Presman, E. L. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting. Theory Probab. Appl. 35 307–317

  9. [9]

    Continuous Multi-Armed Bandits and Multiparameter Pro- cesses

    Mandelbaum, A (1987). Continuous Multi-Armed Bandits and Multiparameter Pro- cesses. Ann. Probab. 15 1527–1556

  10. [10]

    Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman & Hall, London

  11. [11]

    Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control , World Sci., Singapore

  12. [12]

    and Lugosi

    Cesa-Bianchi, N. and Lugosi. G. (2006) Prediction, Learning, and Games , Cam- bridge Univ. Press, Cambridge

  13. [13]

    Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100. 41 B.Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003 Applied Mathematics and Information Science Department E-mail: Alexander.Kolnogorov@novsu.ru