Optimal strategies in the all-heads coin game
Pith reviewed 2026-05-08 09:58 UTC · model grok-4.3
The pith
When the coin is biased toward heads, always setting aside exactly one head is optimal and the winning probability increases with the number of coins.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For p greater than one half the strategy of setting aside exactly one head is optimal, the map n to winning probability is strictly increasing, and the limit W(p) admits an explicit series representation satisfying p less than or equal to W(p) less than one. For p equal to one half every strategy achieves winning probability one half. For p less than one half near one half the deficit from one half is approximately delta times c_n to first order, where c_n satisfies a linear recursion for n at least seven with limit about 1.7035, so the optimal value sequence has a strict local minimum at n equals five and no local maximum.
What carries the argument
The Bellman equation for the stochastic shortest-path Markov decision process, which features a nonlinear suffix-maximum operator, solved by comparing the two natural strategies of setting aside one or all heads.
If this is right
- For p > 1/2 the winning probability is strictly increasing in the initial number of coins.
- The limiting winning probability W(p) has an explicit series representation.
- W(p) lies between p and 1.
- To first order in the bias deficit the optimal values have a strict local minimum at n=5.
- The sequence c_n obeys a linear recursion for n >=7 and converges to approximately 1.7035.
Where Pith is reading between the lines
- If the bias is only slightly below one half, players would do well to avoid starting with exactly five coins.
- The series representation allows numerical computation of the limiting probability for any given bias above one half.
- Local maxima in the winning probability sequence may emerge only when the bias against heads is stronger than the perturbative regime.
Load-bearing premise
That for every number of coins and every bias the optimal policy is one of the two simple strategies of setting aside one head or all heads.
What would settle it
For a fixed n and p greater than one half, solve the full Bellman optimality equations numerically to check whether the optimal action in every state is always to set aside exactly one head.
Figures
read the original abstract
We study a sequential coin-flipping game: a player starts with $n$~coins, each heads with probability~$p$, and in each round flips all remaining coins and must set aside at least one head, losing if none shows. The player wins once all coins have been set aside. The optimal winning probability~$w_{n,p}$ obeys a Bellman equation with a nonlinear suffix-maximum operator. For $p=\tfrac12$ every strategy achieves $w_{n,1/2}=\tfrac12$. For $p>\tfrac12$ the strategy~\One{} (set aside a single head) is optimal, $n\mapsto w_{n,p}$ is strictly increasing, and the limit $W(p):=\lim_n w_{n,p}$ has an explicit series representation with $p\le W(p)<1$. For $p<\tfrac12$ near~$\tfrac12$ we give a first-order perturbation expansion in $\delta:=\tfrac12-p$: the deficit satisfies $\tfrac12-w_{n,\,1/2-\delta}\approx\delta\,c_n$, where $c_n$ obeys a linear recursion for $n\ge7$ with limit $L\approx1.7035$. To first order the optimal-value sequence has a strict local minimum at $n=5$ and no local maximum.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies a stochastic shortest-path MDP for a coin-flipping game starting with n coins, each heads with probability p. In each step the player flips all remaining coins and must set aside at least one heads (losing on all tails); the goal is to set aside all coins. The Bellman operator is a nonlinear suffix-maximum over the number k of heads to set aside. The paper analyzes the pure policies 'One' (k=1) and 'All' (k=H), shows that for p=1/2 every policy yields w_{n,1/2}=1/2, claims that for p>1/2 the policy One is optimal with n↦w_{n,p} strictly increasing, derives an explicit series for the limit W(p) satisfying p≤W(p)<1, and for p<1/2 near 1/2 performs a first-order perturbation in δ=1/2-p showing that the deficit is ≈δ c_n where c_n obeys a linear recursion (n≥7) with limit L≈1.7035, implying a strict local minimum at n=5 to first order (local maxima being non-perturbative).
Significance. If the optimality and perturbation claims hold, the work supplies a complete characterization of the value function for this MDP, including an explicit series representation for the limiting winning probability and a recursion that yields a concrete numerical constant L from the model itself rather than external fitting. These features, together with the demonstration that the first-order local-minimum structure at n=5 is a perturbative phenomenon, constitute the principal contributions.
major comments (2)
- [the optimality analysis for p>1/2] The claim that One is optimal for p>1/2 (and the consequent identification of w_{n,p} with the One recursion) rests on direct comparison of the two pure policies but does not establish that the argmax in the nonlinear suffix-max Bellman operator is attained at k=1 rather than at some intermediate k when H>1. No induction on the state m or first-order optimality condition for the value function v_m is supplied to rule out intermediate actions. This is load-bearing for the monotonicity statement, the series for W(p), and all downstream conclusions.
- [the perturbation analysis and recursion for c_n] The first-order perturbation expansion for p=1/2-δ and the resulting local-minimum conclusion at n=5 rely on the same identification of the optimal policy; if intermediate k can improve the value for some states when p<1/2, the linear recursion for c_n and the sign pattern that produces the local minimum at n=5 would require re-derivation. The manuscript should verify that the perturbation preserves the optimality of the extreme policies to the order considered.
minor comments (2)
- [§2 (model and Bellman equation)] The notation for the value function under a general policy versus the optimal w_{n,p} should be introduced with a single consistent definition early in the text to avoid ambiguity when switching between policy values and the Bellman operator.
- [the section presenting the recursion and its limit] The numerical evaluation of L≈1.7035 is obtained by iterating the linear recursion; a short table or plot of the first 20 terms of c_n would make the convergence rate and the claimed limit transparent.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need to strengthen the optimality argument and its implications for the perturbation analysis. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [the optimality analysis for p>1/2] The claim that One is optimal for p>1/2 (and the consequent identification of w_{n,p} with the One recursion) rests on direct comparison of the two pure policies but does not establish that the argmax in the nonlinear suffix-max Bellman operator is attained at k=1 rather than at some intermediate k when H>1. No induction on the state m or first-order optimality condition for the value function v_m is supplied to rule out intermediate actions. This is load-bearing for the monotonicity statement, the series for W(p), and all downstream conclusions.
Authors: We agree that establishing optimality of the One policy requires more than comparing it to the All policy; an explicit argument is needed to confirm that the Bellman operator's argmax occurs at k=1 for every state when p>1/2. The manuscript demonstrates that One strictly outperforms All for p>1/2 and derives the associated recursion, but does not supply the full induction ruling out intermediate k. In the revision we will add an inductive proof on the state size m: assuming the claim holds for all smaller states, we show that for any current state with H heads the expected value under k=1 exceeds that under any k>1 when p>1/2. This induction simultaneously yields the strict monotonicity of w_{n,p} and justifies the series for the limit W(p). revision: yes
-
Referee: [the perturbation analysis and recursion for c_n] The first-order perturbation expansion for p=1/2-δ and the resulting local-minimum conclusion at n=5 rely on the same identification of the optimal policy; if intermediate k can improve the value for some states when p<1/2, the linear recursion for c_n and the sign pattern that produces the local minimum at n=5 would require re-derivation. The manuscript should verify that the perturbation preserves the optimality of the extreme policies to the order considered.
Authors: We concur that the first-order expansion in δ presupposes that the optimal policy remains at an extreme (k=1 or k=H) through O(δ). At δ=0 every policy yields exactly 1/2, so the linear correction terms must be checked to ensure no intermediate k produces a strictly better O(δ) improvement. In the revised manuscript we will augment the perturbation section with a verification that, to first order, the Bellman operator continues to select an extreme action; this preserves the linear recursion satisfied by c_n for n≥7 and the resulting sign pattern that produces the strict local minimum at n=5 to first order in δ. revision: yes
Circularity Check
No circularity; derivations follow directly from Bellman recursion and explicit policy comparison
full rationale
The paper sets up the problem as a stochastic shortest-path MDP whose Bellman operator is the nonlinear suffix-maximum over admissible k. It explicitly compares the value functions of the two pure policies One and All, obtains closed recursions for each, and for p > 1/2 shows that the One recursion yields a strictly increasing sequence whose limit admits an explicit series. For p < 1/2 the first-order perturbation coefficients c_n obey a linear recursion whose coefficients are the model's own transition probabilities; the numerical limit L is obtained by iterating that recursion, not by fitting. No parameter is estimated from data and then re-used as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled in. The central claims are therefore self-contained consequences of the stated Bellman equation and the two-policy comparison.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The game is a finite-state stochastic shortest-path Markov decision process whose value function satisfies the Bellman equation with the nonlinear suffix-maximum operator.
- standard math Standard results on existence of optimal policies and value iteration apply to this MDP.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.