arxiv: 2604.25962 · v1 · submitted 2026-04-28 · 🪐 quant-ph · cs.DS

Recognition: unknown

Coherent Rollout Oracles for Finite-Horizon Sequential Decision Problems

Nishant Shukla

Authors on Pith no claims yet

Pith reviewed 2026-05-07 17:11 UTC · model grok-4.3

classification 🪐 quant-ph cs.DS

keywords coherent rolloutrank-selectreversible circuitsfinite-horizon planningsequential decision problemsbest-arm identificationquantum oraclesepidemic intervention

0 comments

The pith

Composing reversible rank-select circuits with transition and predicate evaluators produces an explicit polynomial-size coherent rollout oracle for finite-horizon planning problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a concrete way to build a unitary simulator for sequential decisions that unfold over a fixed number of steps. Randomness and action selection are kept inside explicit quantum registers so the whole process stays coherent. The key step is turning branch-dependent valid actions into a single entangled validity mask and then using rank-select to map selectors to actions reversibly. When these pieces are combined with reversible transition and predicate circuits, the resulting oracle fits existing quantum best-arm pipelines and delivers square-root scaling in the number of oracle calls. A reader following the argument sees a direct route from reversible-circuit primitives to a quadratic improvement over the classical lower bound for arm identification.

Core claim

For finite-horizon sequential decision problems whose valid actions admit an entangled N-bit validity mask and whose transition and predicate functions admit reversible circuits, rank-select over that mask can be realized with O(Nw) or O(N log w) gates depending on the gate layout; composing this primitive with the reversible transition and predicate circuits yields a polynomial-size coherent rollout oracle that satisfies the access model of Wang et al.'s best-arm pipeline and therefore requires only O(sqrt(k)/eps) coherent oracle calls rather than the classical Omega(k/eps^2) arm pulls.

What carries the argument

Totalized coherent rank-select over an entangled N-bit validity mask, which returns the position of the r-th set bit or a sentinel value when r exceeds the number of valid bits.

If this is right

The constructed oracle meets the exact access requirements of the best-arm identification pipeline, replacing classical quadratic arm-pull cost with square-root coherent calls.
A bounded-influence lifting theorem extends the lower-bound construction from a base configuration to an exponential family of related configurations.
The method is instantiated on SIR epidemic intervention together with a stochastic placement-game sanity check.
All main results are machine-checked in Lean 4, with explicit gate counts for the rank-select primitive.
Across bounded-fan-in layouts the rank-select primitive has an unconditional gate lower bound of Omega(N).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reversible-selection technique could be applied to coherent simulation of other sequential processes that require state-dependent action filtering, such as quantum control or adaptive sensing.
If the polynomial-size assumption holds for a given domain, the quadratic quantum advantage immediately transfers to any algorithm that reduces to best-arm identification over rollout trajectories.
Small-scale implementations on near-term hardware could test whether the coherent oracle overhead remains practical once ancilla and depth costs are included.
The lifting theorem suggests that hardness results for one epidemic or game configuration can be propagated to many nearby instances without re-deriving the full lower bound.

Load-bearing premise

Branch-dependent valid actions can be represented as an entangled N-bit validity mask that permits totalized coherent rank-select with a sentinel, and the problem admits reversible transition and predicate-evaluation circuits.

What would settle it

A concrete finite-horizon planning instance whose transition or predicate functions have no polynomial-size reversible circuit implementations, or whose validity mask cannot be totalized by rank-select without super-polynomial resources.

Figures

Figures reproduced from arXiv: 2604.25962 by Nishant Shukla.

**Figure 1.** Figure 1: The quantum move-selection algorithm as three nested levels. The view at source ↗

**Figure 2.** Figure 2: Top: the RankSelect circuit chains N per-position blocks (“Celli”), taking inputs valid[0. .N−1] and nth, and producing a binary-encoded position index in a shared w-qubit output register out[0. .w−1]. Bottom: detail of one block, applied at position i. The output register is preloaded with the binary encoding of the sentinel value N. On a match at position i (rank = nth and validi = 1), the circuit replac… view at source ↗

read the original abstract

Coherent quantum rollout for sequential decision problems requires a unitary simulator: randomness must live in explicit quantum registers, and basis-state selectors must be mapped to actions reversibly. With branch-dependent valid actions, this mapping is totalized coherent rank-select over an entangled $N$-bit validity mask: return the position of the $r$-th valid bit, or a sentinel if $r$ is out of range. We give the first reversible-circuit complexity analysis of this primitive. For selector width $w = \lceil \log_2(N+1) \rceil$, rank-select admits an $O(Nw)$-gate low-ancilla bounded-span scan, proved gate-optimal in its model, and an $O(N\log w)$-gate low-ancilla blocked construction when long-range gates are available; across all bounded-fan-in layouts, the unconditional gate lower bound is $\Omega(N)$. Composing rank-select with reversible transition and predicate-evaluation circuits gives an explicit polynomial-size coherent rollout oracle for finite-horizon planning problems satisfying these primitive assumptions. The resulting oracle satisfies the access model of the best-arm pipeline of Wang et al., yielding $\widetilde{O}(\sqrt{k}/\varepsilon)$ coherent oracle calls against the standard classical $\Omega(k/\varepsilon^2)$ arm-pull lower bound. We give a bounded-influence lifting theorem that extends this lower-bound construction from a base configuration to an exponential family of configurations. We instantiate the construction on SIR epidemic intervention, with a stochastic placement-game sanity check, and machine-check the main results in Lean 4. Code and proofs: https://github.com/BinRoot/b01t/tree/main/demos/rollout.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers the first explicit reversible-circuit analysis of coherent rank-select over entangled validity masks, with gate bounds, a lifting theorem, Lean checks, and a composition that yields a polynomial rollout oracle matching Wang et al.'s model.

read the letter

This paper's core advance is a reversible-circuit treatment of rank-select on an entangled N-bit validity mask. It supplies an O(Nw) bounded-span scan that is gate-optimal in its model, an O(N log w) blocked variant when long-range gates are allowed, and an unconditional Omega(N) lower bound. A bounded-influence lifting theorem extends lower-bound constructions, and the whole thing composes with reversible transition and predicate circuits to produce a coherent rollout oracle for finite-horizon planning under the stated assumptions. The oracle fits the access model of Wang et al., producing the claimed tilde-O(sqrt(k)/eps) coherent calls versus the classical Omega(k/eps^2) arm-pull bound. They instantiate on an SIR epidemic intervention and include a stochastic placement-game sanity check, plus Lean 4 machine-checks of the main theorems and public code on GitHub. Those are concrete positives. The constructions are explicit, the complexity claims are stated with matching upper and lower bounds, and the verification step removes some usual worries about hidden errors in the circuit arguments. The lifting theorem is a useful technical addition that does not obviously collapse to prior work. The soft spots are limited but real. The construction rests on the ability to represent branch-dependent valid actions as a totalized entangled mask and on the existence of reversible transition and predicate circuits; these are presented as primitive assumptions rather than derived, so the result applies cleanly only where those hold. The paper does not appear to address how the coherent oracle behaves under realistic noise or decoherence, which is a natural next question but outside the stated scope. The query advantage is derived in the ideal coherent model, so its translation to end-to-end quantum runtime advantage is left implicit. Readers working on quantum algorithms for planning, coherent oracles, or reversible circuit complexity will get direct value from the new primitives and the explicit bounds. The work is technically grounded enough and the claims specific enough that it deserves a serious referee rather than a desk rejection. I would send it out for peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript claims to deliver the first reversible-circuit analysis of coherent rank-select over an entangled N-bit validity mask (with sentinel) for branch-dependent actions in finite-horizon sequential decision problems. It constructs an O(Nw)-gate low-ancilla bounded-span scan (proved gate-optimal in its model) and an O(N log w)-gate blocked variant, then composes these with assumed reversible transition and predicate-evaluation circuits to obtain an explicit polynomial-size coherent rollout oracle. This oracle matches the access model of Wang et al.'s best-arm pipeline, yielding Õ(√k/ε) coherent calls versus the classical Ω(k/ε²) lower bound. The paper also supplies a bounded-influence lifting theorem extending the lower-bound construction, an SIR epidemic instantiation with a stochastic placement-game check, and Lean 4 machine-checks of the main theorems together with public code.

Significance. If the circuit constructions, composition, and lifting theorem hold under the stated primitive assumptions, the work supplies an explicit, machine-verified route to a polynomial-size coherent oracle that realizes a quadratic quantum advantage in query complexity for a class of finite-horizon planning problems. The Lean 4 machine-checks of the main theorems and the public GitHub repository constitute concrete strengths that enhance verifiability and reproducibility.

major comments (2)

[Composition section (after the rank-select constructions)] The central composition claim (rank-select plus reversible transition/predicate circuits yielding a polynomial-size coherent rollout oracle) is load-bearing for the query-complexity result; the manuscript should explicitly verify that the total gate count remains polynomial when the transition and predicate circuits are instantiated at the scale required by the finite-horizon problem (e.g., depth linear in horizon length).
[Lifting theorem and SIR section] The bounded-influence lifting theorem is used to extend the lower-bound construction to an exponential family of configurations; the proof sketch should clarify whether the influence bound is preserved under the entanglement of the validity mask and the sentinel mechanism, as any degradation would affect the applicability to the SIR instantiation.

minor comments (3)

The definition of the sentinel value and its handling in the rank-select circuit could be stated more explicitly (e.g., as a concrete bit-string or index) to avoid ambiguity when composing with downstream action-selection logic.
A short table comparing the two rank-select constructions (bounded-span vs. blocked) on gate count, ancilla count, and span would improve readability.
The abstract states 'explicit gate counts' and 'optimality proofs'; the main text should cross-reference the precise theorem numbers for these claims so readers can locate the formal statements without searching.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We appreciate the recognition of the strengths in our circuit constructions, Lean verification, and the potential for quadratic quantum advantage. Below, we provide point-by-point responses to the major comments and indicate the revisions we will make.

read point-by-point responses

Referee: The central composition claim (rank-select plus reversible transition/predicate circuits yielding a polynomial-size coherent rollout oracle) is load-bearing for the query-complexity result; the manuscript should explicitly verify that the total gate count remains polynomial when the transition and predicate circuits are instantiated at the scale required by the finite-horizon problem (e.g., depth linear in horizon length).

Authors: We agree that making the polynomial gate count explicit under finite-horizon scaling strengthens the composition claim. The manuscript assumes that the reversible transition and predicate-evaluation circuits are polynomial-sized in the state space and horizon length, which is standard for such problems. Composing these with our O(Nw)-gate rank-select circuit (where N is the number of actions and w = O(log N)) yields a total gate complexity that remains polynomial in the overall problem size, including the horizon. In the revised manuscript, we will add an explicit paragraph in the composition section verifying this, including a brief calculation showing the total size is O(poly(state dim, horizon, N)). This does not alter the query complexity result. revision: yes
Referee: The bounded-influence lifting theorem is used to extend the lower-bound construction to an exponential family of configurations; the proof sketch should clarify whether the influence bound is preserved under the entanglement of the validity mask and the sentinel mechanism, as any degradation would affect the applicability to the SIR instantiation.

Authors: The bounded-influence lifting theorem applies directly to the classical configurations of the sequential decision problem and is independent of the quantum implementation details. The entanglement of the validity mask and the sentinel are internal to our coherent rank-select construction and do not modify the classical influence bounds or the lower-bound construction. The lifting extends the base configuration to an exponential family while preserving the influence measure, as stated in the theorem. To address the referee's concern, we will add a clarifying remark in the lifting theorem section explicitly noting that the quantum circuit features do not impact the classical lifting. This clarification will also confirm applicability to the SIR instantiation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The manuscript supplies explicit O(Nw)-gate and O(N log w)-gate reversible rank-select circuits, proves gate-optimality in the bounded-fan-in model, composes them with reversible transition/predicate circuits under explicitly stated primitive assumptions, matches the access model of Wang et al., and derives the query bound via a new bounded-influence lifting theorem. All main theorems are machine-checked in Lean 4 with public code and proofs. No equation reduces to a self-definition, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain; the construction is independently verifiable from the given circuit descriptions and formalization.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard reversible circuit composition rules and the existence of reversible implementations for transitions and predicates; no new free parameters or invented physical entities are introduced.

axioms (2)

standard math Reversible circuits are composed from bounded-fan-in gates with standard models for ancilla and span.
Invoked for the O(Nw) scan, O(N log w) blocked construction, and Omega(N) lower bound.
domain assumption The sequential decision problem admits reversible transition and predicate-evaluation circuits.
Required to compose the full rollout oracle from rank-select.

pith-pipeline@v0.9.0 · 5600 in / 1606 out tokens · 106823 ms · 2026-05-07T17:11:12.206274+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 4 canonical work pages

[1]

Quantum exploration algo- rithms for multi-armed bandits,

D. Wang, X. You, T. Li, and A. M. Childs, “Quantum exploration algo- rithms for multi-armed bandits,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 11, 2021, pp. 10 102–10 110

2021
[2]

Quantum-enhanced machine learning,

V . Dunjko, J. M. Taylor, and H. J. Briegel, “Quantum-enhanced machine learning,”Physical Review Letters, vol. 117, no. 13, p. 130501, 2016

2016
[3]

Quantum best arm identification with quantum oracles,

X. Wang, Y .-Z. J. Chen, M. Guedes de Andrade, J. Allcock, M. Hajies- maili, J. C. S. Lui, and D. Towsley, “Quantum best arm identification with quantum oracles,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 20, 2025, pp. 21 321–21 329

2025
[4]

Quantum speedup of Monte Carlo methods,

A. Montanaro, “Quantum speedup of Monte Carlo methods,”Proceed- ings of the Royal Society A, vol. 471, no. 2181, p. 20150301, 2015

2015
[5]

The problem of dynamic programming on a quantum computer,

P. Ronagh, “The problem of dynamic programming on a quantum computer,” 2021

2021
[6]

Quantum algorithms for finite-horizon Markov decision processes,

B. Luo, Y . Huang, J. Allcock, X. Lin, S. Zhang, and J. C. S. Lui, “Quantum algorithms for finite-horizon Markov decision processes,” in Proceedings of the 42nd International Conference on Machine Learning (ICML), ser. PMLR, vol. 267, 2025, pp. 41 200–41 234

2025
[7]

Iterative quantum amplitude estimation,

D. Grinko, J. Gacon, C. Zoufal, and S. Woerner, “Iterative quantum amplitude estimation,”npj Quantum Information, vol. 7, no. 1, p. 52, 2021

2021
[8]

A Quantum Algorithm for Finding the Minimum

C. D ¨urr and P. Høyer, “A quantum algorithm for finding the minimum,” arXiv preprint quant-ph/9607014, 1996

work page Pith review arXiv 1996
[9]

Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems,

E. Even-Dar, S. Mannor, and Y . Mansour, “Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems,”Journal of Machine Learning Research, vol. 7, no. 39, pp. 1079–1105, 2006. [Online]. Available: http://jmlr.org/papers/ v7/evendar06a.html

2006
[10]

Quantum amplitude am- plification and estimation,

G. Brassard, P. Høyer, M. Mosca, and A. Tapp, “Quantum amplitude am- plification and estimation,” inQuantum Computation and Information, ser. Contemporary Mathematics, vol. 305. AMS, 2002, pp. 53–74

2002
[11]

Multi-armed bandits and quantum channel oracles,

S. Buchholz, J. M. K ¨ubler, and B. Sch ¨olkopf, “Multi-armed bandits and quantum channel oracles,”Quantum, vol. 9, p. 1672, 2025

2025
[12]

A fast quantum mechanical algorithm for database search,

L. K. Grover, “A fast quantum mechanical algorithm for database search,” inProceedings of the 28th Annual ACM Symposium on Theory of Computing, 1996, pp. 212–219

1996
[13]

Space-efficient static trees and graphs,

G. Jacobson, “Space-efficient static trees and graphs,” in30th Annual Symposium on Foundations of Computer Science, 1989, pp. 549–554

1989
[14]

A new quantum ripple-carry addition circuit

S. A. Cuccaro, T. G. Draper, S. A. Kutin, and D. P. Moulton, “A new quantum ripple-carry addition circuit,”arXiv preprint quant-ph/0410184, 2004

work page Pith review arXiv 2004
[15]

Optimizing quantum circuits for arithmetic,

T. H ¨aner, M. Roetteler, and K. M. Svore, “Optimizing quantum circuits for arithmetic,”arXiv preprint arXiv:1805.12445, 2018

work page arXiv 2018
[16]

Encoding electronic spectra in quantum circuits with linear T complexity,

R. Babbush, C. Gidney, D. W. Berry, N. Wiebe, J. McClean, A. Paler, A. Fowler, and H. Neven, “Encoding electronic spectra in quantum circuits with linear T complexity,”Physical Review X, vol. 8, no. 4, p. 041015, 2018

2018
[17]

Logical reversibility of computation,

C. H. Bennett, “Logical reversibility of computation,”IBM Journal of Research and Development, vol. 17, no. 6, pp. 525–532, 1973

1973
[18]

On the complexity of best- arm identification in multi-armed bandit models,

E. Kaufmann, O. Capp ´e, and A. Garivier, “On the complexity of best- arm identification in multi-armed bandit models,”Journal of Machine Learning Research, vol. 17, no. 1, pp. 1–42, 2016

2016
[19]

Lattimore and C

T. Lattimore and C. Szepesv ´ari,Bandit Algorithms. Cambridge University Press, 2020

2020
[20]

On the method of bounded differences,

C. McDiarmid, “On the method of bounded differences,” inSurveys in Combinatorics, 1989 (Norwich, 1989), ser. London Math. Soc. Lecture Note Ser. Cambridge Univ. Press, 1989, vol. 141, pp. 148–188

1989
[21]

A contribution to the mathemat- ical theory of epidemics,

W. O. Kermack and A. G. McKendrick, “A contribution to the mathemat- ical theory of epidemics,”Proceedings of the Royal Society of London. Series A, vol. 115, no. 772, pp. 700–721, 1927

1927
[22]

A quantum algorithm for the Hamiltonian NAND tree,

E. Farhi, J. Goldstone, and S. Gutmann, “A quantum algorithm for the Hamiltonian NAND tree,”Theory of Computing, vol. 4, no. 1, pp. 169– 190, 2008

2008
[23]

Any AND-OR formula of sizencan be evaluated in timen 1/2+o(1) on a quantum computer,

A. Ambainis, A. M. Childs, B. W. Reichardt, R. ˇSpalek, and S. Zhang, “Any AND-OR formula of sizencan be evaluated in timen 1/2+o(1) on a quantum computer,”SIAM Journal on Computing, vol. 39, no. 6, pp. 2513–2530, 2010

2010
[24]

Quantum algorithm for tree size estimation, with applications to backtracking and 2-player games,

A. Ambainis and M. Kokainis, “Quantum algorithm for tree size estimation, with applications to backtracking and 2-player games,” in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 989–1002

2017
[25]

Machine learning & artificial intelligence in the quantum domain: a review of recent progress,

V . Dunjko and H. J. Briegel, “Machine learning & artificial intelligence in the quantum domain: a review of recent progress,”Reports on Progress in Physics, vol. 81, no. 7, p. 074001, 2018

2018
[26]

Quantum-accessible reinforce- ment learning beyond strictly epochal environments,

A. Hamann, V . Dunjko, and S. W ¨olk, “Quantum-accessible reinforce- ment learning beyond strictly epochal environments,”Quantum Machine Intelligence, vol. 3, no. 2, p. 22, 2021

2021
[27]

Quantum policy gradient algorithm with optimized action decoding,

S. Wiedemann, D. Hein, S. Udluft, and C. Mendl, “Quantum policy gradient algorithm with optimized action decoding,”arXiv preprint arXiv:2212.06663, 2022

work page arXiv 2022
[28]

Quantum estimation of delay tail probabilities in scheduling and load balancing,

R. Srikant, “Quantum estimation of delay tail probabilities in scheduling and load balancing,” 2026

2026
[29]

Quantum speedups for Monte Carlo tree search,

O. R. L. Peters, “Quantum speedups for Monte Carlo tree search,” Master’s thesis, Leiden University, 2021. [Online]. Available: https://theses.liacs.nl/2033

2021
[30]

Option pricing under stochastic volatility on a quantum computer,

G. Wang and A. Kan, “Option pricing under stochastic volatility on a quantum computer,”Quantum, vol. 8, p. 1504, 2024

2024
[31]

Deux remarques sur l’estimation,

P. Assouad, “Deux remarques sur l’estimation,”Comptes rendus des s´eances de l’Acad´emie des sciences. S ´erie 1, vol. 296, no. 23, pp. 1021– 1024, 1983

1983
[32]

Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure,

G. Qu and N. Li, “Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure,” in2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 6479–6486

2019
[33]

A sparse sampling algorithm for near-optimal planning in large Markov decision processes,

M. Kearns, Y . Mansour, and A. Y . Ng, “A sparse sampling algorithm for near-optimal planning in large Markov decision processes,”Machine Learning, vol. 49, no. 2–3, pp. 193–208, 2002

2002