Latent Order Bandits

Emil Carlsson; Fredrik D. Johansson; Newton Mwai

arxiv: 2605.07304 · v1 · submitted 2026-05-08 · 💻 cs.LG

Latent Order Bandits

Emil Carlsson , Newton Mwai , Fredrik D. Johansson This is my paper

Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords latent banditspartial ordersregret boundsupper confidence boundposterior samplingbandit algorithmssequential decision makingpersonalization

0 comments

The pith

Latent order bandits achieve low regret using only a shared partial order of action preferences within each latent state, allowing reward distributions to vary across instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces latent order bandits to reduce the prior knowledge needed for efficient personalization in sequential decisions. Standard latent bandits demand accurate models of full reward distributions and latent states, but here only a partial ordering of which actions are preferred in each state must be known in advance. Instances sharing a state can then differ arbitrarily in reward values or scales provided the preference order holds. An upper-confidence-bound algorithm is derived with a regret upper bound that covers both total and partial orders, and a posterior-sampling variant is shown to perform competitively or better in experiments, especially when scales differ within states.

Core claim

The central claim is that a partial order on actions known a priori and shared by all instances of a given latent state is sufficient to construct bandit algorithms with sublinear regret. An upper-confidence-bound procedure is presented for both total and partial orders together with a regret bound. A posterior-sampling algorithm is also introduced that, in a range of experiments, matches full-prior latent bandits when reward parameters are shared within states and outperforms them when reward scales differ between instances of the same state.

What carries the argument

The partial order of action preferences shared across instances of each latent state, which constrains only the ranking of actions without fixing their exact reward values or scales.

If this is right

Regret bounds continue to hold when reward parameters differ within the same latent state.
The methods apply directly to both complete and incomplete preference orders.
Posterior sampling yields better empirical performance than UCB alone.
Personalization problems become solvable with strictly less prior information than required by standard latent bandits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be used in recommendation systems where user groups agree on genre rankings but assign different numerical ratings.
Orders might be learned incrementally from data instead of being supplied as fixed priors.
The structure could extend to other bandit settings that supply only ordinal rather than cardinal information.
Weaker structural assumptions than full distributions may still permit efficient learning in many practical domains.

Load-bearing premise

A partial order of action preferences is known in advance and is identical for every instance that belongs to the same latent state.

What would settle it

Run the algorithm on instances where at least one violates the assumed partial order by reversing the preference ranking of two actions, then check whether observed regret exceeds the derived upper bound.

Figures

Figures reproduced from arXiv: 2605.07304 by Emil Carlsson, Fredrik D. Johansson, Newton Mwai.

**Figure 2.** Figure 2: Results on Synthetic environments. Average cumulative regret Reg(T) compared to baselines as a function of T (a) or at fixed T = 500 (b,c) for different arm and state configurations. In (c), lob algorithms receive partial order information containing a fraction q of the total order constraints. Despite using less prior knowledge, and all methods being well-specified, lobUCB and lobTS consistently outperfor… view at source ↗

**Figure 3.** Figure 3: Results on MovieLens (A). In (a), instances are well-specified, with parameters ordered according to one of the m = 60 states constructed by clustering the MovieLens rating data. In (b), for each instance (user) the ground-truth genre rating means µ 0 are mixed with the well-specified means µ to form µmix = αµ 0 + (1 − α)µ, to test robustness to misspecification. sufficient structure is available (constrai… view at source ↗

**Figure 4.** Figure 4: Instantaneous regret on MovieLens B (misspecified), using real-world user means µ 0 for rewards with m = 160. 0 50 100 150 Number of states, m (k = 19) 200 300 400 500 600 700 Cumulative regret, Reg( T = 500) UCB TS mUCB mTS lobUCB lobTS [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Cumulative regret on MovieLens B (misspecified) at T = 500, using real-world user means µ 0 for varying m. In [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Results on Synthetic environment with k = m = 10 where lobUCB and lobTS only project every Tproj rounds, as a function of Tproj . Other algorithms are unaffected by this parameter. B.4 Additional results on the Synthetic environment In [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Cumulative regret for the Synthetic environment with [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Instantaneous regret for the well-specified setting of the MovieLens environment with [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Cumulative regret for the well-specified Setting A of the MovieLens environment with [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Instantaneous regret for the misspecified (real-world means) setting of the MovieLens [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

read the original abstract

Bandit algorithms solve diverse sequential decision-making problems, but are often too sample-inefficient for from-scratch personalization. To substantially reduce exploration times, latent bandit algorithms exploit cross-instance structure implied by discrete latent states, provided that the posterior distribution of rewards and latent states is known and accurate. However, obtaining an accurate model of this structure is difficult, and a small number of latent states may be insufficient to characterize the reward distributions in all problem instances. We propose latent order bandits (LOB), relaxing the assumptions of latent bandits to require only prior knowledge of a partial order of action preferences in each state. This allows instances of the same state to vary in reward distributions, as long as the partial order of actions is shared. For example, groups of users on a streaming service may agree on which movie genres are the best but rate experiences on different scales. We give an upper-confidence bound procedure for the LOB problem, applicable to both total and partial latent orders, and give an upper bound on its regret. To improve empirical performance, we propose a posterior-sampling algorithm and show, in a suite of experiments, that both are competitive with full-prior latent bandits when same-state instances share reward parameters, and preferable to them when reward scales differ between instances with the same latent state.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper relaxes latent bandits to shared partial orders on actions per state, allowing scale variation within states, and backs it with UCB regret bounds plus posterior sampling that beats full models in mismatched-scale experiments.

read the letter

The main contribution is a relaxation called latent order bandits. Instead of needing accurate full reward models and posteriors over latent states, it only requires knowing a partial order on action preferences that is shared across instances of the same state. This lets reward scales or parameters differ within a state as long as the ordering stays fixed, which matches cases like users agreeing on genre rankings but rating on different scales.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces latent order bandits (LOB) as a relaxation of standard latent bandits. Instead of requiring a full prior model of reward distributions and latent states, LOB assumes only that a partial (or total) order over action preferences is known and shared across all instances of the same latent state, while permitting arbitrary variation in reward parameters or scales within each state. The authors present a UCB-style algorithm that incorporates the known order into its confidence bounds, derive an upper bound on its regret for both total and partial orders, propose a posterior-sampling variant for better empirical behavior, and report experiments in which both algorithms match full-prior latent bandits when reward parameters are identical across same-state instances and outperform them when scales differ.

Significance. If the regret analysis and experimental comparisons hold, the work offers a principled way to exploit partial structural side information in bandit problems. This could be valuable in personalization settings (e.g., recommendation or streaming) where users or contexts share coarse preference orders but differ in rating scales or noise levels, thereby reducing the sample complexity of from-scratch learning without demanding an accurate generative model of the full reward distributions.

minor comments (3)

The experimental section would benefit from an explicit statement of the number of independent runs, the precise metric used to declare superiority (e.g., cumulative regret at horizon T or average per-step regret), and whether the reported curves include confidence bands.
Notation for the partial-order case (e.g., how the feasible set of rankings is represented and how the UCB indices are adjusted) should be introduced with a small illustrative example early in the algorithmic section to aid readability.
The regret bound statement would be clearer if the dependence on the number of latent states, the width of the order, and the horizon were written out explicitly rather than left in big-O notation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of the manuscript, as well as for recognizing the potential value of latent order bandits in settings where only coarse preference structure is known. We are pleased with the recommendation for minor revision and will address any editorial suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces latent order bandits as a new relaxation of latent bandits, requiring only a shared partial order on action preferences per latent state while allowing arbitrary variation in reward distributions or scales. It then defines a UCB algorithm that incorporates this order into confidence bounds and derives a regret upper bound directly from the new problem definition for both total and partial orders. A posterior-sampling variant is proposed separately for empirical improvement. The derivation chain is self-contained: the order is treated as exogenous side information, the algorithms and bounds are constructed explicitly around it, and no step reduces by construction to a fitted parameter, prior self-citation, or renamed known result. Experiments test the claimed advantage without circular dependence on the analysis. This matches the default case of an independent proposal with external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard stochastic bandit assumptions plus the domain assumption of shared partial orders; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Existence of discrete latent states where instances share a partial order on action preferences but may differ in reward distributions
Invoked to justify the relaxation from full reward models while still enabling cross-instance structure exploitation.
domain assumption The partial order is known a priori
Required for the UCB procedure to operate without learning the order itself.

pith-pipeline@v0.9.0 · 5521 in / 1429 out tokens · 44964 ms · 2026-05-11T01:30:26.990353+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We define an LOB problem instance by the unobserved latent state s∈[m] and reward distributions P=(P1,...,Pk) ... Hs:={μ∈Rk:∀(a,a′)∈Os,μa≥μa′}
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the projection ˜μst lands on the boundary of Hst ... variances of the merged arm parameters are pooled

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Onur Atan, Cem Tekin, and Mihaela van der Schaar

doi: 10.1214/18-EJS1507. Onur Atan, Cem Tekin, and Mihaela van der Schaar. Global bandits.IEEE transactions on neural networks and learning systems, 29(12):5798–5811,

work page doi:10.1214/18-ejs1507
[2]

Identifiable latent bandits: Combining observational data and exploration for personalized healthcare

Ahmet Zahid Balcıo˘glu, Emil Carlsson, and Fredrik D Johansson. Identifiable latent bandits: Combining observational data and exploration for personalized healthcare. InICML 2024 Workshop: Foundations of Reinforcement Learning and Control–Connections and Perspectives,

work page 2024
[3]

A continuous information gain measure to find the most discriminatory problems for ai benchmarking

doi: 10.1109/ CEC48606.2020.9185782. Romain Camilleri, Andrew Wagenmaker, Jamie H Morgenstern, Lalit Jain, and Kevin G Jamieson. Active learning with safety constraints. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 33201–33214. Curran Associates, Inc.,

work page arXiv 2020
[4]

, urldate =

doi: 10.24963/ijcai.2021/305. Main Track. Emil Carlsson, Debabrota Basu, Fredrik Johansson, and Devdatt Dubhashi. Pure exploration in bandits with linear constraints. InInternational Conference on Artificial Intelligence and Statistics, pages 334–342. PMLR,

work page doi:10.24963/ijcai.2021/305 2021
[5]

Latent bandits revisited.Advances in Neural Information Processing Systems, 33:13423–13433, 2020a

Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, and Craig Boutilier. Latent bandits revisited.Advances in Neural Information Processing Systems, 33:13423–13433, 2020a. Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, Mohammad Ghavamzadeh, and Craig Boutilier. Non-stationary latent bandits.arXiv preprint arXiv:2012.00...

work page arXiv 2012
[6]

Should i send this notification? optimizing push notifications decision making by modeling the future.arXiv preprint arXiv:2202.08812,

Conor O’Brien, Huasen Wu, Shaodan Zhai, Dalin Guo, Wenzhe Shi, and Jonathan J Hunt. Should i send this notification? optimizing push notifications decision making by modeling the future.arXiv preprint arXiv:2202.08812,

work page arXiv
[7]

T. Zhao, M. Li, and M. Poloczek. Fast reconfigurable antenna state selection with hierarchical thompson sampling. InICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–6,

work page 2019
[8]

A Survey on Contextual Multi-armed Bandits

Li Zhou. A survey on contextual multi-armed bandits.arXiv preprint arXiv:1508.03326,

work page Pith review arXiv
[9]

8:N at ←N at + 1 9:end for Algorithm 4Gaussian-Gaussian TS bandit (TS) 1: Input:Number of arms k, prior means µ0 a, prior variancesσ 2,0 a , confidence parameterc 2:fora= 1, . . . , kdo 3:Initialize posterior:µ a ←µ 0 a,σ 2 a ←σ 2,0 a 4:end for 5:fort= 1, ..., Tdo 6:fora= 1, ..., Kdo 7:Sample˜µ a ∼ N(ˆµa, σ2 a) 8:end for 9:Choose actiona t ←arg max a ˜µa ...

work page 2015
[10]

0 1000 2000 3000 4000 Horizon, T (k = 19, m =

18 In Figure 7, we show the cumulative regret on theMovieLensdata set for different settings of m in the well-specified Setting A. 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000
[11]

0 200 400 600 800Average cumulative regret, Reg(T ) UCB TS mUCB mTS lobUCB lobTS (a)m= 10 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000
[12]

0 200 400 600 800 1000 1200Average cumulative regret, Reg(T ) UCB TS mUCB mTS lobUCB lobTS (b)m= 20 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000
[13]

In Figure 10, we show the instantaneous regret on theMovieLensdata set for different settings of m

0 500 1000 1500Average cumulative regret, Reg(T ) UCB TS mUCB mTS lobUCB lobTS (c)m= 80 Figure 9: Cumulative regret for the well-specified Setting A of the MovieLens environment with varyingm. In Figure 10, we show the instantaneous regret on theMovieLensdata set for different settings of m. When m grows, the degree of misspecification for the LOB algorit...

work page 2000
[14]

0.00 0.25 0.50 0.75 1.00 1.25 Average instantaneous regret UCB TS mUCB mTS lobUCB lobTS (a)m= 5 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000
[15]

0.0 0.5 1.0 1.5 Average instantaneous regret UCB TS mUCB mTS lobUCB lobTS (b)m= 20 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000
[16]

B.6 Computational infrastructure and code The synthetic simulations were run on a 2023 Apple M2 Pro laptop with 16 GB RAM

0.0 0.5 1.0 1.5 Average instantaneous regret UCB TS mUCB mTS lobUCB lobTS (c)m= 140 Figure 10: Instantaneous regret for the misspecified (real-world means) setting of the MovieLens environment with varyingm. B.6 Computational infrastructure and code The synthetic simulations were run on a 2023 Apple M2 Pro laptop with 16 GB RAM. Code to reproduce the expe...

work page 2023

[1] [1]

Onur Atan, Cem Tekin, and Mihaela van der Schaar

doi: 10.1214/18-EJS1507. Onur Atan, Cem Tekin, and Mihaela van der Schaar. Global bandits.IEEE transactions on neural networks and learning systems, 29(12):5798–5811,

work page doi:10.1214/18-ejs1507

[2] [2]

Identifiable latent bandits: Combining observational data and exploration for personalized healthcare

Ahmet Zahid Balcıo˘glu, Emil Carlsson, and Fredrik D Johansson. Identifiable latent bandits: Combining observational data and exploration for personalized healthcare. InICML 2024 Workshop: Foundations of Reinforcement Learning and Control–Connections and Perspectives,

work page 2024

[3] [3]

A continuous information gain measure to find the most discriminatory problems for ai benchmarking

doi: 10.1109/ CEC48606.2020.9185782. Romain Camilleri, Andrew Wagenmaker, Jamie H Morgenstern, Lalit Jain, and Kevin G Jamieson. Active learning with safety constraints. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 33201–33214. Curran Associates, Inc.,

work page arXiv 2020

[4] [4]

, urldate =

doi: 10.24963/ijcai.2021/305. Main Track. Emil Carlsson, Debabrota Basu, Fredrik Johansson, and Devdatt Dubhashi. Pure exploration in bandits with linear constraints. InInternational Conference on Artificial Intelligence and Statistics, pages 334–342. PMLR,

work page doi:10.24963/ijcai.2021/305 2021

[5] [5]

Latent bandits revisited.Advances in Neural Information Processing Systems, 33:13423–13433, 2020a

Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, and Craig Boutilier. Latent bandits revisited.Advances in Neural Information Processing Systems, 33:13423–13433, 2020a. Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, Mohammad Ghavamzadeh, and Craig Boutilier. Non-stationary latent bandits.arXiv preprint arXiv:2012.00...

work page arXiv 2012

[6] [6]

Should i send this notification? optimizing push notifications decision making by modeling the future.arXiv preprint arXiv:2202.08812,

Conor O’Brien, Huasen Wu, Shaodan Zhai, Dalin Guo, Wenzhe Shi, and Jonathan J Hunt. Should i send this notification? optimizing push notifications decision making by modeling the future.arXiv preprint arXiv:2202.08812,

work page arXiv

[7] [7]

T. Zhao, M. Li, and M. Poloczek. Fast reconfigurable antenna state selection with hierarchical thompson sampling. InICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–6,

work page 2019

[8] [8]

A Survey on Contextual Multi-armed Bandits

Li Zhou. A survey on contextual multi-armed bandits.arXiv preprint arXiv:1508.03326,

work page Pith review arXiv

[9] [9]

8:N at ←N at + 1 9:end for Algorithm 4Gaussian-Gaussian TS bandit (TS) 1: Input:Number of arms k, prior means µ0 a, prior variancesσ 2,0 a , confidence parameterc 2:fora= 1, . . . , kdo 3:Initialize posterior:µ a ←µ 0 a,σ 2 a ←σ 2,0 a 4:end for 5:fort= 1, ..., Tdo 6:fora= 1, ..., Kdo 7:Sample˜µ a ∼ N(ˆµa, σ2 a) 8:end for 9:Choose actiona t ←arg max a ˜µa ...

work page 2015

[10] [10]

0 1000 2000 3000 4000 Horizon, T (k = 19, m =

18 In Figure 7, we show the cumulative regret on theMovieLensdata set for different settings of m in the well-specified Setting A. 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000

[11] [11]

0 200 400 600 800Average cumulative regret, Reg(T ) UCB TS mUCB mTS lobUCB lobTS (a)m= 10 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000

[12] [12]

0 200 400 600 800 1000 1200Average cumulative regret, Reg(T ) UCB TS mUCB mTS lobUCB lobTS (b)m= 20 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000

[13] [13]

In Figure 10, we show the instantaneous regret on theMovieLensdata set for different settings of m

0 500 1000 1500Average cumulative regret, Reg(T ) UCB TS mUCB mTS lobUCB lobTS (c)m= 80 Figure 9: Cumulative regret for the well-specified Setting A of the MovieLens environment with varyingm. In Figure 10, we show the instantaneous regret on theMovieLensdata set for different settings of m. When m grows, the degree of misspecification for the LOB algorit...

work page 2000

[14] [14]

0.00 0.25 0.50 0.75 1.00 1.25 Average instantaneous regret UCB TS mUCB mTS lobUCB lobTS (a)m= 5 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000

[15] [15]

0.0 0.5 1.0 1.5 Average instantaneous regret UCB TS mUCB mTS lobUCB lobTS (b)m= 20 0 1000 2000 3000 4000 Horizon, T (k = 19, m =

work page 2000

[16] [16]

B.6 Computational infrastructure and code The synthetic simulations were run on a 2023 Apple M2 Pro laptop with 16 GB RAM

0.0 0.5 1.0 1.5 Average instantaneous regret UCB TS mUCB mTS lobUCB lobTS (c)m= 140 Figure 10: Instantaneous regret for the misspecified (real-world means) setting of the MovieLens environment with varyingm. B.6 Computational infrastructure and code The synthetic simulations were run on a 2023 Apple M2 Pro laptop with 16 GB RAM. Code to reproduce the expe...

work page 2023