Identifiable Latent Bandits: Leveraging observational data for personalized decision-making

Ahmet Zahid Balcıoğlu; Emil Carlsson; Fredrik D. Johansson; Newton Mwai

arXiv: 2407.16239 · v6 · submitted 2024-07-23 · cs.LG · stat.ML

Pith reviewed 2026-05-23 22:31 UTC · model grok-4.3

keywords: bandits, latent, bandit, decision-making, optimal, personalized, data, decisions

The pith

Nonlinear independent component analysis identifies representations from observational data sufficient to infer optimal actions in new bandit instances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make personalized sequential decision-making practical by learning latent structures in bandit problems from historical records rather than starting from scratch for each new instance. Standard multi-armed bandits demand extensive online trials per patient or context, which exceeds available decision points in settings like medicine. The proposed framework applies nonlinear independent component analysis to past decisions and outcomes, yielding identifiable representations that transfer across instances. When the identification conditions hold, this yields provably sufficient information to select optimal actions with reduced exploration compared to classical or offline baselines. The work therefore supplies both a learning procedure and a consistency guarantee for latent bandit models that prior approaches left unspecified.
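To make the exploration argument concrete, here is a minimal sketch of a latent bandit over a finite set of candidate latent states, assuming the per-state mean rewards have already been learned offline. This is an illustrative simplification and not the paper's method: the paper works with continuous latents recovered by nonlinear ICA, whereas `run_latent_bandit`, the `mean_rewards` table, the Gaussian noise model, and posterior sampling over states are all assumptions introduced here.

```python
import numpy as np


def run_latent_bandit(mean_rewards, true_state, horizon, noise_sd=0.5, seed=0):
    """Posterior sampling over a finite set of candidate latent states.

    mean_rewards[z, a] is the (assumed known) mean reward of arm a in state z.
    Returns the cumulative regret curve and the final posterior over states.
    """
    rng = np.random.default_rng(seed)
    n_states, _ = mean_rewards.shape
    log_post = np.zeros(n_states)  # uniform prior, kept on the log scale
    best = mean_rewards[true_state].max()
    regret = np.zeros(horizon)

    for t in range(horizon):
        # Normalize the posterior in a numerically stable way.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        # Sample a candidate state and play its best arm.
        z_hat = rng.choice(n_states, p=post)
        arm = int(np.argmax(mean_rewards[z_hat]))
        reward = mean_rewards[true_state, arm] + noise_sd * rng.normal()
        # Gaussian log-likelihood update for every candidate state.
        log_post += -0.5 * ((reward - mean_rewards[:, arm]) / noise_sd) ** 2
        regret[t] = best - mean_rewards[true_state, arm]

    post = np.exp(log_post - log_post.max())
    return np.cumsum(regret), post / post.sum()


# Hypothetical reward table: 3 latent states, 3 arms.
mean_rewards = np.array([[1.0, 0.0, 0.2],
                         [0.0, 1.0, 0.2],
                         [0.2, 0.0, 1.0]])
cum_regret, posterior = run_latent_bandit(mean_rewards, true_state=1, horizon=200)
```

In this toy setting the learner only has to tell a few candidate states apart rather than estimate every arm from scratch, which is the intuition behind the shorter exploration phase described above.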

Core claim

The central claim is that nonlinear independent component analysis applied to observational data of decisions and outcomes provably recovers representations sufficient to infer optimal actions for new bandit problem instances, thereby enabling shorter exploration phases than standard bandits that must learn without such pre-identified structure.

What carries the argument

nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances

If this is right

Optimal actions inferred with shorter exploration time than classical bandits
Learning performed from historical records of decisions and outcomes
Substantial improvement over online and offline baselines when identification conditions hold
Provable identification of sufficient representations under the paper's conditions on the latent model

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same observational pre-training could be applied to other sequential decision settings if analogous identifiability results are available.
Collecting large historical decision-outcome logs in a domain would become a one-time cost that amortizes across many future instances.
If the nonlinear ICA step succeeds, the resulting representations could serve as fixed features for downstream policy optimization without further latent inference.
The approach suggests that identifiability failures in real data would manifest as degraded transfer performance rather than mere statistical inefficiency.

Load-bearing premise

A latent variable model of problem instances can be learned consistently from observational data via nonlinear ICA under the paper's stated conditions.

What would settle it

A simulation or semi-synthetic experiment in which the learned representations do not permit recovery of optimal actions on held-out bandit instances at the claimed sample efficiency would falsify the sufficiency claim.

Figures

Figures reproduced from arXiv: 2407.16239 by Ahmet Zahid Balcıoğlu, Emil Carlsson, Fredrik D. Johansson, Newton Mwai.

**Figure 2.** The structural causal model of Assumption 3.1 for an example patient instance i. Dashed arrows indicate potential sources of confounding bias that our model can handle. Assumptions in related work. Our assumptions on g are relaxed compared to some previous work on latent bandits [46, 22], to allow for nonlinear functions and continuous latent states. Assumption 3.1 c) is more typical of the literature on …

**Figure 3.** Cumulative regret results for ADCB comparing ILB decision-making algorithms to baselines. Error bars indicate one standard error computed with 200 seeds. The LVMs are fitted across I = 100 instances with T_o = 200 time points each with an L = 2 layered model. Theorem 3.5 is proven in Appendix D. Once a latent is estimated, we could either use a greedy strategy and choose the best arm under the estimated lat…

**Figure 4.** Cumulative regret for synthetic environment (left) comparing …

**Figure 5.** Cumulative regret for out-of-distribution experiments with increased …

**Figure 3.** In both cases, offline (Regression) and hybrid ( …

**Figure 6.** Expected cumulative regret of ILB and baseline algorithms for different levels of standard deviation σ = 0.25, 0.5, and 1 of Gaussian noise in the context Xt. The error bars show standard error calculated across 1000 seeds. E Additional Experiments. E.1 Ablation for identifiability. In order to test the adaptability of the CPG and FPG algorithms to settings where our assumptions break down, we prepared a set of experiments where …

**Figure 7.** Expected cumulative regret for bandit algorithms for out-of-distribution generalization with means …

**Figure 8.** Expected cumulative regret for bandit algorithms on the respective ADCB dataset with latent noise …

**Figure 9.** Expected cumulative regret plot for MAB, Regression and …

**Figure 10.** Expected cumulative regret for bandit algorithms in the cases of …

**Figure 11.** Expected cumulative regret plot for different exponential family noise. The error …

**Figure 12.** Synthetic example comparing linear contextual bandits for stationary context.

**Figure 13.** Histogram over 50 bins of the bimodally distributed continuous component of the latent …

**Figure 14.** Conditional linear reward models in ADCB with heterogeneity over the latent state.

Original abstract

Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames latent bandits with nonlinear ICA to extract identifiable representations from observational data for faster personalization, but the identifiability may not survive typical behavior policies.

The main point is that the authors link nonlinear ICA to latent bandits so that historical decision-outcome records can yield representations sufficient to pick good actions quickly in new instances. This specific combination for decision-making appears new relative to earlier latent bandit and ICA papers. They correctly flag the sample-hunger problem in settings like personalized medicine and show that, when the identification conditions hold, their method beats standard online and offline baselines in the reported simulations and semi-synthetic runs. That empirical pattern is useful to see.

The soft spot is the identifiability step itself. Standard nonlinear ICA results need independent latents, invertible mixing, and enough auxiliary variation to fix permutation and scaling issues. The data here is generated under an unknown behavior policy that can correlate actions with outcomes or restrict the support of observed pairs, which may remove the required variation. The abstract asserts provable identification without spelling out how the bandit data-generating process satisfies those ICA conditions rather than assuming them. If the proof does not derive the conditions from the bandit setup, the claim that the representations suffice for optimal actions does not follow. The experiments only test the case where conditions are satisfied, so they leave the robustness question open.

The citation pattern is ordinary for the subfield and does not create circularity. This work is aimed at researchers who already work on latent-variable models for sequential decisions or offline-to-online transfer. A reader in that niche could extract the framing and the simulation protocol even if the proofs need tightening. It deserves peer review so the derivations can be checked against the actual data-generating assumptions.

Referee Report

2 major / 2 minor

Summary. The paper proposes an identifiable latent bandit framework that applies nonlinear independent component analysis (ICA) to observational records of decisions and outcomes. It claims that this yields provably identifiable latent representations sufficient to infer optimal actions in new bandit instances, thereby enabling shorter exploration than classical bandits. The approach is evaluated in simulated and semi-synthetic environments, with gains reported when the identification conditions hold.

Significance. If the identification result is shown to hold for the bandit data-generating process, the work would provide a principled way to transfer knowledge from historical data to new personalized decision problems, potentially lowering sample complexity in domains such as medicine. The explicit linkage of nonlinear ICA identifiability to bandit optimality is a substantive contribution when the requisite conditions are verified.

major comments (2)

[Abstract / identification section] Abstract and identification theorem (presumably Section 3): the claim that nonlinear ICA 'provably identifies representations from observational data' sufficient for optimal actions requires an explicit derivation showing that the bandit data-generating process under an unknown behavior policy supplies the independent latent variation, invertible mixing, and auxiliary contrast needed for consistent recovery. Standard nonlinear ICA results do not automatically apply when the behavior policy may correlate actions with rewards or restrict support; without this derivation the sufficiency claim does not follow.
[Experiments] Experimental evaluation: the reported gains are conditioned on 'identifying conditions are satisfied,' yet no ablation or diagnostic is described that checks whether the learned latents satisfy the independence and variation assumptions on the actual observational data. This leaves open whether the empirical improvements are due to the claimed identification or to other factors.
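The concern about a behavior policy that restricts support can be probed with a simple overlap check on logged data. The sketch below is a hypothetical diagnostic, not something the paper describes: the function names, the discretization of contexts into bins, and the `min_prob` threshold are all assumptions chosen for illustration.

```python
import numpy as np


def empirical_propensities(context_bins, actions, n_bins, n_arms):
    """Empirical P(action | context bin) from logged data.

    Rows for context bins that never occur in the log are NaN.
    """
    counts = np.zeros((n_bins, n_arms))
    np.add.at(counts, (context_bins, actions), 1)  # accumulate repeated pairs
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return counts / totals


def support_violations(propensities, min_prob=0.05):
    """List (bin, arm) pairs whose empirical propensity is below min_prob."""
    low = np.nan_to_num(propensities, nan=0.0) < min_prob
    return [(int(b), int(a)) for b, a in np.argwhere(low)]


# Toy log: in context bin 1 the behavior policy never plays arm 1.
context_bins = np.array([0, 0, 0, 0, 1, 1, 1, 1])
actions = np.array([0, 1, 0, 1, 0, 0, 0, 0])
props = empirical_propensities(context_bins, actions, n_bins=2, n_arms=2)
violations = support_violations(props)
```

A non-empty `violations` list would flag context-action regions where the observational data carries little or no information, which is where the variation needed for identification is most likely to be missing.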

minor comments (2)

[Preliminaries / method] Clarify the precise statement of the nonlinear ICA assumptions (e.g., which auxiliary variation source is used) and how they map onto the bandit observation model.
[Experiments] Add error bars or confidence intervals to all reported performance curves and tables.
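As a sketch of how such error bars could be computed, the snippet below aggregates per-round regret across seeds into a mean cumulative-regret curve with one standard error, matching the "one standard error" convention mentioned in the Figure 3 caption. The function name and the array layout `(n_seeds, horizon)` are assumptions for illustration.

```python
import numpy as np


def regret_curve_with_se(per_round_regret):
    """Mean and standard error of cumulative regret across seeds.

    per_round_regret: array-like of shape (n_seeds, horizon).
    """
    cum = np.cumsum(np.asarray(per_round_regret, dtype=float), axis=1)
    mean = cum.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(number of seeds).
    se = cum.std(axis=0, ddof=1) / np.sqrt(cum.shape[0])
    return mean, se


# Toy input: 3 seeds, 3 rounds.
mean_curve, se_curve = regret_curve_with_se([[1, 0, 1], [0, 0, 1], [1, 1, 1]])
```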

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important points regarding the rigor of the identifiability claim and the experimental validation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [Abstract / identification section] Abstract and identification theorem (presumably Section 3): the claim that nonlinear ICA 'provably identifies representations from observational data' sufficient for optimal actions requires an explicit derivation showing that the bandit data-generating process under an unknown behavior policy supplies the independent latent variation, invertible mixing, and auxiliary contrast needed for consistent recovery. Standard nonlinear ICA results do not automatically apply when the behavior policy may correlate actions with rewards or restrict support; without this derivation the sufficiency claim does not follow.

Authors: We agree that an explicit derivation tailored to the bandit observational process is necessary to rigorously connect standard nonlinear ICA identifiability results to our setting. The current Section 3 invokes the general nonlinear ICA theorem but does not derive that an arbitrary behavior policy yields the required independent latent variation, invertible mixing function, and auxiliary contrast variables. In the revision we will add a new subsection (3.2) that states mild assumptions on the historical behavior policy (e.g., positive probability of sufficient exploration) and shows that these suffice for the ICA conditions to hold, thereby justifying the sufficiency claim for optimal action inference. revision: yes
Referee: [Experiments] Experimental evaluation: the reported gains are conditioned on 'identifying conditions are satisfied,' yet no ablation or diagnostic is described that checks whether the learned latents satisfy the independence and variation assumptions on the actual observational data. This leaves open whether the empirical improvements are due to the claimed identification or to other factors.

Authors: We concur that the experimental section would benefit from explicit diagnostics. The current evaluation reports performance gains only when identification conditions hold by construction in the simulated environments, but does not include post-hoc checks on the recovered latents from the observational data. In the revision we will add an ablation subsection that reports (i) estimated mutual information between recovered components to verify independence and (ii) empirical support and variation statistics on the observational datasets, allowing readers to assess whether the observed improvements align with successful identification. revision: yes
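The independence diagnostic promised in the second response could look roughly like the sketch below, which estimates pairwise mutual information between recovered latent components with a plug-in histogram estimator. This is an assumed implementation, not the authors': `histogram_mutual_information`, `pairwise_mi`, and the choice of 16 bins are hypothetical, and the plug-in estimator is biased upward for finite samples, so values are best compared against each other rather than against zero.

```python
import numpy as np


def histogram_mutual_information(x, y, bins=16):
    """Plug-in estimate of I(X; Y) in nats from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    mask = pxy > 0  # skip empty cells to avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))


def pairwise_mi(latents, bins=16):
    """Symmetric matrix of pairwise MI estimates between latent columns."""
    d = latents.shape[1]
    mi = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            value = histogram_mutual_information(latents[:, i], latents[:, j], bins)
            mi[i, j] = mi[j, i] = value
    return mi


# Toy check: columns 0 and 1 are independent, column 2 is a noisy copy of 0.
rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)
c = a + 0.1 * rng.normal(size=5000)
mi_matrix = pairwise_mi(np.column_stack([a, b, c]))
```

Large off-diagonal entries relative to the rest of the matrix would suggest that the recovered components are not independent, which is one way the identification conditions could be seen to fail on real logs.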

Circularity Check

0 steps flagged

No circularity; identifiability claim rests on external nonlinear ICA theory

The paper's derivation chain invokes standard nonlinear ICA results (with stated conditions on the latent variable model and observational data) to identify representations sufficient for optimal actions. This grounding is presented as independent of the current paper's fitted values or self-citations, with no equations or steps reducing the claimed identification to a fit, renaming, or load-bearing self-citation chain. The central sufficiency claim therefore remains externally falsifiable and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; full paper likely specifies identifiability assumptions for nonlinear ICA and consistency of the latent model.

axioms (1)

Domain assumption: Nonlinear ICA can provably identify latent representations from observational data under suitable conditions.
Invoked as the basis for learning the latent bandit model from historical records.

pith-pipeline@v0.9.0 · 5722 in / 1023 out tokens · 18170 ms · 2026-05-23T22:31:27.529062+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 5 internal anchors

[1] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127–135. PMLR, 2013.

[2] S. Athey and G. W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5):1–26, 2015.

[3] E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015.

[4] D. Bouneffouf, S. Parthasarathy, H. Samulowitz, and M. Wistub. Optimal exploitation of clustering and history information in multi-armed bandit. arXiv preprint arXiv:1906.03979, 2019.

[5] L. Bui, R. Johari, and S. Mannor. Clustered bandits. arXiv preprint arXiv:1206.4169, 2012.

[6] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011.

[7] P. Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.

[8] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(2):148–164, 1979.

[9] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.

[10] P. R. Hahn, V. Dorie, and J. S. Murray. Atlantic causal inference conference (acic) data analysis challenge 2017. arXiv preprint arXiv:1905.09515, 2019.

[12] S. Håkansson, V. Lindblom, O. Gottesman, and F. D. Johansson. Learning to search efficiently for causally near-optimal treatments. Advances in Neural Information Processing Systems, 33:1333–1344, 2020.

[13] S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao. Adbench: Anomaly detection benchmark. Advances in Neural Information Processing Systems, 35:32142–32159, 2022.

[14] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR (Poster), 3, 2017.

[15] J. Hong, B. Kveton, M. Zaheer, Y. Chow, A. Ahmed, and C. Boutilier. Latent bandits revisited. Advances in Neural Information Processing Systems, 33:13423–13433, 2020.

[16] E. K. Huch, J. Shi, M. R. Abbott, J. R. Golbus, A. Moreno, and W. H. Dempsey. RoME: A robust mixed-effects bandit algorithm for optimizing mobile health interventions. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[17] A. Hyvarinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in neural information processing systems, 29, 2016.

[18] A. Hyvarinen, H. Sasaki, and R. Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019.

[19] I. Khemakhem, D. Kingma, R. Monti, and A. Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020.

[20] N. M. Kinyanjui, E. Carlsson, and F. D. Johansson. Fast treatment personalization with latent bandits in fixed-confidence pure exploration. Transactions on Machine Learning Research,

[21] Expert Certification

[22] N. M. Kinyanjui and F. D. Johansson. Adcb: An alzheimer’s disease simulator for benchmarking observational estimators of causal effects. In Conference on Health, Inference, and Learning, pages 103–118. PMLR, 2022.

[23] T. Kocák, R. Munos, B. Kveton, S. Agrawal, and M. Valko. Spectral bandits. Journal of Machine Learning Research, 21(218):1–44, 2020.

[24] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165, 2019.

[25] F. Lattimore, T. Lattimore, and M. D. Reid. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016.

[26] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020.

[27] S. Lee and E. Bareinboim. Structural causal bandits: Where to intervene? Advances in neural information processing systems, 31, 2018.

[28] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.

[29] C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30, 2017.

[30] Z. Lu, Y. Cheng, M. Zhong, G. Stoian, Y. Yuan, and G. Wang. Causal effect estimation using variational information bottleneck. In International Conference on Web Information Systems and Applications, pages 288–296. Springer, 2022.

[32] O.-A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144. PMLR, 2014.

[33] S. A. Murphy, L. M. Collins, and A. J. Rush. Customizing treatment to the patient: Adaptive treatment strategies, 2007.

[34] B. Oetomo, R. M. Perera, R. Borovica-Gajic, and B. I. Rubinstein. Cutting to the chase with warm-start contextual bandits. Knowledge and Information Systems, 65(9):3533–3565, 2023.

[35] B. Oetomo, R. M. Perera, R. Borovica-Gajic, and B. I. Rubinstein. Warm-starting contextual bandits under latent reward scaling. ICDM, 2024.

[36] J. Pearl. Causality. Cambridge university press, 2009.

[37] N. Radcliffe. Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, pages 14–21, 2007.

[38] V. Rakesh, R. Guo, R. Moraffah, N. Agarwal, and H. Liu. Linked causal variational autoencoder for inferring paired spillover effects. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1679–1682, 2018.

[39] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.

[40] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[41] P. R. Rosenbaum, P. Rosenbaum, and Briskman. Design of observational studies, volume 10. Springer, 2010.

[42] D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.

[43] A. Russo, A. M. Metelli, and M. Restelli. Switching latent bandits. Transactions on Machine Learning Research, 2024.

[44] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.

[46] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.

[47] R. Sen, K. Shanmugam, M. Kocaoglu, A. Dimakis, and S. Shakkottai. Contextual bandits with latent confounders: An nmf approach. In Artificial Intelligence and Statistics, pages 518–527. PMLR, 2017.

[48] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pages 3076–3085. PMLR, 2017.

[49] J. A. Singh, K. G. Saag, S. L. Bridges Jr, E. A. Akl, R. R. Bannuru, M. C. Sullivan, E. Vaysbrot, C. McNaughton, M. Osani, R. H. Shmerling, et al. 2015 american college of rheumatology guideline for the treatment of rheumatoid arthritis. Arthritis & rheumatology, 68(1):1–26, 2016.

[50] A. A. Tahami Monfared, N. N. Phan, I. Pearson, J. Mauskopf, M. Cho, Q. Zhang, and H. Hampel. A systematic review of clinical practice guidelines for alzheimer’s disease and strategies for future advancements. Neurology and therapy, 12(4):1257–1284, 2023.

[51] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[52] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

[53] Y. Wang and M. I. Jordan. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021.

[54] L. Yao, Z. Chu, S. Li, Y. Li, J. Gao, and A. Zhang. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(5):1–46, 2021.

[55] C. Zhang, A. Agarwal, H. Daumé III, J. Langford, and S. N. Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301, 2019.

[56] K. Zhong, F. Xiao, Y. Ren, Y. Liang, W. Yao, X. Yang, and L. Cen. Descn: Deep entire space cross networks for individual treatment effect estimation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4612–4620, 2022.

[57] L. Zhou. A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326, 2015.

Appendix A Notation. Table 2: Notation. Indices that indicate problem instances i and time points t are dropped when clear from context (e.g., when stated to be fixed in text or in i.i.d. distributions over multiple instances). Random variables: $Z_i$, latent state for...

[58] $+\ \lambda_2 \mathcal{N}(\mu_2, \sigma_2^2)$ where:
• $\lambda_1 = 0.572$ and $\lambda_2 = 0.428$ are the mixture weights with $\lambda_1 + \lambda_2 = 1$,
• $\mu_1 = 0.0979$ and $\mu_2 = 0.1986$ are the means of the Gaussian components,
• $\sigma_1^2 = 0.000541$ and $\sigma_2^2 = 0.000752$ are the variances of the Gaussian components.

[1] [1]

Agrawal and N

S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pages 127–135. PMLR, 2013

work page 2013

[2] [2]

Athey and G

S. Athey and G. W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5):1–26, 2015

work page 2015

[3] [3]

Bareinboim, A

E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobserved confounders: A causal approach. Advances in Neural Information Processing Systems, 28, 2015

work page 2015

[4] [4]

Optimal Exploitation of Clustering and History Information in Multi-Armed Bandit

D. Bouneffouf, S. Parthasarathy, H. Samulowitz, and M. Wistub. Optimal exploitation of clustering and history information in multi-armed bandit. arXiv preprint arXiv:1906.03979, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906

[5] [5]

L. Bui, R. Johari, and S. Mannor. Clustered bandits. arXiv preprint arXiv:1206.4169, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[6] [6]

W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011

work page 2011

[7] [7]

P. Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994

work page 1994

[8] [8]

J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society Series B: Statistical Methodology, 41(2):148–164, 1979

work page 1979

[9] [9]

Gutmann and A

M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages 297–304. JMLR Workshop and Conference Proceedings, 2010

work page 2010

[10] [10]

P. R. Hahn, V . Dorie, and J. S. Murray. Atlantic causal inference conference (acic) data analysis challenge 2017. arXiv preprint arXiv:1905.09515, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [12]

Håkansson, V

S. Håkansson, V . Lindblom, O. Gottesman, and F. D. Johansson. Learning to search efficiently for causally near-optimal treatments. Advances in Neural Information Processing Systems , 33:1333–1344, 2020

work page 2020

[12] [13]

S. Han, X. Hu, H. Huang, M. Jiang, and Y . Zhao. Adbench: Anomaly detection benchmark. Advances in Neural Information Processing Systems, 35:32142–32159, 2022

work page 2022

[13] [14]

Higgins, L

I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR (Poster), 3, 2017

work page 2017

[14] [15]

J. Hong, B. Kveton, M. Zaheer, Y. Chow, A. Ahmed, and C. Boutilier. Latent bandits revisited. Advances in Neural Information Processing Systems, 33:13423–13433, 2020

[15] [16]

E. K. Huch, J. Shi, M. R. Abbott, J. R. Golbus, A. Moreno, and W. H. Dempsey. RoME: A robust mixed-effects bandit algorithm for optimizing mobile health interventions. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[16] [17]

A. Hyvarinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in neural information processing systems, 29, 2016

[17] [18]

A. Hyvarinen, H. Sasaki, and R. Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019

[18] [19]

I. Khemakhem, D. Kingma, R. Monti, and A. Hyvarinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020

[19] [20]

N. M. Kinyanjui, E. Carlsson, and F. D. Johansson. Fast treatment personalization with latent bandits in fixed-confidence pure exploration. Transactions on Machine Learning Research, Expert Certification

[21] [22]

N. M. Kinyanjui and F. D. Johansson. ADCB: An Alzheimer’s disease simulator for benchmarking observational estimators of causal effects. In Conference on Health, Inference, and Learning, pages 103–118. PMLR, 2022

[22] [23]

T. Kocák, R. Munos, B. Kveton, S. Agrawal, and M. Valko. Spectral bandits. Journal of Machine Learning Research, 21(218):1–44, 2020

[23] [24]

S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019

[24] [25]

F. Lattimore, T. Lattimore, and M. D. Reid. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016

[25] [26]

T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020

[26] [27]

S. Lee and E. Bareinboim. Structural causal bandits: Where to intervene? Advances in neural information processing systems, 31, 2018

[27] [28]

L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010

[28] [29]

C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. Advances in neural information processing systems, 30, 2017

[29] [30]

Z. Lu, Y. Cheng, M. Zhong, G. Stoian, Y. Yuan, and G. Wang. Causal effect estimation using variational information bottleneck. In International Conference on Web Information Systems and Applications, pages 288–296. Springer, 2022

[30] [32]

O.-A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144. PMLR, 2014

[31] [33]

S. A. Murphy, L. M. Collins, and A. J. Rush. Customizing treatment to the patient: Adaptive treatment strategies, 2007

[32] [34]

B. Oetomo, R. M. Perera, R. Borovica-Gajic, and B. I. Rubinstein. Cutting to the chase with warm-start contextual bandits. Knowledge and Information Systems, 65(9):3533–3565, 2023

[33] [35]

B. Oetomo, R. M. Perera, R. Borovica-Gajic, and B. I. Rubinstein. Warm-starting contextual bandits under latent reward scaling. ICDM, 2024

[34] [36]

J. Pearl. Causality. Cambridge university press, 2009

[35] [37]

N. Radcliffe. Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, pages 14–21, 2007

[36] [38]

V. Rakesh, R. Guo, R. Moraffah, N. Agarwal, and H. Liu. Linked causal variational autoencoder for inferring paired spillover effects. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1679–1682, 2018

[37] [39]

D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015

[38] [40]

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952

[39] [41]

P. R. Rosenbaum, P. Rosenbaum, and Briskman. Design of observational studies, volume 10. Springer, 2010

[40] [42]

D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005

[41] [43]

A. Russo, A. M. Metelli, and M. Restelli. Switching latent bandits. Transactions on Machine Learning Research, 2024

[42] [44]

D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018

[43] [46]

B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021

[44] [47]

R. Sen, K. Shanmugam, M. Kocaoglu, A. Dimakis, and S. Shakkottai. Contextual bandits with latent confounders: An NMF approach. In Artificial Intelligence and Statistics, pages 518–527. PMLR, 2017

[45] [48]

U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning, pages 3076–3085. PMLR, 2017

[46] [49]

J. A. Singh, K. G. Saag, S. L. Bridges Jr, E. A. Akl, R. R. Bannuru, M. C. Sullivan, E. Vaysbrot, C. McNaughton, M. Osani, R. H. Shmerling, et al. 2015 American College of Rheumatology guideline for the treatment of rheumatoid arthritis. Arthritis & rheumatology, 68(1):1–26, 2016

[47] [50]

A. A. Tahami Monfared, N. N. Phan, I. Pearson, J. Mauskopf, M. Cho, Q. Zhang, and H. Hampel. A systematic review of clinical practice guidelines for Alzheimer’s disease and strategies for future advancements. Neurology and therapy, 12(4):1257–1284, 2023

[48] [51]

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933

[49] [52]

R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018

[50] [53]

Y. Wang and M. I. Jordan. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021

[51] [54]

L. Yao, Z. Chu, S. Li, Y. Li, J. Gao, and A. Zhang. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(5):1–46, 2021

[52] [55]

C. Zhang, A. Agarwal, H. Daumé III, J. Langford, and S. N. Negahban. Warm-starting contextual bandits: Robustly combining supervised and bandit feedback. arXiv preprint arXiv:1901.00301, 2019

[53] [56]

K. Zhong, F. Xiao, Y. Ren, Y. Liang, W. Yao, X. Yang, and L. Cen. DESCN: Deep entire space cross networks for individual treatment effect estimation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4612–4620, 2022

[54] [57]

L. Zhou. A survey on contextual multi-armed bandits. arXiv preprint arXiv:1508.03326, 2015

Appendix A Notation

Table 2: Notation. Indices that indicate problem instances i and time points t are dropped when clear from context (e.g., when stated to be fixed in text or in i.i.d. distributions over multiple instances)

Random variables

Zi Latent state for...

+ $\lambda_2 \mathcal{N}(\mu_2, \sigma_2^2)$, where:

• $\lambda_1 = 0.572$ and $\lambda_2 = 0.428$ are the mixture weights with $\lambda_1 + \lambda_2 = 1$,
• $\mu_1 = 0.0979$ and $\mu_2 = 0.1986$ are the means of the Gaussian components,
• $\sigma_1^2 = 0.000541$ and $\sigma_2^2 = 0.000752$ are the variances of the Gaussian components.

Figure 13: Histogram over 5... (horizontal axis: Z; vertical axis: normalized frequency)
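As a rough illustration of the two-component Gaussian mixture whose parameters are listed above, the following Python sketch draws samples from it using only the standard library. It assumes that the truncated first term of the density is λ1 N(µ1, σ1²), which the surviving text does not show explicitly. The names `sample_mixture` and `mixture_mean` are hypothetical helpers introduced here for illustration and are not taken from the paper.

```python
import random

# Mixture parameters as quoted in the text.
WEIGHTS = (0.572, 0.428)          # lambda_1, lambda_2
MEANS = (0.0979, 0.1986)          # mu_1, mu_2
VARIANCES = (0.000541, 0.000752)  # sigma_1^2, sigma_2^2


def sample_mixture(n, seed=0):
    """Draw n samples from the assumed two-component Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick a component according to the mixture weights.
        k = 0 if rng.random() < WEIGHTS[0] else 1
        # random.gauss expects a standard deviation, not a variance.
        samples.append(rng.gauss(MEANS[k], VARIANCES[k] ** 0.5))
    return samples


def mixture_mean():
    """Closed-form mean of the mixture: sum_k lambda_k * mu_k."""
    return sum(w * m for w, m in zip(WEIGHTS, MEANS))


if __name__ == "__main__":
    draws = sample_mixture(20000)
    print("closed-form mean:", mixture_mean())
    print("empirical mean:  ", sum(draws) / len(draws))
```

With enough draws, the empirical mean should sit close to the closed-form value, and a histogram of the draws should show two modes near the component means.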