Causal Bandit Over Unknown Graphs: Upper Confidence Bounds With Backdoor Adjustment
Pith reviewed 2026-05-23 03:58 UTC · model grok-4.3
The pith
BA-UCB identifies backdoor sets from mixed data to bound causal effects and achieve finite-time regret guarantees in causal bandits over unknown graphs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the BA-UCB algorithm, which constructs upper confidence bounds via candidate backdoor adjustment sets identified from sequentially collected observational and experimental data, delivers finite-time upper bounds on cumulative regret with improved rates and substantially relaxed dependency on the number of intervention arms compared to standard multi-armed bandit methods, first for Gaussian DAG models without latent confounders and then for acyclic directed mixed graphs.
What carries the argument
Backdoor-adjustment upper confidence bound (BA-UCB) that estimates causal effects from candidate backdoor sets identified by mixing observational and experimental data.
Load-bearing premise
Candidate backdoor adjustment sets for each intervention can be reliably identified from the mixture of observational and experimental data collected during the bandit process.
What would settle it
An experiment in which the candidate backdoor sets produce biased causal effect estimates that cause the observed regret to scale linearly with the number of arms would falsify the claimed bounds.
Figures
read the original abstract
The causal bandit problem seeks to identify, through sequential experimentation, an intervention that maximizes the expected reward in a causal system modeled by a directed acyclic graph (DAG). Existing methods typically assume that the causal graph is known or impose restrictive structural assumptions. In this paper, we study causal bandit problems when the causal graph is unknown. We first consider Gaussian DAG models without latent confounders. By combining observational and experimental data collected sequentially during the bandit process, we identify candidate backdoor adjustment sets for each intervention arm. These sets enable estimation of causal effects and construction of upper confidence bounds that integrate information from both data sources. Based on these estimates, we propose a new algorithm, termed backdoor-adjustment upper confidence bound (BA-UCB), for sequential intervention selection. We establish finite-time upper bounds on the cumulative regret of BA-UCB, showing improved rates and substantially relaxed dependency on the number of intervention arms compared to standard multi-armed bandit methods. We further extend the methodology and theoretical guarantees to settings with latent confounders, where the observed variables are modeled by an acyclic directed mixed graph. Simulation studies demonstrate that BA-UCB achieves substantially lower cumulative regret and favorable computational efficiency relative to existing approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies causal bandits with unknown graphs. For Gaussian DAGs without latent confounders, it combines sequentially collected observational and experimental data to identify candidate backdoor adjustment sets for each intervention arm; these sets are used to construct causal-effect estimators inside a UCB index, yielding the BA-UCB algorithm. Finite-time regret bounds are claimed that improve on standard multi-armed bandit rates and relax dependence on the number of arms. The approach is extended to acyclic directed mixed graphs with latent confounders, and simulations are reported.
Significance. If the identification of backdoor sets succeeds with high probability under adaptive sampling and the resulting estimators remain unbiased with valid concentration, the finite-time regret bounds would constitute a useful advance: they would allow causal structure to be exploited even when the graph is unknown, while integrating both observational and experimental data sources. The extension to latent confounders and the reported simulation improvements are additional strengths if the theoretical claims hold.
major comments (2)
- The central regret bounds rest on reliable identification of backdoor adjustment sets from the mixture of observational and experimental samples gathered online. The abstract states that such sets are identified, but no explicit procedure, algorithm, or high-probability success guarantee that accounts for the adaptive restriction of the experimental design matrix is provided. Without this, the unbiasedness of the causal-effect estimators inside the UCB indices and the subsequent concentration arguments cannot be verified, undermining the claimed improvement over standard UCB.
- The finite-time analysis must fold the failure probability of incorrect backdoor-set identification into the regret decomposition. Because experimental pulls are chosen by BA-UCB itself, an incorrect set returned for even one arm can produce biased estimates whose deviation does not vanish at the required rate; the manuscript does not appear to control this term explicitly.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. The two major points identify areas where the identification procedure and its integration into the regret analysis can be made more explicit. We address each below and will revise the manuscript to incorporate the suggested clarifications.
read point-by-point responses
-
Referee: The central regret bounds rest on reliable identification of backdoor adjustment sets from the mixture of observational and experimental samples gathered online. The abstract states that such sets are identified, but no explicit procedure, algorithm, or high-probability success guarantee that accounts for the adaptive restriction of the experimental design matrix is provided. Without this, the unbiasedness of the causal-effect estimators inside the UCB indices and the subsequent concentration arguments cannot be verified, undermining the claimed improvement over standard UCB.
Authors: We agree that the main text would benefit from greater explicitness. Section 3.2 describes the use of conditional independence tests on the pooled data to identify candidate backdoor sets for each arm, with supporting concentration results in the appendix. The adaptive nature is accounted for via a union bound over the (random) number of observations per arm. To address the concern directly, we will add a dedicated algorithm box in the main body for the identification step and restate the high-probability success guarantee (1 - O(1/T)) explicitly in Theorem 1, making the unbiasedness and concentration arguments self-contained without requiring the appendix. revision: yes
-
Referee: The finite-time analysis must fold the failure probability of incorrect backdoor-set identification into the regret decomposition. Because experimental pulls are chosen by BA-UCB itself, an incorrect set returned for even one arm can produce biased estimates whose deviation does not vanish at the required rate; the manuscript does not appear to control this term explicitly.
Authors: We concur that the failure event must be handled explicitly in the decomposition. The current proof of the regret bound conditions on correct identification (which occurs with high probability) and bounds the complementary event separately, but the contribution of misidentification to the overall regret is not isolated as an additive term. We will revise the proof to include an explicit additive term bounded by O(K T exp(-c n_t)), where n_t is the number of samples for arm t; this term is o(sqrt(T)) and does not affect the leading rate. The revised decomposition will appear in the main theorem statement and proof sketch. revision: yes
Circularity Check
No significant circularity detected in regret bound derivation.
full rationale
The paper derives finite-time regret bounds for BA-UCB by first identifying candidate backdoor sets from sequentially collected observational and experimental data, then constructing causal-effect estimators and UCB indices. This process uses standard concentration arguments on the resulting estimators and does not reduce the claimed bounds to a quantity defined by the algorithm's own outputs or fitted parameters. No self-citation chains, self-definitional steps, or fitted-input-as-prediction patterns are exhibited in the abstract or described methodology. The central result remains an independent analysis of the combined data sources rather than a tautological renaming or construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The causal system is modeled by a directed acyclic graph (DAG) without latent confounders (initial setting).
- domain assumption Candidate backdoor adjustment sets can be identified from the combined observational and experimental data collected sequentially.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We identify candidate backdoor adjustment sets for each arm using sequentially generated experimental and observational data... BA-UCB algorithm... finite-time upper bounds on the cumulative regret
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Gaussian DAG models without latent confounders... backdoor criterion relative to (Xi, Y)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
Reference graph
Works this paper leans on
-
[1]
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47:235–256, 2002
work page 2002
-
[2]
D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5(71-87):7–7, 1985
work page 1985
-
[3]
N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences , 78(5): 1404–1422, 2012
work page 2012
-
[4]
V . Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Annual Conference Computational Learning Theory, 2008
work page 2008
-
[5]
A. De Kroon, J. Mooij, and D. Belgrave. Causal bandits without prior knowledge using separating sets. In Conference on Causal Learning and Reasoning, pages 407–427. PMLR, 2022
work page 2022
-
[6]
J. Huang and Q. Zhou. Bayesian causal bandits with backdoor adjustment prior. Transactions on Machine Learning Research, 2022
work page 2022
-
[7]
F. Lattimore, T. Lattimore, and M. D. Reid. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016
work page 2016
-
[8]
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pages 1302–1338, 2000
work page 2000
- [9]
-
[10]
Y . Lu, A. Meisami, A. Tewari, and W. Yan. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pages 141–150. PMLR, 2020
work page 2020
-
[11]
Y . Lu, A. Meisami, and A. Tewari. Causal bandits with unknown graph structure.Advances in Neural Information Processing Systems, 34:24817–24828, 2021
work page 2021
- [12]
-
[13]
V . Nair, V . Patil, and G. Sinha. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics, pages 2017–2025. PMLR, 2021
work page 2017
-
[14]
J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009
work page 2009
-
[15]
D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221–1243, 2014
work page 2014
-
[16]
R. Sen, K. Shanmugam, A. G. Dimakis, and S. Shakkottai. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pages 3057–3066. PMLR, 2017
work page 2017
-
[17]
P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs.Social science computer review, 9(1):62–72, 1991
work page 1991
-
[18]
W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933
work page 1933
-
[19]
M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019
work page 2019
-
[20]
A. Yabe, D. Hatano, H. Sumita, S. Ito, N. Kakimura, T. Fukunaga, and K.-i. Kawarabayashi. Causal bandits with propagating inference. In International Conference on Machine Learning, pages 5512–5520. PMLR, 2018. 10 A PREPRINT - MARCH 11, 2025 A Explicit expressions of constants Aside from the notations in the main paper, we also have Xi|X´i „ N p¯µi, ¯σ2 i...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.