Causal Bandit Over Unknown Graphs: Upper Confidence Bounds With Backdoor Adjustment

Qing Zhou; Yijia Zhao

arxiv: 2502.02020 · v3 · submitted 2025-02-04 · 💻 cs.LG · stat.ME

Causal Bandit Over Unknown Graphs: Upper Confidence Bounds With Backdoor Adjustment

Yijia Zhao , Qing Zhou This is my paper

Pith reviewed 2026-05-23 03:58 UTC · model grok-4.3

classification 💻 cs.LG stat.ME

keywords causal banditsunknown graphsbackdoor adjustmentupper confidence boundsregret boundsGaussian DAGlatent confoundersintervention selection

0 comments

The pith

BA-UCB identifies backdoor sets from mixed data to bound causal effects and achieve finite-time regret guarantees in causal bandits over unknown graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes finite-time upper bounds on the cumulative regret of a new algorithm called BA-UCB for causal bandit problems where the underlying graph is unknown. It does so by combining observational and experimental data collected during the process to identify candidate backdoor adjustment sets for each intervention arm. These sets support causal effect estimation and upper confidence bounds that integrate both data sources. The resulting regret bounds improve on standard multi-armed bandit rates and relax the dependence on the number of intervention arms. The approach begins with Gaussian DAG models without latent confounders and extends to acyclic directed mixed graphs that allow latent confounders.

Core claim

The central claim is that the BA-UCB algorithm, which constructs upper confidence bounds via candidate backdoor adjustment sets identified from sequentially collected observational and experimental data, delivers finite-time upper bounds on cumulative regret with improved rates and substantially relaxed dependency on the number of intervention arms compared to standard multi-armed bandit methods, first for Gaussian DAG models without latent confounders and then for acyclic directed mixed graphs.

What carries the argument

Backdoor-adjustment upper confidence bound (BA-UCB) that estimates causal effects from candidate backdoor sets identified by mixing observational and experimental data.

Load-bearing premise

Candidate backdoor adjustment sets for each intervention can be reliably identified from the mixture of observational and experimental data collected during the bandit process.

What would settle it

An experiment in which the candidate backdoor sets produce biased causal effect estimates that cause the observed regret to scale linearly with the number of arms would falsify the claimed bounds.

Figures

Figures reproduced from arXiv: 2502.02020 by Qing Zhou, Yijia Zhao.

**Figure 2.** Figure 2: Comparison of empirical cumulative regrets over 5000 time steps for BA-UCB, CN-UCB, and UCB algorithms [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Ranges of empirical cumulative regrets between 5% quantile and 95% quantile at [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Boxplots of cumulative regrets after T “ 5000 rounds for BA-UCB, CN-UCB, UCB algorithms when p “ 20, 30, 50. Left: Optimal arm is a parent of the reward. Right: Optimal arm is not a parent of the reward. C.1 Some general bounds Theorem 3. Let X P R nˆp be drawn according to the Σ-Gaussian ensemble. For any i P t1, ¨ ¨ ¨ , pu and any S Ă t1, ¨ ¨ ¨ , puztiu, denote Si “ rXi | Ss and ρipt, Sq “ «ˆ XJ Si XSi N… view at source ↗

**Figure 5.** Figure 5: Comparison of empirical cumulative regrets of BA-UCB algorithms with estimates generated from weighted [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

read the original abstract

The causal bandit problem seeks to identify, through sequential experimentation, an intervention that maximizes the expected reward in a causal system modeled by a directed acyclic graph (DAG). Existing methods typically assume that the causal graph is known or impose restrictive structural assumptions. In this paper, we study causal bandit problems when the causal graph is unknown. We first consider Gaussian DAG models without latent confounders. By combining observational and experimental data collected sequentially during the bandit process, we identify candidate backdoor adjustment sets for each intervention arm. These sets enable estimation of causal effects and construction of upper confidence bounds that integrate information from both data sources. Based on these estimates, we propose a new algorithm, termed backdoor-adjustment upper confidence bound (BA-UCB), for sequential intervention selection. We establish finite-time upper bounds on the cumulative regret of BA-UCB, showing improved rates and substantially relaxed dependency on the number of intervention arms compared to standard multi-armed bandit methods. We further extend the methodology and theoretical guarantees to settings with latent confounders, where the observed variables are modeled by an acyclic directed mixed graph. Simulation studies demonstrate that BA-UCB achieves substantially lower cumulative regret and favorable computational efficiency relative to existing approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BA-UCB identifies backdoor sets from mixed observational-experimental data to run UCB in unknown-graph causal bandits and supplies finite-time regret bounds, but the adaptive identification step is the part that needs the most checking.

read the letter

The main thing to know is that this paper relaxes the known-graph assumption in causal bandits by identifying candidate backdoor sets on the fly from a mix of observational and experimental samples, then builds UCBs around the resulting causal-effect estimates. They call the algorithm BA-UCB and give finite-time regret bounds that improve on standard multi-armed bandit rates and reduce dependence on the number of arms. They also extend the approach to acyclic directed mixed graphs for latent confounders and run simulations that show lower cumulative regret than baselines plus decent computational cost.

Referee Report

2 major / 0 minor

Summary. The paper studies causal bandits with unknown graphs. For Gaussian DAGs without latent confounders, it combines sequentially collected observational and experimental data to identify candidate backdoor adjustment sets for each intervention arm; these sets are used to construct causal-effect estimators inside a UCB index, yielding the BA-UCB algorithm. Finite-time regret bounds are claimed that improve on standard multi-armed bandit rates and relax dependence on the number of arms. The approach is extended to acyclic directed mixed graphs with latent confounders, and simulations are reported.

Significance. If the identification of backdoor sets succeeds with high probability under adaptive sampling and the resulting estimators remain unbiased with valid concentration, the finite-time regret bounds would constitute a useful advance: they would allow causal structure to be exploited even when the graph is unknown, while integrating both observational and experimental data sources. The extension to latent confounders and the reported simulation improvements are additional strengths if the theoretical claims hold.

major comments (2)

The central regret bounds rest on reliable identification of backdoor adjustment sets from the mixture of observational and experimental samples gathered online. The abstract states that such sets are identified, but no explicit procedure, algorithm, or high-probability success guarantee that accounts for the adaptive restriction of the experimental design matrix is provided. Without this, the unbiasedness of the causal-effect estimators inside the UCB indices and the subsequent concentration arguments cannot be verified, undermining the claimed improvement over standard UCB.
The finite-time analysis must fold the failure probability of incorrect backdoor-set identification into the regret decomposition. Because experimental pulls are chosen by BA-UCB itself, an incorrect set returned for even one arm can produce biased estimates whose deviation does not vanish at the required rate; the manuscript does not appear to control this term explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The two major points identify areas where the identification procedure and its integration into the regret analysis can be made more explicit. We address each below and will revise the manuscript to incorporate the suggested clarifications.

read point-by-point responses

Referee: The central regret bounds rest on reliable identification of backdoor adjustment sets from the mixture of observational and experimental samples gathered online. The abstract states that such sets are identified, but no explicit procedure, algorithm, or high-probability success guarantee that accounts for the adaptive restriction of the experimental design matrix is provided. Without this, the unbiasedness of the causal-effect estimators inside the UCB indices and the subsequent concentration arguments cannot be verified, undermining the claimed improvement over standard UCB.

Authors: We agree that the main text would benefit from greater explicitness. Section 3.2 describes the use of conditional independence tests on the pooled data to identify candidate backdoor sets for each arm, with supporting concentration results in the appendix. The adaptive nature is accounted for via a union bound over the (random) number of observations per arm. To address the concern directly, we will add a dedicated algorithm box in the main body for the identification step and restate the high-probability success guarantee (1 - O(1/T)) explicitly in Theorem 1, making the unbiasedness and concentration arguments self-contained without requiring the appendix. revision: yes
Referee: The finite-time analysis must fold the failure probability of incorrect backdoor-set identification into the regret decomposition. Because experimental pulls are chosen by BA-UCB itself, an incorrect set returned for even one arm can produce biased estimates whose deviation does not vanish at the required rate; the manuscript does not appear to control this term explicitly.

Authors: We concur that the failure event must be handled explicitly in the decomposition. The current proof of the regret bound conditions on correct identification (which occurs with high probability) and bounds the complementary event separately, but the contribution of misidentification to the overall regret is not isolated as an additive term. We will revise the proof to include an explicit additive term bounded by O(K T exp(-c n_t)), where n_t is the number of samples for arm t; this term is o(sqrt(T)) and does not affect the leading rate. The revised decomposition will appear in the main theorem statement and proof sketch. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in regret bound derivation.

full rationale

The paper derives finite-time regret bounds for BA-UCB by first identifying candidate backdoor sets from sequentially collected observational and experimental data, then constructing causal-effect estimators and UCB indices. This process uses standard concentration arguments on the resulting estimators and does not reduce the claimed bounds to a quantity defined by the algorithm's own outputs or fitted parameters. No self-citation chains, self-definitional steps, or fitted-input-as-prediction patterns are exhibited in the abstract or described methodology. The central result remains an independent analysis of the combined data sources rather than a tautological renaming or construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full manuscript not available, so ledger entries are limited to those explicitly stated in the abstract. No free parameters, invented entities, or additional axioms beyond the stated modeling assumptions can be confirmed.

axioms (2)

domain assumption The causal system is modeled by a directed acyclic graph (DAG) without latent confounders (initial setting).
Stated in the abstract as the first considered case.
domain assumption Candidate backdoor adjustment sets can be identified from the combined observational and experimental data collected sequentially.
Central to the algorithm description in the abstract.

pith-pipeline@v0.9.0 · 5735 in / 1341 out tokens · 43270 ms · 2026-05-23T03:58:26.973887+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify candidate backdoor adjustment sets for each arm using sequentially generated experimental and observational data... BA-UCB algorithm... finite-time upper bounds on the cumulative regret
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Gaussian DAG models without latent confounders... backdoor criterion relative to (Xi, Y)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
cs.LG 2026-05 unverdicted novelty 7.0

CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper

[1]

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47:235–256, 2002

work page 2002
[2]

D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5(71-87):7–7, 1985

work page 1985
[3]

Cesa-Bianchi and G

N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences , 78(5): 1404–1422, 2012

work page 2012
[4]

V . Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Annual Conference Computational Learning Theory, 2008

work page 2008
[5]

De Kroon, J

A. De Kroon, J. Mooij, and D. Belgrave. Causal bandits without prior knowledge using separating sets. In Conference on Causal Learning and Reasoning, pages 407–427. PMLR, 2022

work page 2022
[6]

Huang and Q

J. Huang and Q. Zhou. Bayesian causal bandits with backdoor adjustment prior. Transactions on Machine Learning Research, 2022

work page 2022
[7]

Lattimore, T

F. Lattimore, T. Lattimore, and M. D. Reid. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016

work page 2016
[8]

Laurent and P

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pages 1302–1338, 2000

work page 2000
[9]

Lee and E

S. Lee and E. Bareinboim. Structural causal bandits: Where to intervene? Advances in neural information processing systems, 31, 2018

work page 2018
[10]

Y . Lu, A. Meisami, A. Tewari, and W. Yan. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pages 141–150. PMLR, 2020

work page 2020
[11]

Y . Lu, A. Meisami, and A. Tewari. Causal bandits with unknown graph structure.Advances in Neural Information Processing Systems, 34:24817–24828, 2021

work page 2021
[12]

Maiti, V

A. Maiti, V . Nair, and G. Sinha. A causal bandit approach to learning good atomic interventions in presence of unobserved confounders. In Uncertainty in Artificial Intelligence, pages 1328–1338. PMLR, 2022

work page 2022
[13]

V . Nair, V . Patil, and G. Sinha. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics, pages 2017–2025. PMLR, 2021

work page 2017
[14]

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009

work page 2009
[15]

Russo and B

D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221–1243, 2014

work page 2014
[16]

R. Sen, K. Shanmugam, A. G. Dimakis, and S. Shakkottai. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pages 3057–3066. PMLR, 2017

work page 2017
[17]

Spirtes and C

P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs.Social science computer review, 9(1):62–72, 1991

work page 1991
[18]

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933

work page 1933
[19]

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

work page 2019
[20]

Tÿ t“1 1 tAt “ iu ď ℓ ` Tÿ t“ℓ`1 1 tAt “ i, nipt ´ 1q ě ℓu ď ℓ ` T ´1ÿ t“ℓ 1 tpµ˚ptq ` pσ˚ptqct ď pµiptq ` pσiptqct, nipt ´ 1q ě ℓu ď ℓ ` T ´1ÿ t“ℓ 1

A. Yabe, D. Hatano, H. Sumita, S. Ito, N. Kakimura, T. Fukunaga, and K.-i. Kawarabayashi. Causal bandits with propagating inference. In International Conference on Machine Learning, pages 5512–5520. PMLR, 2018. 10 A PREPRINT - MARCH 11, 2025 A Explicit expressions of constants Aside from the notations in the main paper, we also have Xi|X´i „ N p¯µi, ¯σ2 i...

work page 2018

[1] [1]

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.Machine learning, 47:235–256, 2002

work page 2002

[2] [2]

D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5(71-87):7–7, 1985

work page 1985

[3] [3]

Cesa-Bianchi and G

N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences , 78(5): 1404–1422, 2012

work page 2012

[4] [4]

V . Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Annual Conference Computational Learning Theory, 2008

work page 2008

[5] [5]

De Kroon, J

A. De Kroon, J. Mooij, and D. Belgrave. Causal bandits without prior knowledge using separating sets. In Conference on Causal Learning and Reasoning, pages 407–427. PMLR, 2022

work page 2022

[6] [6]

Huang and Q

J. Huang and Q. Zhou. Bayesian causal bandits with backdoor adjustment prior. Transactions on Machine Learning Research, 2022

work page 2022

[7] [7]

Lattimore, T

F. Lattimore, T. Lattimore, and M. D. Reid. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016

work page 2016

[8] [8]

Laurent and P

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pages 1302–1338, 2000

work page 2000

[9] [9]

Lee and E

S. Lee and E. Bareinboim. Structural causal bandits: Where to intervene? Advances in neural information processing systems, 31, 2018

work page 2018

[10] [10]

Y . Lu, A. Meisami, A. Tewari, and W. Yan. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pages 141–150. PMLR, 2020

work page 2020

[11] [11]

Y . Lu, A. Meisami, and A. Tewari. Causal bandits with unknown graph structure.Advances in Neural Information Processing Systems, 34:24817–24828, 2021

work page 2021

[12] [12]

Maiti, V

A. Maiti, V . Nair, and G. Sinha. A causal bandit approach to learning good atomic interventions in presence of unobserved confounders. In Uncertainty in Artificial Intelligence, pages 1328–1338. PMLR, 2022

work page 2022

[13] [13]

V . Nair, V . Patil, and G. Sinha. Budgeted and non-budgeted causal bandits. In International Conference on Artificial Intelligence and Statistics, pages 2017–2025. PMLR, 2021

work page 2017

[14] [14]

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009

work page 2009

[15] [15]

Russo and B

D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221–1243, 2014

work page 2014

[16] [16]

R. Sen, K. Shanmugam, A. G. Dimakis, and S. Shakkottai. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pages 3057–3066. PMLR, 2017

work page 2017

[17] [17]

Spirtes and C

P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs.Social science computer review, 9(1):62–72, 1991

work page 1991

[18] [18]

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933

work page 1933

[19] [19]

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

work page 2019

[20] [20]

Tÿ t“1 1 tAt “ iu ď ℓ ` Tÿ t“ℓ`1 1 tAt “ i, nipt ´ 1q ě ℓu ď ℓ ` T ´1ÿ t“ℓ 1 tpµ˚ptq ` pσ˚ptqct ď pµiptq ` pσiptqct, nipt ´ 1q ě ℓu ď ℓ ` T ´1ÿ t“ℓ 1

A. Yabe, D. Hatano, H. Sumita, S. Ito, N. Kakimura, T. Fukunaga, and K.-i. Kawarabayashi. Causal bandits with propagating inference. In International Conference on Machine Learning, pages 5512–5520. PMLR, 2018. 10 A PREPRINT - MARCH 11, 2025 A Explicit expressions of constants Aside from the notations in the main paper, we also have Xi|X´i „ N p¯µi, ¯σ2 i...

work page 2018