Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making
Pith reviewed 2026-05-07 16:23 UTC · model grok-4.3
The pith
Budget-Constrained Causal Bandits learn ad effectiveness and allocate spending simultaneously, outperforming offline methods in low-data cold-start scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Budget-Constrained Causal Bandits (BCCB) is an online framework that unifies learning individual-level ad effectiveness, exploring uncertain users, and pacing the budget over time in a single sequential process. On the Criteo Uplift dataset, which comes from a randomized controlled trial, BCCB achieves reliable performance from the first user, while offline two-stage methods need around 10,000 historical observations to do so. It also shows 3-5 times lower run-to-run variance in performance and outperforms other online methods, namely standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation, across the budget levels tested.
What carries the argument
The Budget-Constrained Causal Bandits (BCCB) framework, which integrates heterogeneous treatment effect estimation, exploration, and budget pacing into sequential decisions for each user.
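As a rough illustration of how these three components could meet in a single per-user decision, the sketch below uses a toy design that is not taken from the paper. It assumes users fall into discrete segments, each segment keeps Beta posteriors over treated and control conversion rates, a sampled uplift provides the exploration, and a pacing threshold rises when spend runs ahead of a linear schedule. Every name here (`SegmentPosterior`, `pacing_threshold`, `decide`) and the threshold rule itself are assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SegmentPosterior:
    """Beta posteriors over conversion rates for one user segment (toy model)."""
    treated: list = field(default_factory=lambda: [1.0, 1.0])  # [alpha, beta]
    control: list = field(default_factory=lambda: [1.0, 1.0])  # [alpha, beta]

    def sample_uplift(self, rng: random.Random) -> float:
        # Thompson-style draw: wide posteriors give widely varying samples,
        # so uncertain segments are sometimes treated just to learn about them.
        return rng.betavariate(*self.treated) - rng.betavariate(*self.control)

    def update(self, treated: bool, converted: bool) -> None:
        # Only the arm that was actually played gets an observation.
        arm = self.treated if treated else self.control
        arm[0 if converted else 1] += 1.0


def pacing_threshold(spent: float, budget: float, step: int, horizon: int,
                     strength: float = 0.05) -> float:
    """Hypothetical pacing rule: raise the bar when spend is ahead of schedule."""
    planned = budget * (step + 1) / horizon  # linear spending schedule
    return strength * max(0.0, spent / planned - 1.0)


def decide(posterior: SegmentPosterior, spent: float, budget: float,
           cost: float, step: int, horizon: int, rng: random.Random) -> bool:
    """Treat when the sampled uplift per unit cost clears the pacing threshold."""
    if spent + cost > budget:
        return False  # hard budget constraint
    sampled = posterior.sample_uplift(rng)
    return sampled / cost > pacing_threshold(spent, budget, step, horizon)
```

In this toy version the learning, exploration, and pacing pieces are separable functions; the paper's claim is about combining them in one sequential process, and its actual estimator and pacing rule may differ substantially.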
Load-bearing premise
That sequential learning of user responses under budget constraints avoids the biases and instabilities of offline estimation in data-scarce settings, and that the dataset represents typical cold-start advertising cases.
What would settle it
An experiment comparing BCCB and offline methods on a new campaign with very few initial users, measuring whether BCCB's allocation yields higher uplift or lower cost per conversion than offline methods trained on small data.
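The two quantities named in this test can be computed directly from randomized data. The sketch below uses hypothetical function names and made-up counts: uplift is estimated as the difference in conversion rates between treated and control users, and cost per conversion as ad spend divided by conversions among treated users.

```python
def uplift(conv_treated: int, n_treated: int,
           conv_control: int, n_control: int) -> float:
    """Difference in conversion rates between treated and control groups."""
    return conv_treated / n_treated - conv_control / n_control


def cost_per_conversion(spend: float, conv_treated: int) -> float:
    """Ad spend divided by conversions among treated users."""
    if conv_treated == 0:
        return float("inf")  # no conversions: cost per conversion is unbounded
    return spend / conv_treated


# Made-up example: 1,000 users per arm, 60 treated and 40 control conversions.
example_uplift = uplift(60, 1000, 40, 1000)    # roughly 0.02
example_cpc = cost_per_conversion(500.0, 60)   # roughly 8.33
```

A comparison along these lines would report both numbers for each method at matched budgets, since a method can lower cost per conversion simply by targeting users who would have converted anyway.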
Original abstract
Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline: first collect historical data to estimate heterogeneous treatment effects (HTE), then solve a constrained optimization to allocate the budget. This works well with abundant data, but fails in cold-start settings such as new campaigns, new markets, or new customer segments where little historical data exists. We propose Budget-Constrained Causal Bandits (BCCB), an online framework that learns which users respond to ads while simultaneously spending the budget, making treatment decisions one user at a time. BCCB unifies three components into a single sequential process: learning individual-level ad effectiveness, exploring users whose response is uncertain, and pacing the budget over time. We evaluate BCCB on the Criteo Uplift dataset, a large-scale advertising dataset from a real randomized controlled trial. Our key finding is a data-efficiency crossover: offline methods require approximately 10,000 historical observations to produce reliable results, while BCCB operates effectively from the very first user. Furthermore, BCCB exhibits 3-5x lower performance variance between runs, making it more practical for real campaign planning. Among purely online methods, BCCB consistently outperforms standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation across all budget levels tested.
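For contrast, a two-stage offline pipeline of the kind the abstract describes might look like the following sketch. It is a deliberately small stand-in, not the paper's baseline: stage one estimates per-segment uplift as a difference in conversion rates from logged randomized data, and stage two greedily treats the segments with the highest estimated uplift until the budget runs out. The segment labels, the uniform per-user cost, and both function names are hypothetical.

```python
from collections import defaultdict


def estimate_segment_uplift(logs):
    """Stage 1: per-segment uplift from (segment, treated, converted) tuples."""
    counts = defaultdict(lambda: [0, 0, 0, 0])  # [n_t, conv_t, n_c, conv_c]
    for segment, treated, converted in logs:
        c = counts[segment]
        if treated:
            c[0] += 1
            c[1] += int(converted)
        else:
            c[2] += 1
            c[3] += int(converted)
    uplift = {}
    for segment, (n_t, conv_t, n_c, conv_c) in counts.items():
        if n_t and n_c:  # an estimate needs observations in both arms
            uplift[segment] = conv_t / n_t - conv_c / n_c
    return uplift


def allocate_budget(uplift, segment_sizes, cost_per_user, budget):
    """Stage 2: greedy allocation; returns {segment: users to treat}."""
    plan = {}
    remaining = budget
    # With a uniform cost, ranking by uplift equals ranking by uplift per cost.
    for segment in sorted(uplift, key=uplift.get, reverse=True):
        if uplift[segment] <= 0 or remaining < cost_per_user:
            break
        n = min(segment_sizes[segment], int(remaining // cost_per_user))
        plan[segment] = n
        remaining -= n * cost_per_user
    return plan
```

With only a handful of logged users per segment, the stage-one estimates are noisy and the greedy plan can lock onto the wrong segments, which is the cold-start weakness the abstract attributes to this style of pipeline.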
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Budget-Constrained Causal Bandits (BCCB), an online sequential decision framework that unifies heterogeneous treatment effect (HTE) learning, exploration of uncertain users, and budget pacing for ad allocation. It contrasts this with standard two-stage offline pipelines (HTE estimation followed by constrained optimization) and evaluates on the Criteo Uplift RCT dataset, claiming a data-efficiency crossover: offline methods require ~10,000 historical observations for reliable performance while BCCB works from the first user, exhibits 3-5x lower run-to-run variance, and outperforms Thompson Sampling variants and greedy HTE baselines across tested budget levels.
Significance. If the evaluation protocol is shown to be robust, the result would be significant for cold-start advertising and uplift modeling applications. The concrete empirical crossover point and variance reduction on a public large-scale RCT dataset provide falsifiable, reproducible evidence of practical data efficiency that is rare in this area; the unification of causal bandits with explicit budget constraints is a clear conceptual contribution.
major comments (2)
- [Evaluation section, Abstract] The central data-efficiency and stability claims rest on simulation using the full Criteo RCT logged data. Because BCCB selects treatments adaptively based on running HTE estimates, the observed (context, treatment, outcome) tuples are no longer exchangeable with the original RCT distribution. Standard HTE estimators fitted on this data can inherit the same selection bias and variance inflation that the paper attributes only to offline pipelines. The manuscript must clarify whether the evaluation uses only the realized outcome for the chosen arm (as would occur in deployment) or exploits both potential outcomes available in the logged RCT; the latter would mask the very instability the method claims to avoid.
- [Abstract, experimental results] The reported 10k crossover and 3-5x variance reduction are presented without the number of independent runs, statistical significance tests, confidence intervals, or sensitivity to hyperparameter choices and random seeds. These omissions make it impossible to assess whether the claimed reliability advantage is load-bearing or could be an artifact of post-hoc analysis decisions.
minor comments (2)
- [Notation] The notation for budget remaining, HTE estimates, and exploration parameters should be defined once in a dedicated notation table or section and used consistently thereafter.
- [Figures] Figure captions should explicitly state the number of runs and error bars used for the variance comparison.
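The reporting requested in these comments is straightforward to produce once per-run results exist. The sketch below uses only the Python standard library and hypothetical function names. It computes a mean with an approximate 95% confidence interval across runs, plus the ratio of run-to-run variances between two methods. The normal critical value 1.96 is a simplifying assumption; with a small number of runs a Student-t quantile would give a wider and more appropriate interval.

```python
import math
import statistics


def mean_ci95(values):
    """Mean and approximate 95% confidence interval across independent runs."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem  # normal approximation (assumption)
    return mean, mean - half_width, mean + half_width


def variance_ratio(runs_a, runs_b):
    """Sample variance of method A's runs divided by that of method B's runs."""
    return statistics.variance(runs_a) / statistics.variance(runs_b)
```

A figure caption could then state the number of runs, that error bars show the interval returned by `mean_ci95`, and the variance ratio behind any "N times lower variance" statement.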
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. These points help strengthen the clarity of our evaluation protocol and the statistical rigor of our claims. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Evaluation section, Abstract] The central data-efficiency and stability claims rest on simulation using the full Criteo RCT logged data. Because BCCB selects treatments adaptively based on running HTE estimates, the observed (context, treatment, outcome) tuples are no longer exchangeable with the original RCT distribution. Standard HTE estimators fitted on this data can inherit the same selection bias and variance inflation that the paper attributes only to offline pipelines. The manuscript must clarify whether the evaluation uses only the realized outcome for the chosen arm (as would occur in deployment) or exploits both potential outcomes available in the logged RCT; the latter would mask the very instability the method claims to avoid.
Authors: We appreciate this important clarification request. In our simulation, we process the Criteo users sequentially and use only the realized outcome for the treatment actually selected by BCCB at each step, exactly as would occur in deployment. The dataset provides a single observed outcome per user under the original RCT randomization. When BCCB's choice matches the logged treatment, we observe that outcome and use it to update the model and compute the reward. When the choice does not match, the outcome for that user is not observed in the simulation. We do not access or exploit counterfactual potential outcomes. This adaptive sampling necessarily produces a non-exchangeable observed dataset, but that is the realistic online setting we study and the source of the data-efficiency advantage relative to offline pipelines trained on fixed RCT subsets. We will add an explicit description of the simulation loop (including pseudocode) and a short discussion of the resulting selection effects to the Evaluation section.
Revision: yes
- Referee: [Abstract, experimental results] The reported 10k crossover and 3-5x variance reduction are presented without the number of independent runs, statistical significance tests, confidence intervals, or sensitivity to hyperparameter choices and random seeds. These omissions make it impossible to assess whether the claimed reliability advantage is load-bearing or could be an artifact of post-hoc analysis decisions.
Authors: We agree that these details are required for proper assessment. All reported results (including the 10k crossover and variance reduction) were obtained by averaging over 20 independent runs with distinct random seeds; the original manuscript omitted the exact count, error bars, and sensitivity checks. In the revision we will (i) state the number of runs and random-seed protocol in the Abstract and Experimental Results, (ii) add 95% confidence intervals to all plots and tables, (iii) include p-values for the key comparisons against baselines, and (iv) add a sensitivity subsection examining robustness to the exploration parameter, budget-pacing rate, and HTE model hyperparameters. We will also release code and seeds to enable exact reproduction.
Revision: yes
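The simulation loop described in the first response resembles the familiar replay pattern for evaluating an adaptive policy on logged randomized data. A minimal sketch follows, assuming a hypothetical `policy` object with `choose` and `update` methods and logged tuples of (context, logged treatment, outcome). The budget accounting shown here, which charges cost only when a chosen treatment matches the log, is an assumption; the responses above do not say how unmatched steps are charged. `AlwaysTreat` is a trivial placeholder policy, not one of the paper's baselines.

```python
class AlwaysTreat:
    """Placeholder policy: treats whenever the budget allows (illustration only)."""

    def __init__(self):
        self.updates = 0

    def choose(self, context, can_afford: bool) -> bool:
        return can_afford

    def update(self, context, treated: bool, outcome) -> None:
        self.updates += 1  # a real policy would refit its uplift estimates here


def replay(logs, policy, budget: float, cost_per_treatment: float) -> dict:
    """Replay logged RCT tuples, revealing an outcome only on a treatment match."""
    spent = 0.0
    matched = 0
    treated_conversions = 0
    for context, logged_treatment, outcome in logs:
        can_afford = spent + cost_per_treatment <= budget
        choice = bool(policy.choose(context, can_afford)) and can_afford
        if choice != logged_treatment:
            continue  # the counterfactual outcome is never read
        matched += 1
        policy.update(context, choice, outcome)
        if choice:
            spent += cost_per_treatment
            treated_conversions += int(outcome)
    return {"matched": matched, "spent": spent,
            "treated_conversions": treated_conversions}
```

Because unmatched users are skipped, the number of effective steps depends on how often the policy agrees with the original randomization, which is one of the selection effects the authors say they will discuss.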
Circularity Check
No circularity: empirical claims rest on external dataset comparisons, not derivations or self-referential fits.
Full rationale
The paper's central claims (data-efficiency crossover at ~10k samples, 3-5x lower variance) are presented as direct empirical results from running BCCB and baselines on the public Criteo Uplift RCT dataset. No equations, derivations, or internal predictions are described; the method is evaluated against external baselines without any fitted parameters being renamed as predictions or self-citations serving as load-bearing uniqueness theorems. The derivation chain is therefore self-contained against the external benchmark, with no reduction of outputs to inputs by construction.