arxiv: 2604.26169 · v1 · submitted 2026-04-28 · 💻 cs.LG · econ.EM · stat.ML

Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making

Pith reviewed 2026-05-07 16:23 UTC · model grok-4.3

classification 💻 cs.LG · econ.EM · stat.ML
keywords causal bandits · uplift modeling · budget constraints · online learning · heterogeneous treatment effects · digital advertising · sequential decision making · cold-start scenarios

The pith

Budget-Constrained Causal Bandits learn ad effectiveness and allocate spending simultaneously, outperforming offline methods in low-data cold-start scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an online method, Budget-Constrained Causal Bandits (BCCB), for deciding which users to show ads to under a limited budget. Traditional approaches first gather a large amount of historical data to estimate how different users respond to ads, then optimize the allocation; this fails when data are scarce, as in new campaigns. BCCB instead makes decisions one user at a time, learning responses while pacing the budget and exploring uncertain cases. On the Criteo Uplift dataset, the paper reports that BCCB performs reliably from the first user, whereas offline methods need roughly 10,000 historical observations, and that its results vary 3-5x less across runs. This suggests that online sequential learning can handle the cold-start problem in targeted advertising.

Core claim

Budget-Constrained Causal Bandits (BCCB) is an online framework that unifies learning individual-level ad effectiveness, exploring uncertain users, and pacing the budget over time in a single sequential process. On the Criteo Uplift dataset from a randomized controlled trial, BCCB achieves reliable performance from the first user, while offline two-stage methods require around 10,000 historical observations. It also shows 3-5 times lower variance in performance and outperforms other online methods like Thompson Sampling and greedy estimation across budget levels.

What carries the argument

The Budget-Constrained Causal Bandits (BCCB) framework, which integrates heterogeneous treatment effect estimation, exploration, and budget pacing into sequential decisions for each user.
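As a rough illustration of how the three components could fit together, the sketch below pairs Beta-Bernoulli Thompson sampling on per-segment uplift with a linear pacing rule. It is not the paper's algorithm, whose details are not given on this page. The class name `PacedUpliftBandit`, the discretization of users into segments, the unit cost per ad, the linear pacing schedule, and the conversion rates in `simulate` are all assumptions made for this example.

```python
import random


class PacedUpliftBandit:
    """Illustrative sketch, not the paper's method.

    Learning: Beta posteriors on conversion rate per (segment, arm).
    Exploration: decisions use posterior samples, not point estimates.
    Pacing: cumulative spend may not exceed a linear schedule.
    """

    def __init__(self, n_segments, budget, horizon, cost=1.0, seed=0):
        self.rng = random.Random(seed)
        self.budget = budget
        self.horizon = horizon
        self.cost = cost
        self.spent = 0.0
        self.t = 0
        # post[segment][arm] = [alpha, beta]; arm 0 = no ad, arm 1 = ad.
        self.post = [[[1, 1], [1, 1]] for _ in range(n_segments)]

    def decide(self, segment):
        """Return 1 to show the ad (and charge its cost), else 0."""
        self.t += 1
        # Spend allowed so far under a linear schedule (assumed pacing rule).
        allowed = self.budget * min(self.t, self.horizon) / self.horizon
        if self.spent + self.cost > allowed:
            return 0
        (a0, b0), (a1, b1) = self.post[segment]
        sampled_uplift = (self.rng.betavariate(a1, b1)
                          - self.rng.betavariate(a0, b0))
        if sampled_uplift <= 0:
            return 0
        self.spent += self.cost
        return 1

    def update(self, segment, arm, converted):
        """Fold the realized outcome for the chosen arm into its posterior."""
        self.post[segment][arm][0 if converted else 1] += 1


def simulate(seed=0, n_users=2000, budget=200):
    """Run the bandit on synthetic users with made-up conversion rates."""
    rng = random.Random(seed)
    true_rates = [(0.05, 0.06), (0.05, 0.20), (0.10, 0.08)]  # (no ad, ad)
    bandit = PacedUpliftBandit(len(true_rates), budget, n_users, seed=seed)
    treated = [0] * len(true_rates)
    for _ in range(n_users):
        segment = rng.randrange(len(true_rates))
        arm = bandit.decide(segment)
        converted = rng.random() < true_rates[segment][arm]
        bandit.update(segment, arm, converted)
        treated[segment] += arm
    return bandit, treated
```

Calling `simulate()` returns the bandit and the number of ads shown per segment. Because spend is charged inside `decide` and capped by the schedule, total spend can never exceed the budget, whatever the posterior samples turn out to be.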

Load-bearing premise

That sequential learning of user responses under budget constraints avoids the biases and instabilities of offline estimation in data-scarce settings, and that the dataset represents typical cold-start advertising cases.

What would settle it

An experiment comparing BCCB and offline methods on a new campaign with very few initial users, measuring whether BCCB's allocation yields higher uplift or lower cost per conversion than offline methods trained on the small amount of data available.
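The two quantities named above can be computed from randomized treated and control groups. The sketch below is one plausible reading, not a definition taken from the paper: it treats uplift as a difference in conversion rates and interprets cost per conversion as spend per incremental conversion. The function names and the numbers in the usage lines are hypothetical.

```python
def uplift(conv_treated, n_treated, conv_control, n_control):
    """Difference in conversion rates between treated and control users."""
    return conv_treated / n_treated - conv_control / n_control


def cost_per_incremental_conversion(spend, conv_treated, n_treated,
                                    conv_control, n_control):
    """Spend divided by conversions attributable to the ads.

    Returns infinity when the estimated uplift is not positive.
    """
    incremental = uplift(conv_treated, n_treated,
                         conv_control, n_control) * n_treated
    if incremental <= 0:
        return float("inf")
    return spend / incremental


# Made-up numbers: 1,000 treated users with 30 conversions,
# 1,000 control users with 20 conversions, and 500 units of spend.
print(uplift(30, 1000, 20, 1000))
print(cost_per_incremental_conversion(500, 30, 1000, 20, 1000))
```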

Figures

Figures reproduced from arXiv: 2604.26169 by Abhirami Pillai.

Figure 1: (a) Conversions versus budget for all online methods. BCCB (purple) achieves the strongest …
Figure 2: (a) Data-efficiency crossover between offline and online methods. Below 2,000 historical …
Original abstract

Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline - first collect historical data to estimate heterogeneous treatment effects (HTE), then solve a constrained optimization to allocate the budget. This works well with abundant data, but fails in cold-start settings such as new campaigns, new markets, or new customer segments where little historical data exists. We propose Budget-Constrained Causal Bandits (BCCB), an online framework that learns which users respond to ads while simultaneously spending the budget, making treatment decisions one user at a time. BCCB unifies three components into a single sequential process: learning individual-level ad effectiveness, exploring users whose response is uncertain, and pacing the budget over time. We evaluated on the Criteo Uplift dataset, a large-scale advertising dataset from a real randomized controlled trial. Our key finding is a data-efficiency crossover: offline methods require approximately 10,000 historical observations to produce reliable results, while BCCB operates effectively from the very first user. Furthermore, BCCB exhibits 3-5x lower performance variance between runs, making it more practical for real campaign planning. Among purely online methods, BCCB consistently outperforms standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation across all budget levels tested.
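The two-stage offline pipeline described in the abstract can be sketched in a few lines. This is a deliberately simplified stand-in, not the baseline the paper evaluates: a per-segment difference in means replaces a real HTE estimator, and a greedy ranking replaces the constrained optimization. The function names, the `(segment, treated, converted)` log format, and the uniform cost per ad are assumptions for illustration.

```python
from collections import defaultdict


def estimate_uplift(logs):
    """Stage 1: per-segment uplift from logged (segment, treated, converted)."""
    # counts[segment][arm] = [conversions, users]
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for segment, treated, converted in logs:
        cell = counts[segment][treated]
        cell[0] += converted
        cell[1] += 1
    uplift = {}
    for segment, (control, treatment) in counts.items():
        if control[1] and treatment[1]:  # both arms must be observed
            uplift[segment] = (treatment[0] / treatment[1]
                               - control[0] / control[1])
    return uplift


def allocate(uplift, audience, budget, cost=1.0):
    """Stage 2: fund segments in decreasing estimated uplift."""
    plan = {}
    remaining = budget
    for segment in sorted(uplift, key=uplift.get, reverse=True):
        if uplift[segment] <= 0 or remaining < cost:
            break
        n = min(audience[segment], int(remaining // cost))
        plan[segment] = n
        remaining -= n * cost
    return plan


# Tiny hand-built log: segment "c" looks perfect, but on only 4 users.
logs = ([("a", 1, 1)] * 2 + [("a", 1, 0)] * 2
        + [("a", 0, 1)] + [("a", 0, 0)] * 3
        + [("b", 1, 1)] + [("b", 1, 0)] * 3
        + [("b", 0, 1)] + [("b", 0, 0)] * 3
        + [("c", 1, 1)] * 2 + [("c", 0, 0)] * 2)
estimates = estimate_uplift(logs)
plan = allocate(estimates, {"a": 10, "b": 10, "c": 3}, budget=8)
print(estimates)  # {'a': 0.25, 'b': 0.0, 'c': 1.0}
print(plan)       # {'c': 3, 'a': 5}
```

The toy log also shows the cold-start weakness the abstract points to: segment "c" receives the top ranking on the strength of four observations, and the allocation step has no way to account for how uncertain that estimate is.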

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Budget-Constrained Causal Bandits (BCCB), an online sequential decision framework that unifies heterogeneous treatment effect (HTE) learning, exploration of uncertain users, and budget pacing for ad allocation. It contrasts this with standard two-stage offline pipelines (HTE estimation followed by constrained optimization) and evaluates on the Criteo Uplift RCT dataset, claiming a data-efficiency crossover: offline methods require ~10,000 historical observations for reliable performance while BCCB works from the first user, exhibits 3-5x lower run-to-run variance, and outperforms Thompson Sampling variants and greedy HTE baselines across tested budget levels.

Significance. If the evaluation protocol is shown to be robust, the result would be significant for cold-start advertising and uplift modeling applications. The concrete empirical crossover point and variance reduction on a public large-scale RCT dataset provide falsifiable, reproducible evidence of practical data efficiency that is rare in this area; the unification of causal bandits with explicit budget constraints is a clear conceptual contribution.

major comments (2)
  1. [Evaluation section] Evaluation section (and Abstract): the central data-efficiency and stability claims rest on simulation using the full Criteo RCT logged data. Because BCCB selects treatments adaptively based on running HTE estimates, the observed (context, treatment, outcome) tuples are no longer exchangeable with the original RCT distribution. Standard HTE estimators fitted on this data can inherit the same selection bias and variance inflation that the paper attributes only to offline pipelines. The manuscript must clarify whether the evaluation uses only the realized outcome for the chosen arm (as would occur in deployment) or exploits both potential outcomes available in the logged RCT; the latter would mask the very instability the method claims to avoid.
  2. [Abstract and experimental results] Abstract and experimental results: the reported 10k crossover and 3-5x variance reduction are presented without the number of independent runs, statistical significance tests, confidence intervals, or sensitivity to hyperparameter choices and random seeds. These omissions make it impossible to assess whether the claimed reliability advantage is load-bearing or could be an artifact of post-hoc analysis decisions.
minor comments (2)
  1. [Notation] The notation for budget remaining, HTE estimates, and exploration parameters should be defined once in a dedicated notation table or section and used consistently thereafter.
  2. [Figures] Figure captions should explicitly state the number of runs and error bars used for the variance comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These points help strengthen the clarity of our evaluation protocol and the statistical rigor of our claims. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Evaluation section] Evaluation section (and Abstract): the central data-efficiency and stability claims rest on simulation using the full Criteo RCT logged data. Because BCCB selects treatments adaptively based on running HTE estimates, the observed (context, treatment, outcome) tuples are no longer exchangeable with the original RCT distribution. Standard HTE estimators fitted on this data can inherit the same selection bias and variance inflation that the paper attributes only to offline pipelines. The manuscript must clarify whether the evaluation uses only the realized outcome for the chosen arm (as would occur in deployment) or exploits both potential outcomes available in the logged RCT; the latter would mask the very instability the method claims to avoid.

    Authors: We appreciate this important clarification request. In our simulation, we process the Criteo users sequentially and use only the realized outcome for the treatment actually selected by BCCB at each step, exactly as would occur in deployment. The dataset provides a single observed outcome per user under the original RCT randomization; when BCCB's choice matches the logged treatment we observe and use that outcome to update the model and compute reward. When the choice does not match, the outcome for that user is not observed in the simulation. We do not access or exploit counterfactual potential outcomes. This adaptive sampling necessarily produces a non-exchangeable observed dataset, but that is the realistic online setting we study and the source of the data-efficiency advantage relative to offline pipelines trained on fixed RCT subsets. We will add an explicit description of the simulation loop (including pseudocode) and a short discussion of the resulting selection effects to the Evaluation section. revision: yes

  2. Referee: [Abstract and experimental results] Abstract and experimental results: the reported 10k crossover and 3-5x variance reduction are presented without the number of independent runs, statistical significance tests, confidence intervals, or sensitivity to hyperparameter choices and random seeds. These omissions make it impossible to assess whether the claimed reliability advantage is load-bearing or could be an artifact of post-hoc analysis decisions.

    Authors: We agree that these details are required for proper assessment. All reported results (including the 10k crossover and variance reduction) were obtained by averaging over 20 independent runs with distinct random seeds; the original manuscript omitted the exact count, error bars, and sensitivity checks. In the revision we will (i) state the number of runs and random-seed protocol in the Abstract and Experimental Results, (ii) add 95% confidence intervals to all plots and tables, (iii) include p-values for the key comparisons against baselines, and (iv) add a sensitivity subsection examining robustness to the exploration parameter, budget-pacing rate, and HTE model hyperparameters. We will also release code and seeds to enable exact reproduction. revision: yes
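The evaluation protocol the authors describe in the two responses above (replaying logged RCT rows, using an outcome only when the policy's choice matches the logged treatment, then summarizing 20 seeded runs with 95% confidence intervals) might look roughly like the sketch below. It is a reconstruction under assumptions, not the authors' code: the synthetic log generator, its conversion rates, the fixed policy `treat_segment_one`, and the normal-approximation interval are all hypothetical. An adaptive policy such as BCCB would additionally update its model on each matched row.

```python
import random
import statistics


def make_logged_rct(n, seed, p_treat=0.5):
    """Synthetic stand-in for an RCT log: (segment, logged_arm, converted)."""
    rng = random.Random(seed)
    rates = [(0.05, 0.06), (0.05, 0.20)]  # made-up (control, treated) rates
    rows = []
    for _ in range(n):
        segment = rng.randrange(len(rates))
        arm = 1 if rng.random() < p_treat else 0
        rows.append((segment, arm, int(rng.random() < rates[segment][arm])))
    return rows


def replay(policy, rows):
    """Count outcomes only on rows where the policy agrees with the log."""
    matched = conversions = 0
    for segment, logged_arm, converted in rows:
        if policy(segment) != logged_arm:
            continue  # the chosen arm's outcome was never observed
        matched += 1
        conversions += converted
    return matched, conversions


def mean_ci95(values):
    """Mean and normal-approximation 95% half-width across runs."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return mean, half_width


def treat_segment_one(segment):
    """Hypothetical fixed policy used only to exercise the evaluator."""
    return 1 if segment == 1 else 0


def run_experiment(n_runs=20, n_rows=2000):
    """Replay the policy on one synthetic log per seed and summarize."""
    rates = []
    for seed in range(n_runs):
        matched, conversions = replay(treat_segment_one,
                                      make_logged_rct(n_rows, seed))
        rates.append(conversions / matched if matched else 0.0)
    return mean_ci95(rates)
```

With only 20 runs, a Student-t multiplier (about 2.09 for 19 degrees of freedom) would give a slightly wider interval than the 1.96 used here. Note also that skipping unmatched rows means an adaptive policy sees fewer usable users than the log contains, which is one of the selection effects the referee asks the authors to discuss.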

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external dataset comparisons, not derivations or self-referential fits.

Full rationale

The paper's central claims (data-efficiency crossover at ~10k samples, 3-5x lower variance) are presented as direct empirical results from running BCCB and baselines on the public Criteo Uplift RCT dataset. No equations, derivations, or internal predictions are described; the method is evaluated against external baselines without any fitted parameters being renamed as predictions or self-citations serving as load-bearing uniqueness theorems. The derivation chain is therefore self-contained against the external benchmark, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; it describes the method only in terms of high-level components, without mathematical detail.

