Practical Adversarial Attacks on Stochastic Bandits via Fake Data Injection
Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3
The pith
Stochastic bandits can be forced to select a target arm nearly always through limited injections of bounded fake feedback at sublinear cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the practical Fake Data Injection model with constraints on the number, magnitude, and timing of fake samples, effective attack strategies exist that mislead a class of stochastic bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost.
What carries the argument
The Fake Data Injection model, which permits the attacker to add a limited number of bounded fake feedback samples to the learner's history under magnitude and temporal constraints.
If this is right
- Bandit algorithms can be induced to select the target arm in nearly every round.
- The total attack cost in terms of injected samples grows sublinearly.
- The approach applies to a broad class of stochastic bandit algorithms.
- Validation comes from experiments on both synthetic data and real-world datasets.
Where Pith is reading between the lines
- Practical bandit deployments may require additional safeguards to filter suspicious feedback data.
- Similar constrained attack models could be studied in related areas like contextual bandits or reinforcement learning.
- Testing the robustness of specific bandit variants against these injections would be a direct follow-up experiment.
Load-bearing premise
The attacker can inject only a limited number of bounded fake feedback samples into the learner's history, with explicit magnitude and temporal constraints.
What would settle it
A demonstration that under the given injection limits and bounds, the bandit algorithm selects the target arm in only a small fraction of rounds or that the required injections scale linearly rather than sublinearly.
Figures
read the original abstract
Adversarial attacks on stochastic bandits have traditionally relied on some unrealistic assumptions, such as per-round reward manipulation and unbounded perturbations, limiting their relevance to real-world systems. We propose a more practical threat model, Fake Data Injection, which reflects realistic adversarial constraints: the attacker can inject only a limited number of bounded fake feedback samples into the learner's history, simulating legitimate interactions. We design effective attack strategies under this model, explicitly addressing both magnitude constraints (on reward values) and temporal constraints (on when and how often data can be injected). Our theoretical analysis shows that these attacks can mislead a class of bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost. Experiments on synthetic and real-world datasets validate the effectiveness of our strategies, revealing vulnerabilities in stochastic bandit algorithms under practical adversarial scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a Fake Data Injection threat model for stochastic bandits in which an attacker may insert only a limited number of bounded fake reward samples into the learner's history while respecting explicit magnitude and temporal constraints. It designs concrete attack policies under these restrictions and claims, via theoretical analysis, that the policies can force a class of bandit algorithms (including UCB-style methods) to select a designated target arm in a (1-o(1)) fraction of rounds while incurring only sublinear total attack cost. The claims are supported by experiments on both synthetic data and real-world datasets.
Significance. If the central theoretical guarantee holds, the work is significant because it replaces the unrealistic per-round unbounded manipulation assumptions common in prior adversarial-bandit literature with a more plausible limited-injection model. Demonstrating that sublinear bounded injections suffice to achieve near-constant selection of a suboptimal arm would highlight concrete vulnerabilities in deployed stochastic bandit systems and motivate the design of more robust algorithms.
major comments (1)
- [§4] §4 (Theoretical Analysis): the claim that sublinear bounded injections suffice to keep the target arm's index strictly above the others for all but o(T) rounds must be shown to overcome the 1/sqrt(t) concentration of empirical means and shrinking confidence intervals in standard UCB-style algorithms. The current argument does not explicitly derive how a vanishing fraction of fake samples produces a persistent bias that dominates any fixed gap Delta > 0 for all large t; without this step the sublinear-cost guarantee is not yet established.
minor comments (2)
- [Abstract / §1] The abstract and introduction refer to 'a class of bandit algorithms' without immediately naming the precise family (e.g., UCB, Thompson sampling, or both); this should be stated explicitly in the first paragraph of the introduction.
- [§5] Figure captions and axis labels in the experimental section should include the precise values of the magnitude bound B and the injection-rate parameter used in each plot.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The major comment on the theoretical analysis raises a valid point about the need for greater explicitness in the derivation. We address it in detail below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [§4] §4 (Theoretical Analysis): the claim that sublinear bounded injections suffice to keep the target arm's index strictly above the others for all but o(T) rounds must be shown to overcome the 1/sqrt(t) concentration of empirical means and shrinking confidence intervals in standard UCB-style algorithms. The current argument does not explicitly derive how a vanishing fraction of fake samples produces a persistent bias that dominates any fixed gap Delta > 0 for all large t; without this step the sublinear-cost guarantee is not yet established.
Authors: We agree that the current presentation in Section 4 would benefit from a more explicit step-by-step derivation of how the cumulative effect of o(T) bounded injections creates a persistent bias sufficient to dominate the 1/sqrt(t) concentration terms and shrinking confidence intervals for any fixed Delta > 0. In the existing argument, the attack policy injects a total of Theta(sqrt(T) log T) fake samples (still o(T)) at carefully chosen times, which shifts the empirical mean of the target arm by more than Delta/2 while the number of genuine samples for competing arms grows linearly. We will add a dedicated lemma that bounds the deviation probability uniformly over t and shows that, after a finite time t0, the biased UCB index of the target arm exceeds those of the others with probability 1-o(1). This revision will make the domination of the vanishing attack fraction over the concentration bounds fully rigorous. We will incorporate this expanded derivation in the revised manuscript. revision: yes
Circularity Check
No significant circularity; derivation targets independent concentration bounds
full rationale
The paper's central theoretical claim derives attack success (near-constant target-arm selection with sublinear cost) from the interaction of bounded fake injections with standard UCB-style concentration and regret analysis. No equations reduce a prediction to a fitted parameter by construction, no self-citations are invoked as uniqueness theorems, and no ansatz is smuggled via prior work. The analysis explicitly respects the magnitude and temporal constraints stated in the threat model and shows how o(T) injections can still produce persistent bias only through explicit scheduling arguments that remain falsifiable against 1/sqrt(T) rates. This is the most common honest non-finding for theoretical bandit papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Bandit algorithms update their estimates based on observed rewards including injected ones, following standard regret-minimizing behavior.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our theoretical analysis shows that these attacks can mislead a class of bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lemma 4.1 (Exponential Suppression of Non-Target Arms)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A contextual-bandit approach to personalized news article recommendation
Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010
work page 2010
-
[2]
Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016
work page 2016
-
[3]
Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges.Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015
work page 2015
-
[4]
Contextual combinatorial cascading bandits
Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. InInternational conference on machine learning, pages 1245–1253. PMLR, 2016
work page 2016
-
[5]
Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018
work page 2018
-
[6]
Data poisoning attacks on stochastic bandits
Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019
work page 2019
-
[7]
Jinhang Zuo, Zhiyao Zhang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, and Adam Wierman. Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023
work page 2023
-
[8]
Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses
Shiliang Zuo. Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses. InInternational Conference on Artificial Intelligence and Statistics, pages 2098–2106. PMLR, 2024
work page 2098
-
[9]
Adversarial attacks on linear contextual bandits
Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33:14362–14373, 2020
work page 2020
-
[10]
Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023
Yuzhe Ma and Zhijin Zhou. Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023
-
[11]
Huazheng Wang, Haifeng Xu, and Hongning Wang. When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273. PMLR, 2022
work page 2022
-
[12]
Observation-free attacks on stochastic bandits
Yinglun Xu, Bhuvesh Kumar, and Jacob D Abernethy. Observation-free attacks on stochastic bandits. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 22550–22561. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/...
work page 2021
-
[13]
Zichen Wang, Rishab Balasubramanian, Hui Yuan, Mengdi Wang, Huazheng Wang, et al. Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024
work page 2024
-
[14]
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems, 2012. URLhttps://arxiv.org/abs/1204.5721
work page internal anchor Pith review Pith/arXiv arXiv 2012
-
[15]
Near-optimal regret bounds for thompson sampling.J
Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for thompson sampling.J. ACM, 64(5), September 2017. ISSN 0004-5411. doi: 10.1145/3088510. URL https://doi.org/ 10.1145/3088510
-
[16]
F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015. doi: 10.1145/2827872. URLhttps://doi.org/10.1145/2827872. 11 Appendix A Proofs A.1 Concentration Results Suppose that the reward distributions of arms are σ2-sub-Gaussian. The following concentra...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.