Practical Adversarial Attacks on Stochastic Bandits via Fake Data Injection

Eric He; Jinhang Zuo; Qirun Zeng; Richard Hoffmann; Xuchuang Wang

arxiv: 2505.21938 · v3 · submitted 2025-05-28 · 💻 cs.LG · cs.AI· cs.CR

Practical Adversarial Attacks on Stochastic Bandits via Fake Data Injection

Qirun Zeng , Eric He , Richard Hoffmann , Xuchuang Wang , Jinhang Zuo This is my paper

Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR

keywords adversarial attacksstochastic banditsfake data injectiondata poisoningmulti-armed banditsonline learningreward manipulation

0 comments

The pith

Stochastic bandits can be forced to select a target arm nearly always through limited injections of bounded fake feedback at sublinear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a realistic adversarial threat model for stochastic bandits in which an attacker injects only a limited number of bounded fake reward samples into the algorithm's interaction history. It proves that this allows forcing a class of bandit algorithms to select a specific target arm in nearly every round, while the total number of such injections grows sublinearly with the horizon. A sympathetic reader would care because this shows that bandit-based systems, used in recommendations and clinical trials, can be manipulated under constraints that match what a real attacker could actually do, unlike earlier attacks that required changing every reward or using huge changes.

Core claim

Under the practical Fake Data Injection model with constraints on the number, magnitude, and timing of fake samples, effective attack strategies exist that mislead a class of stochastic bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost.

What carries the argument

The Fake Data Injection model, which permits the attacker to add a limited number of bounded fake feedback samples to the learner's history under magnitude and temporal constraints.

If this is right

Bandit algorithms can be induced to select the target arm in nearly every round.
The total attack cost in terms of injected samples grows sublinearly.
The approach applies to a broad class of stochastic bandit algorithms.
Validation comes from experiments on both synthetic data and real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practical bandit deployments may require additional safeguards to filter suspicious feedback data.
Similar constrained attack models could be studied in related areas like contextual bandits or reinforcement learning.
Testing the robustness of specific bandit variants against these injections would be a direct follow-up experiment.

Load-bearing premise

The attacker can inject only a limited number of bounded fake feedback samples into the learner's history, with explicit magnitude and temporal constraints.

What would settle it

A demonstration that under the given injection limits and bounds, the bandit algorithm selects the target arm in only a small fraction of rounds or that the required injections scale linearly rather than sublinearly.

Figures

Figures reproduced from arXiv: 2505.21938 by Eric He, Jinhang Zuo, Qirun Zeng, Richard Hoffmann, Xuchuang Wang.

**Figure 5.** Figure 5: Attack Costs vs [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

read the original abstract

Adversarial attacks on stochastic bandits have traditionally relied on some unrealistic assumptions, such as per-round reward manipulation and unbounded perturbations, limiting their relevance to real-world systems. We propose a more practical threat model, Fake Data Injection, which reflects realistic adversarial constraints: the attacker can inject only a limited number of bounded fake feedback samples into the learner's history, simulating legitimate interactions. We design effective attack strategies under this model, explicitly addressing both magnitude constraints (on reward values) and temporal constraints (on when and how often data can be injected). Our theoretical analysis shows that these attacks can mislead a class of bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost. Experiments on synthetic and real-world datasets validate the effectiveness of our strategies, revealing vulnerabilities in stochastic bandit algorithms under practical adversarial scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The constrained fake data injection model is a reasonable move toward realism, but the sublinear cost claim for forcing near-constant selection of a target arm runs into trouble with standard concentration rates.

read the letter

This paper shifts the attack model on stochastic bandits from per-round unbounded manipulation to injecting only a limited number of bounded fake reward samples with explicit timing rules. That change makes the threat model feel more applicable to actual deployed systems where an adversary cannot rewrite every observation on the fly. The authors lay out the magnitude and temporal constraints clearly and then give concrete strategies that respect those limits. Running the attacks on both synthetic data and real-world datasets is a positive step for showing whether the idea has any practical bite. The central theoretical claim is that these limited injections can push a class of bandit algorithms to pick the target arm in nearly all rounds while keeping total attack cost sublinear in the horizon. That part looks fragile. With only o(T) fake samples the fraction of contaminated data vanishes as T grows, and for UCB-style rules the confidence intervals shrink like 1/sqrt(n). A fixed gap between arms should eventually dominate unless the injection schedule produces a bias that survives the tightening bounds for all but o(T) rounds. The manuscript needs to show exactly how that bias is maintained without the number of injections growing faster than the allowed sublinear budget. Readers focused on adversarial robustness in sequential decision making would find the threat model and the experiments useful. The work is coherent enough on its own terms to deserve a serious referee who can check the analysis details.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a Fake Data Injection threat model for stochastic bandits in which an attacker may insert only a limited number of bounded fake reward samples into the learner's history while respecting explicit magnitude and temporal constraints. It designs concrete attack policies under these restrictions and claims, via theoretical analysis, that the policies can force a class of bandit algorithms (including UCB-style methods) to select a designated target arm in a (1-o(1)) fraction of rounds while incurring only sublinear total attack cost. The claims are supported by experiments on both synthetic data and real-world datasets.

Significance. If the central theoretical guarantee holds, the work is significant because it replaces the unrealistic per-round unbounded manipulation assumptions common in prior adversarial-bandit literature with a more plausible limited-injection model. Demonstrating that sublinear bounded injections suffice to achieve near-constant selection of a suboptimal arm would highlight concrete vulnerabilities in deployed stochastic bandit systems and motivate the design of more robust algorithms.

major comments (1)

[§4] §4 (Theoretical Analysis): the claim that sublinear bounded injections suffice to keep the target arm's index strictly above the others for all but o(T) rounds must be shown to overcome the 1/sqrt(t) concentration of empirical means and shrinking confidence intervals in standard UCB-style algorithms. The current argument does not explicitly derive how a vanishing fraction of fake samples produces a persistent bias that dominates any fixed gap Delta > 0 for all large t; without this step the sublinear-cost guarantee is not yet established.

minor comments (2)

[Abstract / §1] The abstract and introduction refer to 'a class of bandit algorithms' without immediately naming the precise family (e.g., UCB, Thompson sampling, or both); this should be stated explicitly in the first paragraph of the introduction.
[§5] Figure captions and axis labels in the experimental section should include the precise values of the magnitude bound B and the injection-rate parameter used in each plot.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The major comment on the theoretical analysis raises a valid point about the need for greater explicitness in the derivation. We address it in detail below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [§4] §4 (Theoretical Analysis): the claim that sublinear bounded injections suffice to keep the target arm's index strictly above the others for all but o(T) rounds must be shown to overcome the 1/sqrt(t) concentration of empirical means and shrinking confidence intervals in standard UCB-style algorithms. The current argument does not explicitly derive how a vanishing fraction of fake samples produces a persistent bias that dominates any fixed gap Delta > 0 for all large t; without this step the sublinear-cost guarantee is not yet established.

Authors: We agree that the current presentation in Section 4 would benefit from a more explicit step-by-step derivation of how the cumulative effect of o(T) bounded injections creates a persistent bias sufficient to dominate the 1/sqrt(t) concentration terms and shrinking confidence intervals for any fixed Delta > 0. In the existing argument, the attack policy injects a total of Theta(sqrt(T) log T) fake samples (still o(T)) at carefully chosen times, which shifts the empirical mean of the target arm by more than Delta/2 while the number of genuine samples for competing arms grows linearly. We will add a dedicated lemma that bounds the deviation probability uniformly over t and shows that, after a finite time t0, the biased UCB index of the target arm exceeds those of the others with probability 1-o(1). This revision will make the domination of the vanishing attack fraction over the concentration bounds fully rigorous. We will incorporate this expanded derivation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation targets independent concentration bounds

full rationale

The paper's central theoretical claim derives attack success (near-constant target-arm selection with sublinear cost) from the interaction of bounded fake injections with standard UCB-style concentration and regret analysis. No equations reduce a prediction to a fitted parameter by construction, no self-citations are invoked as uniqueness theorems, and no ansatz is smuggled via prior work. The analysis explicitly respects the magnitude and temporal constraints stated in the threat model and shows how o(T) injections can still produce persistent bias only through explicit scheduling arguments that remain falsifiable against 1/sqrt(T) rates. This is the most common honest non-finding for theoretical bandit papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard stochastic bandit update rules and regret analysis from prior literature; no new free parameters or invented entities introduced in the abstract.

axioms (1)

domain assumption Bandit algorithms update their estimates based on observed rewards including injected ones, following standard regret-minimizing behavior.
Central to showing that limited injections can dominate selection.

pith-pipeline@v0.9.0 · 5678 in / 1055 out tokens · 31631 ms · 2026-05-19T12:41:23.887990+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our theoretical analysis shows that these attacks can mislead a class of bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 4.1 (Exponential Suppression of Non-Target Arms)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

work page 2010
[2]

Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016

Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016

work page 2016
[3]

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges.Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015

work page 2015
[4]

Contextual combinatorial cascading bandits

Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. InInternational conference on machine learning, pages 1245–1253. PMLR, 2016

work page 2016
[5]

Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018

Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018

work page 2018
[6]

Data poisoning attacks on stochastic bandits

Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019

work page 2019
[7]

Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023

Jinhang Zuo, Zhiyao Zhang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, and Adam Wierman. Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023

work page 2023
[8]

Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses

Shiliang Zuo. Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses. InInternational Conference on Artificial Intelligence and Statistics, pages 2098–2106. PMLR, 2024

work page 2098
[9]

Adversarial attacks on linear contextual bandits

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33:14362–14373, 2020

work page 2020
[10]

Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023

Yuzhe Ma and Zhijin Zhou. Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023

work page arXiv 2023
[11]

When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273

Huazheng Wang, Haifeng Xu, and Hongning Wang. When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273. PMLR, 2022

work page 2022
[12]

Observation-free attacks on stochastic bandits

Yinglun Xu, Bhuvesh Kumar, and Jacob D Abernethy. Observation-free attacks on stochastic bandits. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 22550–22561. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/...

work page 2021
[13]

Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024

Zichen Wang, Rishab Balasubramanian, Hui Yuan, Mengdi Wang, Huazheng Wang, et al. Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024

work page 2024
[14]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems, 2012. URLhttps://arxiv.org/abs/1204.5721

work page internal anchor Pith review Pith/arXiv arXiv 2012
[15]

Near-optimal regret bounds for thompson sampling.J

Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for thompson sampling.J. ACM, 64(5), September 2017. ISSN 0004-5411. doi: 10.1145/3088510. URL https://doi.org/ 10.1145/3088510

work page doi:10.1145/3088510 2017
[16]

Maxwell and Konstan, Joseph A

F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015. doi: 10.1145/2827872. URLhttps://doi.org/10.1145/2827872. 11 Appendix A Proofs A.1 Concentration Results Suppose that the reward distributions of arms are σ2-sub-Gaussian. The following concentra...

work page doi:10.1145/2827872 2015

[1] [1]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

work page 2010

[2] [2]

Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016

Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016

work page 2016

[3] [3]

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges.Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015

work page 2015

[4] [4]

Contextual combinatorial cascading bandits

Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. InInternational conference on machine learning, pages 1245–1253. PMLR, 2016

work page 2016

[5] [5]

Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018

Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018

work page 2018

[6] [6]

Data poisoning attacks on stochastic bandits

Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019

work page 2019

[7] [7]

Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023

Jinhang Zuo, Zhiyao Zhang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, and Adam Wierman. Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023

work page 2023

[8] [8]

Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses

Shiliang Zuo. Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses. InInternational Conference on Artificial Intelligence and Statistics, pages 2098–2106. PMLR, 2024

work page 2098

[9] [9]

Adversarial attacks on linear contextual bandits

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33:14362–14373, 2020

work page 2020

[10] [10]

Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023

Yuzhe Ma and Zhijin Zhou. Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023

work page arXiv 2023

[11] [11]

When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273

Huazheng Wang, Haifeng Xu, and Hongning Wang. When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273. PMLR, 2022

work page 2022

[12] [12]

Observation-free attacks on stochastic bandits

Yinglun Xu, Bhuvesh Kumar, and Jacob D Abernethy. Observation-free attacks on stochastic bandits. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 22550–22561. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/...

work page 2021

[13] [13]

Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024

Zichen Wang, Rishab Balasubramanian, Hui Yuan, Mengdi Wang, Huazheng Wang, et al. Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024

work page 2024

[14] [14]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems, 2012. URLhttps://arxiv.org/abs/1204.5721

work page internal anchor Pith review Pith/arXiv arXiv 2012

[15] [15]

Near-optimal regret bounds for thompson sampling.J

Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for thompson sampling.J. ACM, 64(5), September 2017. ISSN 0004-5411. doi: 10.1145/3088510. URL https://doi.org/ 10.1145/3088510

work page doi:10.1145/3088510 2017

[16] [16]

Maxwell and Konstan, Joseph A

F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015. doi: 10.1145/2827872. URLhttps://doi.org/10.1145/2827872. 11 Appendix A Proofs A.1 Concentration Results Suppose that the reward distributions of arms are σ2-sub-Gaussian. The following concentra...

work page doi:10.1145/2827872 2015