pith. sign in

arxiv: 2505.21938 · v3 · submitted 2025-05-28 · 💻 cs.LG · cs.AI· cs.CR

Practical Adversarial Attacks on Stochastic Bandits via Fake Data Injection

Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR
keywords adversarial attacksstochastic banditsfake data injectiondata poisoningmulti-armed banditsonline learningreward manipulation
0
0 comments X

The pith

Stochastic bandits can be forced to select a target arm nearly always through limited injections of bounded fake feedback at sublinear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a realistic adversarial threat model for stochastic bandits in which an attacker injects only a limited number of bounded fake reward samples into the algorithm's interaction history. It proves that this allows forcing a class of bandit algorithms to select a specific target arm in nearly every round, while the total number of such injections grows sublinearly with the horizon. A sympathetic reader would care because this shows that bandit-based systems, used in recommendations and clinical trials, can be manipulated under constraints that match what a real attacker could actually do, unlike earlier attacks that required changing every reward or using huge changes.

Core claim

Under the practical Fake Data Injection model with constraints on the number, magnitude, and timing of fake samples, effective attack strategies exist that mislead a class of stochastic bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost.

What carries the argument

The Fake Data Injection model, which permits the attacker to add a limited number of bounded fake feedback samples to the learner's history under magnitude and temporal constraints.

If this is right

  • Bandit algorithms can be induced to select the target arm in nearly every round.
  • The total attack cost in terms of injected samples grows sublinearly.
  • The approach applies to a broad class of stochastic bandit algorithms.
  • Validation comes from experiments on both synthetic data and real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practical bandit deployments may require additional safeguards to filter suspicious feedback data.
  • Similar constrained attack models could be studied in related areas like contextual bandits or reinforcement learning.
  • Testing the robustness of specific bandit variants against these injections would be a direct follow-up experiment.

Load-bearing premise

The attacker can inject only a limited number of bounded fake feedback samples into the learner's history, with explicit magnitude and temporal constraints.

What would settle it

A demonstration that under the given injection limits and bounds, the bandit algorithm selects the target arm in only a small fraction of rounds or that the required injections scale linearly rather than sublinearly.

Figures

Figures reproduced from arXiv: 2505.21938 by Eric He, Jinhang Zuo, Qirun Zeng, Richard Hoffmann, Xuchuang Wang.

Figure 1
Figure 1. Figure 1: Attack Costs vs δ0 [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 5
Figure 5. Figure 5: Attack Costs vs [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
read the original abstract

Adversarial attacks on stochastic bandits have traditionally relied on some unrealistic assumptions, such as per-round reward manipulation and unbounded perturbations, limiting their relevance to real-world systems. We propose a more practical threat model, Fake Data Injection, which reflects realistic adversarial constraints: the attacker can inject only a limited number of bounded fake feedback samples into the learner's history, simulating legitimate interactions. We design effective attack strategies under this model, explicitly addressing both magnitude constraints (on reward values) and temporal constraints (on when and how often data can be injected). Our theoretical analysis shows that these attacks can mislead a class of bandit algorithms into selecting a target arm in nearly all rounds while incurring only sublinear attack cost. Experiments on synthetic and real-world datasets validate the effectiveness of our strategies, revealing vulnerabilities in stochastic bandit algorithms under practical adversarial scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a Fake Data Injection threat model for stochastic bandits in which an attacker may insert only a limited number of bounded fake reward samples into the learner's history while respecting explicit magnitude and temporal constraints. It designs concrete attack policies under these restrictions and claims, via theoretical analysis, that the policies can force a class of bandit algorithms (including UCB-style methods) to select a designated target arm in a (1-o(1)) fraction of rounds while incurring only sublinear total attack cost. The claims are supported by experiments on both synthetic data and real-world datasets.

Significance. If the central theoretical guarantee holds, the work is significant because it replaces the unrealistic per-round unbounded manipulation assumptions common in prior adversarial-bandit literature with a more plausible limited-injection model. Demonstrating that sublinear bounded injections suffice to achieve near-constant selection of a suboptimal arm would highlight concrete vulnerabilities in deployed stochastic bandit systems and motivate the design of more robust algorithms.

major comments (1)
  1. [§4] §4 (Theoretical Analysis): the claim that sublinear bounded injections suffice to keep the target arm's index strictly above the others for all but o(T) rounds must be shown to overcome the 1/sqrt(t) concentration of empirical means and shrinking confidence intervals in standard UCB-style algorithms. The current argument does not explicitly derive how a vanishing fraction of fake samples produces a persistent bias that dominates any fixed gap Delta > 0 for all large t; without this step the sublinear-cost guarantee is not yet established.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction refer to 'a class of bandit algorithms' without immediately naming the precise family (e.g., UCB, Thompson sampling, or both); this should be stated explicitly in the first paragraph of the introduction.
  2. [§5] Figure captions and axis labels in the experimental section should include the precise values of the magnitude bound B and the injection-rate parameter used in each plot.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The major comment on the theoretical analysis raises a valid point about the need for greater explicitness in the derivation. We address it in detail below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4] §4 (Theoretical Analysis): the claim that sublinear bounded injections suffice to keep the target arm's index strictly above the others for all but o(T) rounds must be shown to overcome the 1/sqrt(t) concentration of empirical means and shrinking confidence intervals in standard UCB-style algorithms. The current argument does not explicitly derive how a vanishing fraction of fake samples produces a persistent bias that dominates any fixed gap Delta > 0 for all large t; without this step the sublinear-cost guarantee is not yet established.

    Authors: We agree that the current presentation in Section 4 would benefit from a more explicit step-by-step derivation of how the cumulative effect of o(T) bounded injections creates a persistent bias sufficient to dominate the 1/sqrt(t) concentration terms and shrinking confidence intervals for any fixed Delta > 0. In the existing argument, the attack policy injects a total of Theta(sqrt(T) log T) fake samples (still o(T)) at carefully chosen times, which shifts the empirical mean of the target arm by more than Delta/2 while the number of genuine samples for competing arms grows linearly. We will add a dedicated lemma that bounds the deviation probability uniformly over t and shows that, after a finite time t0, the biased UCB index of the target arm exceeds those of the others with probability 1-o(1). This revision will make the domination of the vanishing attack fraction over the concentration bounds fully rigorous. We will incorporate this expanded derivation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation targets independent concentration bounds

full rationale

The paper's central theoretical claim derives attack success (near-constant target-arm selection with sublinear cost) from the interaction of bounded fake injections with standard UCB-style concentration and regret analysis. No equations reduce a prediction to a fitted parameter by construction, no self-citations are invoked as uniqueness theorems, and no ansatz is smuggled via prior work. The analysis explicitly respects the magnitude and temporal constraints stated in the threat model and shows how o(T) injections can still produce persistent bias only through explicit scheduling arguments that remain falsifiable against 1/sqrt(T) rates. This is the most common honest non-finding for theoretical bandit papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on standard stochastic bandit update rules and regret analysis from prior literature; no new free parameters or invented entities introduced in the abstract.

axioms (1)
  • domain assumption Bandit algorithms update their estimates based on observed rewards including injected ones, following standard regret-minimizing behavior.
    Central to showing that limited injections can dominate selection.

pith-pipeline@v0.9.0 · 5678 in / 1055 out tokens · 31631 ms · 2026-05-19T12:41:23.887990+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    A contextual-bandit approach to personalized news article recommendation

    Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670, 2010

  2. [2]

    Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016

    Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms.Journal of Machine Learning Research, 17 (50):1–33, 2016

  3. [3]

    Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges.Statistical science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015

  4. [4]

    Contextual combinatorial cascading bandits

    Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. InInternational conference on machine learning, pages 1245–1253. PMLR, 2016

  5. [5]

    Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018

    Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits.Advances in neural information processing systems, 31, 2018

  6. [6]

    Data poisoning attacks on stochastic bandits

    Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. InInternational Conference on Machine Learning, pages 4042–4050. PMLR, 2019

  7. [7]

    Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023

    Jinhang Zuo, Zhiyao Zhang, Zhiyong Wang, Shuai Li, Mohammad Hajiesmaili, and Adam Wierman. Adversarial attacks on online learning to rank with click feedback.Advances in Neural Information Processing Systems, 36:41675–41692, 2023

  8. [8]

    Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses

    Shiliang Zuo. Near optimal adversarial attacks on stochastic bandits and defenses with smoothed responses. InInternational Conference on Artificial Intelligence and Statistics, pages 2098–2106. PMLR, 2024

  9. [9]

    Adversarial attacks on linear contextual bandits

    Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33:14362–14373, 2020

  10. [10]

    Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023

    Yuzhe Ma and Zhijin Zhou. Adversarial attacks on adversarial bandits.arXiv preprint arXiv:2301.12595, 2023

  11. [11]

    When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273

    Huazheng Wang, Haifeng Xu, and Hongning Wang. When are linear stochastic bandits attackable? InInternational Conference on Machine Learning, pages 23254–23273. PMLR, 2022

  12. [12]

    Observation-free attacks on stochastic bandits

    Yinglun Xu, Bhuvesh Kumar, and Jacob D Abernethy. Observation-free attacks on stochastic bandits. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 22550–22561. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/...

  13. [13]

    Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024

    Zichen Wang, Rishab Balasubramanian, Hui Yuan, Mengdi Wang, Huazheng Wang, et al. Adversarial attacks on online learning to rank with stochastic click models.Transactions on Machine Learning Research, 2024

  14. [14]

    Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

    Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems, 2012. URLhttps://arxiv.org/abs/1204.5721

  15. [15]

    Near-optimal regret bounds for thompson sampling.J

    Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for thompson sampling.J. ACM, 64(5), September 2017. ISSN 0004-5411. doi: 10.1145/3088510. URL https://doi.org/ 10.1145/3088510

  16. [16]

    Maxwell and Konstan, Joseph A

    F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015. doi: 10.1145/2827872. URLhttps://doi.org/10.1145/2827872. 11 Appendix A Proofs A.1 Concentration Results Suppose that the reward distributions of arms are σ2-sub-Gaussian. The following concentra...