Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

Lydia Yiyu Chen; Nikolay Blagoev; Oğuzhan Ersoy

arxiv: 2511.09780 · v2 · submitted 2025-11-12 · 💻 cs.LG

Pith reviewed 2026-05-17 22:49 UTC · model grok-4.3

keywords: decentralized GRPO · adversarial attacks · poisoned completions · LLM training · defense mechanisms · model poisoning · reinforcement learning robustness

The pith

Malicious nodes can poison decentralized GRPO training by sharing poisoned completions, reaching up to 100 percent attack success in as few as 50 iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the security risks in decentralized Group Relative Policy Optimization (GRPO) for training large language models. It demonstrates that an adversary controlling some nodes can share poisoned completions to corrupt the learning process on math and coding tasks. Attack success reaches up to 100 percent in as few as 50 iterations when no verification is in place. Two defense strategies are introduced: one inspects the logit probabilities of completions and the other employs an LLM as a judge to reject harmful completions. Together they stop all of the attacks except a denial-of-service variant.

Core claim

Decentralized GRPO is vulnerable because nodes exchange raw completion strings that directly influence the policy updates; by supplying completions that favor incorrect or malicious outputs, an adversary can steer the model's behavior within a small number of iterations.

What carries the argument

Shared completion strings that are accepted without verification and used to compute relative advantages in the GRPO reinforcement learning step.
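
The role of shared completions in the advantage computation can be illustrated with a minimal Python sketch. It assumes the commonly used GRPO normalisation, in which each completion's reward is centred on the group mean and scaled by the group standard deviation. The reward values and the `group_relative_advantages` helper are hypothetical and are not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: three local completions plus one shared by a peer.
# Assumption: the peer's completion is poisoned yet still earns a full reward.
local_rewards = [0.0, 1.0, 0.0]
shared_reward = 1.0

advantages = group_relative_advantages(local_rewards + [shared_reward])
poisoned_advantage = advantages[-1]
print(advantages, poisoned_advantage)
```

A positive advantage means the update raises the likelihood of that completion's tokens, which is why an unverified string that scores well can pull the policy toward whatever content it carries.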

If this is right

Adversaries can reach up to 100 percent attack success on targeted tasks such as math problem solving or code generation.
Defenses based on logit probability checks or LLM-based judging can filter out poisoned completions effectively.
Only a denial-of-service attack that forces longer but still correct completions survives the proposed defenses.
Verification becomes essential for any decentralized exchange of completions in RL setups.
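
A logit-probability check of the kind listed above could look roughly like the following sketch. It assumes the receiving node has already scored each shared completion with its own model and holds one log-probability per token. The threshold value, the `is_plausible` and `filter_completions` helpers, and the example numbers are all hypothetical, and the paper's actual criterion may differ.

```python
import math

# Assumed threshold: reject if any token has probability below 1e-4.
MIN_TOKEN_LOGPROB = math.log(1e-4)


def is_plausible(token_logprobs, min_logprob=MIN_TOKEN_LOGPROB):
    """True if every token is at least minimally likely under the local model."""
    return len(token_logprobs) > 0 and all(
        lp >= min_logprob for lp in token_logprobs
    )


def filter_completions(scored_completions, min_logprob=MIN_TOKEN_LOGPROB):
    """Keep only the completion strings whose tokens all pass the check."""
    return [
        text
        for text, logprobs in scored_completions
        if is_plausible(logprobs, min_logprob)
    ]


# Hypothetical (completion, per-token log-probabilities) pairs.
shared = [
    ("benign answer", [-0.2, -1.1, -0.7]),
    ("injected text", [-0.3, -14.0, -0.5]),  # contains one highly unlikely token
]
kept = filter_completions(shared)
print(kept)
```

Under this assumed rule, a completion that is merely longer than necessary but built from likely tokens would pass, which is in line with the denial-of-service variant surviving the defenses.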

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar poisoning risks likely exist in other decentralized RL algorithms that share raw outputs rather than aggregated statistics.
Implementing cryptographic verification or reputation systems could prevent such attacks in future decentralized training networks.
Testing the defenses on larger models and more diverse tasks would reveal their scalability limits.

Load-bearing premise

Nodes in the decentralized system accept and incorporate any shared completion strings from other participants without checking their content or quality.

What would settle it

Observe whether an adversary injecting specific poisoned completions into the shared pool causes the GRPO-trained model to output the attacker-desired answers on math and coding benchmarks at a 100 percent rate after 50 training iterations.

Original abstract

Group Relative Policy Optimization (GRPO) has demonstrated wide adoption in the post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and preferred behaviour is learnt via reinforcement learning. Owing to the small communication volume, GRPO is inherently suitable for decentralised training as the prompts can be concurrently answered by multiple nodes and these completions are exchanged in the form of strings. In this work, we explore the robustness of decentralised GRPO by presenting the first adversarial attacks and countermeasures. We present a diverse set of attacks where malicious nodes poison benign models by sharing their poisoned completions. We demonstrate these attacks on math and coding tasks and show that an adversary can achieve attack success rates of up to 100% in as few as 50 iterations. Moreover, to mitigate the attacks, we propose two defense mechanisms that check logit probabilities of completions or utilize an LLM judge to filter completions. The defenses prevent all but the DoS attack that causes unnecessarily lengthy but conceptually correct completions. The code of both attacks and defenses can be found at: https://github.com/gensyn-ai/HTTT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows that decentralized GRPO can be poisoned through shared completion strings, with attacks hitting high success rates quickly and two practical defenses that mostly hold up.

The main thing to know is that this work flags a concrete risk in decentralized GRPO: malicious nodes can share poisoned completions and derail the training, reaching up to 100% attack success in as few as 50 iterations on math and coding tasks. The authors also test two defenses that filter bad completions and largely stop the damage, except for a DoS case that produces long but valid outputs. Public code is included, which helps with verification.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first study of adversarial attacks and defenses for decentralized Group Relative Policy Optimization (GRPO). Malicious nodes share poisoned completion strings to compromise benign models during post-training on math and coding tasks; the authors report that an adversary can reach attack success rates of up to 100% in as few as 50 iterations. Two defenses are proposed, one based on logit-probability filtering and one using an LLM judge, to detect and discard malicious completions; these are shown to block all attacks except a denial-of-service variant that produces unnecessarily long but conceptually correct outputs. Public code is provided.

Significance. If the reported results hold under fuller experimental controls, the work is significant because it identifies a concrete security vulnerability in an increasingly popular decentralized RL post-training paradigm and supplies practical, immediately usable countermeasures. The public release of attack and defense implementations is a clear strength that enables direct reproduction and extension.

major comments (2)

Abstract and experimental section: the central claim of 100% attack success in 50 iterations is presented without error bars, standard deviations across runs, or an explicit statement of the number of independent trials and node configurations, which weakens confidence in the quantitative result even though the qualitative demonstration appears direct.
Defense evaluation: while the logit and LLM-judge filters are shown to eliminate the primary poisoning attacks, the manuscript does not quantify the computational overhead or false-positive rate of these filters on benign traffic, leaving open whether the defenses remain practical at scale.
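
The statistics requested in the first major comment are cheap to produce. The sketch below assumes one attack-success-rate (ASR) value per independent trial; the `summarize` helper and the numbers passed to it are placeholders for illustration, not results from the paper.

```python
from statistics import mean, stdev


def summarize(asr_per_trial):
    """Return (mean, sample standard deviation) of attack success rates."""
    if not asr_per_trial:
        raise ValueError("need at least one trial")
    spread = stdev(asr_per_trial) if len(asr_per_trial) > 1 else 0.0
    return mean(asr_per_trial), spread


# Placeholder values, one per random seed.
placeholder_trials = [0.9, 1.0, 0.8]
avg, spread = summarize(placeholder_trials)
print(f"ASR = {avg:.2f} +/- {spread:.2f} over {len(placeholder_trials)} trials")
```

Reporting the trial count next to the mean and spread lets a reader judge whether a headline figure reflects a single favourable run.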

minor comments (2)

The abstract would benefit from naming the concrete benchmarks (e.g., GSM8K, HumanEval) used for the math and coding tasks.
A short related-work paragraph situating GRPO relative to other decentralized RL methods (e.g., federated PPO variants) would help readers place the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will incorporate revisions to strengthen the quantitative rigor and practicality analysis in the manuscript.

Referee: Abstract and experimental section: the central claim of 100% attack success in 50 iterations is presented without error bars, standard deviations across runs, or an explicit statement of the number of independent trials and node configurations, which weakens confidence in the quantitative result even though the qualitative demonstration appears direct.

Authors: We agree that additional statistical details would improve confidence in the reported attack success rates. In the revised manuscript, we will explicitly state that results are averaged over 5 independent trials with different random seeds and node configurations (e.g., 1-3 malicious nodes out of 8 total). We will add standard deviations and error bars to the relevant figures and tables in the experimental section, and update the abstract to note the multi-run nature of the 100% success observation where space permits.
revision: yes

Referee: Defense evaluation: while the logit and LLM-judge filters are shown to eliminate the primary poisoning attacks, the manuscript does not quantify the computational overhead or false-positive rate of these filters on benign traffic, leaving open whether the defenses remain practical at scale.

Authors: We acknowledge this gap in the current evaluation. The revised version will include a new analysis subsection quantifying overhead: the logit filter adds negligible cost (reusing existing forward passes), while the LLM judge increases per-completion latency by a measured factor that we will report. We will also evaluate and report false-positive rates on held-out benign completions from the math and coding datasets, which preliminary checks indicate remain low. These additions will directly address scalability concerns.
revision: yes
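
The two quantities promised in this response can be measured with a few lines of Python. The sketch assumes a filter is any callable that returns `True` for an accepted completion; `false_positive_rate`, `mean_latency_seconds`, and the toy length-based filter are hypothetical stand-ins, not the defenses from the paper.

```python
import time


def false_positive_rate(accepts, benign_completions):
    """Fraction of benign completions that the filter wrongly rejects."""
    if not benign_completions:
        raise ValueError("need at least one benign completion")
    rejected = sum(1 for c in benign_completions if not accepts(c))
    return rejected / len(benign_completions)


def mean_latency_seconds(accepts, completions):
    """Average wall-clock time the filter spends per completion."""
    if not completions:
        raise ValueError("need at least one completion")
    start = time.perf_counter()
    for c in completions:
        accepts(c)
    return (time.perf_counter() - start) / len(completions)


def toy_filter(completion):
    """Stand-in filter: accept completions of at most five words."""
    return len(completion.split()) <= 5


# Hypothetical benign completions; the last one is long but valid.
benign = [
    "x = 4",
    "the answer is 12",
    "a long but perfectly valid chain of reasoning steps",
]
fpr = false_positive_rate(toy_filter, benign)
latency = mean_latency_seconds(toy_filter, benign)
print(fpr, latency)
```

Running both measurements on the same held-out benign set keeps the overhead and false-positive figures directly comparable between the logit filter and the LLM judge.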

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical exploration of attacks and defenses in decentralized GRPO, with claims resting on direct simulation of poisoning via shared completions on math and coding tasks and explicit measurement of success rates of up to 100% in as few as 50 iterations. No derivation chain, equations, or predictions appear; results are obtained by running the described attacks and defenses on the same data without fitted parameters renamed as outputs or self-referential definitions. Public code supports independent reproduction, and the central premise (nodes accept unverified completions) is stated explicitly rather than smuggled in via citation or ansatz. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security study with no mathematical derivations, new theoretical constructs, or fitted parameters; the central claims rest on experimental demonstration rather than axioms or invented entities.

pith-pipeline@v0.9.0 · 5506 in / 1082 out tokens · 35191 ms · 2026-05-17T22:49:06.807043+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

F-TIS: Harnessing Diverse Models in Collaborative GRPO
cs.LG · 2026-05 · unverdicted · novelty 6.0

F-TIS enables heterogeneous model collaboration in GRPO by filtering off-policy samples, matching on-policy convergence while improving out-of-distribution performance by up to 12% in some setups.
