Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
Pith reviewed 2026-05-17 22:49 UTC · model grok-4.3
The pith
Malicious nodes can poison decentralized GRPO models, reaching attack success rates of up to 100 percent in as few as 50 iterations by sharing poisoned completions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decentralized GRPO is vulnerable because nodes exchange raw completion strings that directly influence the policy updates; by supplying completions that favor incorrect or malicious outputs, an adversary can steer the model entirely within a small number of iterations.
What carries the argument
Shared completion strings that are accepted without verification and used to compute relative advantages in the GRPO reinforcement learning step.
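To make the mechanism concrete, here is a minimal sketch of the group-relative advantage used in GRPO: each completion's reward is normalized by the mean and standard deviation of its group. The reward values, the group composition, and the assumption that the reward function scores the peer's completion highly are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: three locally generated completions plus one
# completion string received from a peer without verification.
local_rewards = [0.0, 1.0, 0.0]
peer_reward = 1.0  # assumption: the reward function accepts the peer's completion

advantages = group_relative_advantages(local_rewards + [peer_reward])
for i, adv in enumerate(advantages):
    print(f"completion {i}: advantage {adv:+.3f}")
```

Under these assumptions the peer's completion receives a positive advantage, so the policy update increases its likelihood regardless of what else the string contains.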
If this is right
- Adversaries can achieve full control over model behavior on targeted tasks like math problem solving or code generation.
- Defenses based on logit probability thresholds or LLM-based judging can filter out poisoned completions effectively.
- Only a denial-of-service attack that forces longer but still correct completions survives the proposed defenses.
- The need for verification becomes essential for any decentralized sharing of training data in RL setups.
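One way a logit-probability defense could look is sketched below: a node scores each received completion by its mean per-token log-probability under the local model and discards completions that fall below a threshold. This is an illustrative sketch, not the paper's exact criterion; the function names, the threshold value, and the log-probability numbers are all assumptions.

```python
def mean_token_logprob(token_logprobs):
    """Average per-token log-probability of one completion."""
    if not token_logprobs:
        raise ValueError("completion has no tokens")
    return sum(token_logprobs) / len(token_logprobs)

def filter_completions(scored_completions, threshold=-2.0):
    """Keep completions whose mean log-prob under the local model
    is at or above the (hypothetical) threshold."""
    kept = []
    for text, token_logprobs in scored_completions:
        if mean_token_logprob(token_logprobs) >= threshold:
            kept.append(text)
    return kept

# Hypothetical per-token log-probs, as if computed by the local model.
received = [
    ("benign completion", [-0.2, -0.5, -0.1]),
    ("poisoned completion", [-4.0, -6.5, -3.2]),
]
kept = filter_completions(received)
print(kept)
```

A fixed threshold of this kind would trade missed attacks against rejected benign completions, which is why a false-positive measurement on benign traffic matters.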
Where Pith is reading between the lines
- Similar poisoning risks likely exist in other decentralized RL algorithms that share raw outputs rather than aggregated statistics.
- Implementing cryptographic verification or reputation systems could prevent such attacks in future decentralized training networks.
- Testing the defenses on larger models and more diverse tasks would reveal their scalability limits.
Load-bearing premise
Nodes in the decentralized system accept and incorporate any shared completion strings from other participants without checking their content or quality.
What would settle it
Observe whether an adversary injecting specific poisoned completions into the shared pool causes the GRPO-trained model to output the attacker-desired answers on math and coding benchmarks at a 100 percent rate after 50 training iterations.
Original abstract
Group Relative Policy Optimization (GRPO) has demonstrated wide adoption in the post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and preferred behaviour is learnt via reinforcement learning. Owing to the small communication volume, GRPO is inherently suitable for decentralised training as the prompts can be concurrently answered by multiple nodes and these completions are exchanged in the form of strings. In this work, we explore the robustness of decentralised GRPO by presenting the first adversarial attacks and countermeasures. We present a diverse set of attacks where malicious nodes poison benign models by sharing their poisoned completions. We demonstrate these attacks on math and coding tasks and show that an adversary can achieve attack success rates of up to 100% in as few as 50 iterations. Moreover, to mitigate the attacks, we propose two defense mechanisms that check logit probabilities of completions or utilize an LLM judge to filter completions. The defenses prevent all but the DoS attack that causes unnecessarily lengthy but conceptually correct completions. The code of both attacks and defenses can be found at: https://github.com/gensyn-ai/HTTT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first study of adversarial attacks and defenses for decentralized Group Relative Policy Optimization (GRPO). Malicious nodes share poisoned completion strings to compromise benign models during post-training on math and coding tasks; the authors report that an adversary can reach attack success rates of 100% in as few as 50 iterations. Two defenses are proposed—one based on logit-probability filtering and one using an LLM judge—to detect and discard malicious completions; these are shown to block all attacks except a denial-of-service variant that produces unnecessarily long but conceptually correct outputs. Public code is provided.
Significance. If the reported results hold under fuller experimental controls, the work is significant because it identifies a concrete security vulnerability in an increasingly popular decentralized RL post-training paradigm and supplies practical, immediately usable countermeasures. The public release of attack and defense implementations is a clear strength that enables direct reproduction and extension.
major comments (2)
- Abstract and experimental section: the central claim of 100% attack success in 50 iterations is presented without error bars, standard deviations across runs, or an explicit statement of the number of independent trials and node configurations, which weakens confidence in the quantitative result even though the qualitative demonstration appears direct.
- Defense evaluation: while the logit and LLM-judge filters are shown to eliminate the primary poisoning attacks, the manuscript does not quantify the computational overhead or false-positive rate of these filters on benign traffic, leaving open whether the defenses remain practical at scale.
minor comments (2)
- The abstract would benefit from naming the concrete benchmarks (e.g., GSM8K, HumanEval) used for the math and coding tasks.
- A short related-work paragraph situating GRPO relative to other decentralized RL methods (e.g., federated PPO variants) would help readers place the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will incorporate revisions to strengthen the quantitative rigor and practicality analysis in the manuscript.
- Referee: Abstract and experimental section: the central claim of 100% attack success in 50 iterations is presented without error bars, standard deviations across runs, or an explicit statement of the number of independent trials and node configurations, which weakens confidence in the quantitative result even though the qualitative demonstration appears direct.
Authors: We agree that additional statistical details would improve confidence in the reported attack success rates. In the revised manuscript, we will explicitly state that results are averaged over 5 independent trials with different random seeds and node configurations (e.g., 1-3 malicious nodes out of 8 total). We will add standard deviations and error bars to the relevant figures and tables in the experimental section, and update the abstract to note the multi-run nature of the 100% success observation where space permits. revision: yes
- Referee: Defense evaluation: while the logit and LLM-judge filters are shown to eliminate the primary poisoning attacks, the manuscript does not quantify the computational overhead or false-positive rate of these filters on benign traffic, leaving open whether the defenses remain practical at scale.
Authors: We acknowledge this gap in the current evaluation. The revised version will include a new analysis subsection quantifying overhead: the logit filter adds negligible cost (reusing existing forward passes), while the LLM judge increases per-completion latency by a measured factor that we will report. We will also evaluate and report false-positive rates on held-out benign completions from the math and coding datasets, which preliminary checks indicate remain low. These additions will directly address scalability concerns. revision: yes
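The statistics requested in both responses are simple to compute. The sketch below shows one way to report the mean and sample standard deviation of attack success rate across independent trials, and a false-positive rate on benign completions. All numbers are placeholders for illustration, not results from the paper, and the function names are hypothetical.

```python
from statistics import mean, stdev

def summarize_trials(rates):
    """Return (mean, sample standard deviation) across independent trials."""
    return mean(rates), stdev(rates)

def false_positive_rate(benign_rejected):
    """Fraction of benign completions that the filter wrongly rejected."""
    return sum(benign_rejected) / len(benign_rejected)

# Placeholder attack success rates from 5 hypothetical seeds.
asr_per_trial = [1.0, 1.0, 0.98, 1.0, 0.99]
asr_mean, asr_std = summarize_trials(asr_per_trial)
print(f"attack success rate: {asr_mean:.3f} +/- {asr_std:.3f}")

# Placeholder filter decisions on 10 benign completions (True = rejected).
benign_rejected = [False] * 9 + [True]
fpr = false_positive_rate(benign_rejected)
print(f"false-positive rate: {fpr:.2f}")
```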
Circularity Check
No significant circularity
The paper is an empirical exploration of attacks and defenses in decentralized GRPO, with claims resting on direct simulation of poisoning via shared completions on math/coding tasks and explicit measurement of success rates up to 100% in 50 iterations. No derivation chain, equations, or predictions appear; results are obtained by running the described attacks and defenses on the same data without fitted parameters renamed as outputs or self-referential definitions. Public code supports independent reproduction, and the central premise (nodes accept unverified completions) is stated explicitly rather than smuggled in via citation or ansatz. The work is self-contained against external benchmarks.
Forward citations
Cited by 1 Pith paper
- F-TIS: Harnessing Diverse Models in Collaborative GRPO
F-TIS enables heterogeneous model collaboration in GRPO by filtering off-policy samples, matching on-policy convergence while improving out-of-distribution performance by up to 12% in some setups.