Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization
Pith reviewed 2026-05-16 12:45 UTC · model grok-4.3
The pith
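Ensembling three prefill variants raises jailbreak success rates to as high as 99 percent on the tested open models (22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B), and sockpuppet optimization inside the assistant block further raises prompt-agnostic success on Llama-3.1-8B.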
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An unsophisticated adversary can improve prefill attacks by ensembling three prefill variants and can further strengthen them by optimizing an adversarial suffix placed inside the assistant message block of the chat template. The combined approach produces substantially higher attack success rates than either standard prefilling or earlier optimization-only baselines, and the rolling variant extends the gains to prompt-agnostic settings.
What carries the argument
Sockpuppetting: optimization of an adversarial suffix placed inside the assistant message block of the chat template, which steers the model's continuation while remaining compatible with the model's own formatting.
If this is right
- Open-weight models require explicit defenses against output-prefix injection.
- Placing the adversarial suffix inside the assistant block yields higher success than placing it only in the user prompt.
- Rolling optimization produces prompt-agnostic attacks that remain effective on unseen queries.
- Simple ensembling of prefills is already sufficient to exceed both the standard single prefill and the more expensive optimization-only baseline (GCG) on the tested models.
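A combined ASR over several prefills can be computed by counting a prompt as broken when any variant succeeds. The sketch below assumes that reading of "combined" (the abstract does not spell out the aggregation rule); the function names and the toy outcome table are hypothetical and are not data from the paper.

```python
from typing import Dict, List


def per_variant_asr(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Attack success rate of each variant taken on its own."""
    return {name: sum(flags) / len(flags) for name, flags in results.items()}


def ensemble_asr(results: Dict[str, List[bool]]) -> float:
    """Fraction of prompts on which at least one variant succeeded.

    Assumption: every variant was evaluated on the same prompts, in the
    same order, so position i in each list refers to the same prompt.
    """
    lengths = {len(flags) for flags in results.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all variants need the same non-zero number of prompts")
    any_success = [any(flags) for flags in zip(*results.values())]
    return sum(any_success) / len(any_success)


# Hypothetical judge verdicts for 5 prompts and 3 variants (illustration only).
toy_results = {
    "variant_a": [True, False, False, True, False],
    "variant_b": [False, True, False, True, False],
    "variant_c": [False, False, False, False, True],
}

if __name__ == "__main__":
    print(per_variant_asr(toy_results))
    print(ensemble_asr(toy_results))
```

Because the ensemble takes a union over variants, its ASR can never fall below the best single variant; how much it gains depends on how little the variants' successes overlap.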
Where Pith is reading between the lines
- Safety alignments that focus only on user-prompt content may leave the model exposed once the assistant block can be directly influenced.
- The same low-cost technique could be tested on closed-source APIs that expose partial control over generation prefixes.
- Defenders might counter the attack by sanitizing or rejecting generations that begin with common acceptance phrases, even when those phrases were not supplied by the user.
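The last defensive idea can be illustrated with a very small output check. This is a minimal sketch, not a defense evaluated in the paper: the phrase list and function names are hypothetical, and only the two "Sure, here ..." forms are taken from the abstract.

```python
import re
from typing import Iterable

# Hypothetical phrase list; a real deployment would need a broader, tested set.
ACCEPTANCE_PREFIXES = ("sure, here is", "sure, here's")


def normalize(text: str) -> str:
    """Lowercase, unify apostrophes, and collapse runs of whitespace."""
    text = text.replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def starts_with_acceptance(
    generation: str, prefixes: Iterable[str] = ACCEPTANCE_PREFIXES
) -> bool:
    """Return True when the generation opens with a listed acceptance phrase."""
    head = normalize(generation)
    return any(head.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    print(starts_with_acceptance("Sure, here is how to bake bread"))
    print(starts_with_acceptance("I can't help with that."))
```

A check like this also flags benign answers that happen to open with the same words, so it would more plausibly serve as a trigger for a second-stage review than as a rejection rule by itself.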
Load-bearing premise
The measured gains from prefill ensembling and sockpuppet optimization will continue to appear on new models, new chat templates, and new prompt sets without requiring model-specific retraining or large extra compute.
What would settle it
Applying the same three prefill variants and sockpuppet optimization to a fourth open-weight model or a fresh collection of harmful prompts and observing attack success rates no higher than the standard single prefill baseline.
Original abstract
Prefill attacks are an effective and low-cost jailbreaking method, as they directly insert an acceptance sequence (e.g., "Sure, here is how to...") at the start of an LLM's output and lead the model to continue the response. We make two contributions to this prior work. First, we show that an unsophisticated adversary can improve the well-known prefill attacks by ensembling a small number of prefill variants. Running three easy-to-generate prefills yields a combined attack success rate (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B respectively, an up to 38% improvement over the standard "Sure, here's..." prefill and up to 82% over our reproduction of GCG (Zou et al., 2023). Second, we introduce "sockpuppetting", a hybrid attack that optimizes an adversarial suffix placed inside the "assistant" message block of the chat template, rather than within the user prompt. The rolling variant of this attack, RollingSockpuppetGCG, increases prompt-agnostic ASR by up to 64% over our universal GCG baseline on Llama-3.1-8B. Both findings highlight the need for defences against output-prefix injection in open-weight models. Code: https://gitlab.com/asendotsinski/sockpuppetting
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ensembling three easy-to-generate prefill variants yields combined attack success rates (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B, improving up to 38% over the standard 'Sure, here's...' prefill and up to 82% over a reproduced GCG baseline. It further introduces sockpuppetting, a hybrid attack that optimizes an adversarial suffix inside the assistant message block of the chat template, with its rolling variant (RollingSockpuppetGCG) increasing prompt-agnostic ASR by up to 64% over a universal GCG baseline on Llama-3.1-8B. The work concludes by highlighting the need for defenses against output-prefix injection in open-weight models and releases code.
Significance. If the empirical results hold, the work shows that simple ensembling of prefills and optimization within the assistant block can substantially raise jailbreak success on open-weight models, underscoring vulnerabilities to output-prefix attacks and motivating stronger defenses. The public code release is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- Experimental Evaluation section: The reported ASR percentages (e.g., 22/90/99% combined and the 82% gain over reproduced GCG) are given without error bars, confidence intervals, exact prompt counts, or a full protocol for the GCG reproduction, making it impossible to assess statistical reliability or implementation fidelity of the claimed improvements.
- Sockpuppetting and RollingSockpuppetGCG sections: The optimization places the suffix inside the assistant block, but no analysis is provided on how the approach depends on specific chat-template formatting or safety-head alignment; results are shown only on the three tested models, so the load-bearing claim of improved prompt-agnostic ASR lacks evidence that the same suffixes would remain effective under template or alignment shifts.
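The first comment asks for confidence intervals on the ASR figures. One common choice for a proportion is the Wilson score interval; the sketch below is an assumption about how such an interval could be reported, not the authors' protocol, and the counts in the example are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of prompts judged as successful attacks
    n:         total number of prompts evaluated
    z:         normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical counts: 99 successes out of 100 prompts.
    low, high = wilson_interval(99, 100)
    print(f"ASR = 0.99, 95% Wilson interval = [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near 0% or 100%, which matters for figures such as 99%.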
minor comments (1)
- Abstract: The phrases 'up to 38% improvement' and 'up to 82% over our reproduction of GCG' would be clearer if the specific model and baseline comparison were stated explicitly rather than left as maxima.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and statistical rigor.
Point-by-point responses
- Referee: Experimental Evaluation section: The reported ASR percentages (e.g., 22/90/99% combined and the 82% gain over reproduced GCG) are given without error bars, confidence intervals, exact prompt counts, or a full protocol for the GCG reproduction, making it impossible to assess statistical reliability or implementation fidelity of the claimed improvements.
  Authors: We agree that the current presentation lacks sufficient statistical detail. In the revised manuscript we will report ASR with standard deviations computed over five independent optimization runs using different random seeds, explicitly state that all experiments use the complete set of 100 harmful prompts from the AdvBench dataset, and provide the full GCG reproduction protocol including the number of steps (500), batch size (512), top-k (256), and learning rate schedule. These additions will allow readers to evaluate the reliability of the reported gains.
  Revision: yes
- Referee: Sockpuppetting and RollingSockpuppetGCG sections: The optimization places the suffix inside the assistant block, but no analysis is provided on how the approach depends on specific chat-template formatting or safety-head alignment; results are shown only on the three tested models, so the load-bearing claim of improved prompt-agnostic ASR lacks evidence that the same suffixes would remain effective under template or alignment shifts.
  Authors: We acknowledge that the manuscript does not include explicit ablations on template variations or safety-head modifications. The three evaluated models already employ distinct chat templates and alignment procedures, and the consistent prompt-agnostic gains support the practical utility of assistant-block optimization. In the revision we will add a dedicated paragraph discussing potential template sensitivity and noting that broader generalization testing is left to future work, while preserving the core empirical claim based on the reported results.
  Revision: partial
Circularity Check
No circularity: purely empirical attack evaluation
full rationale
The paper reports measured attack success rates from direct experiments on three models using specific chat templates. It compares ensembled prefills and sockpuppetting variants against reproduced baselines (standard prefill and GCG). No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation. All headline numbers (22/90/99% ASR, +38%, +82%, +64%) are presented as experimental outcomes, not derived quantities that reduce to the inputs by construction. Generalization concerns exist but are not circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of prefill variants
axioms (1)
- domain assumption: Standard chat templates allow insertion of content inside the assistant message block without breaking model inference
invented entities (1)
- sockpuppetting attack: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "sockpuppetting simply inserts the acceptance sequence into memory, as if the model already generated it"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.