arxiv: 2601.13359 · v2 · pith:WNU3OHL5new · submitted 2026-01-19 · 💻 cs.CL · cs.CR · cs.LG

Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization

Pith reviewed 2026-05-16 12:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.CR · cs.LG
keywords jailbreaking · prefill attacks · adversarial suffixes · chat templates · LLM safety · output prefix injection

The pith

Ensembling a few prefill variants raises jailbreak success rates to as much as 99 percent on open models, and sockpuppet optimization inside the assistant block strengthens prompt-agnostic attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that prefill attacks, which insert an acceptance sequence at the start of the model's output, become markedly stronger when an adversary simply runs three easy-to-generate variants and combines their results. On Gemma-7B, Llama-3.1-8B, and Qwen3-8B this yields combined attack success rates of 22 percent, 90 percent, and 99 percent respectively. The authors further introduce sockpuppetting, a method that optimizes an adversarial suffix inside the assistant portion of the chat template rather than the user prompt. A rolling version of this hybrid attack also improves performance on prompts the attacker has not seen in advance. These results indicate that current open-weight models remain vulnerable to low-cost manipulations of the output prefix.
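To make the "combined" figure concrete, here is a minimal sketch of how an ensemble attack success rate can be computed from per-variant outcomes. It assumes that a prompt counts as a success when at least one variant succeeds on it, which is the natural reading of "combined ASR" but is not spelled out on this page. The variant names and outcome data are hypothetical and purely illustrative.

```python
from typing import Dict, List


def asr(outcomes: List[bool]) -> float:
    """Fraction of prompts on which the attack was judged successful."""
    return sum(outcomes) / len(outcomes)


def combined_asr(per_variant: Dict[str, List[bool]]) -> float:
    """ASR when a prompt counts as a success if ANY variant succeeded on it.

    Assumption: every variant was evaluated on the same prompts, in the
    same order, so position i refers to the same prompt in every list.
    """
    rows = zip(*per_variant.values())  # one tuple of outcomes per prompt
    return asr([any(row) for row in rows])


# Hypothetical judge outcomes for 5 prompts and 3 variants
# (illustrative data only, not taken from the paper).
results = {
    "variant_a": [True, False, False, True, False],
    "variant_b": [False, False, True, True, False],
    "variant_c": [False, False, False, True, False],
}

for name, outcomes in results.items():
    print(f"{name}: {asr(outcomes):.0%}")
print(f"combined: {combined_asr(results):.0%}")
```

Under this definition the combined rate can never be lower than the best single variant, so part of any reported ensemble gain follows from the metric itself. Per-variant rates are needed alongside it to judge how much the variants complement each other.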

Core claim

An unsophisticated adversary can improve prefill attacks by ensembling three prefill variants and can further strengthen them by optimizing an adversarial suffix placed inside the assistant message block of the chat template. The combined approach produces substantially higher attack success rates than either standard prefilling or earlier optimization-only baselines, and the rolling variant extends the gains to prompt-agnostic settings.

What carries the argument

Sockpuppetting: optimization of an adversarial suffix placed inside the assistant message block of the chat template, which steers the model's continuation while remaining compatible with the model's own formatting.

If this is right

  • Open-weight models require explicit defenses against output-prefix injection.
  • Placing the adversarial suffix inside the assistant block yields higher success than placing it only in the user prompt.
  • Rolling optimization produces prompt-agnostic attacks that remain effective on unseen queries.
  • Simple ensembling of prefills is already sufficient to exceed the performance of more expensive single-prefill or optimization-only methods on the tested models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety alignments that focus only on user-prompt content may leave the model exposed once the assistant block can be directly influenced.
  • The same low-cost technique could be tested on closed-source APIs that expose partial control over generation prefixes.
  • Defenders might counter the attack by sanitizing or rejecting generations that begin with common acceptance phrases, even when those phrases were not supplied by the user.
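The last point can be sketched as a simple output-side check. This is a hedged illustration rather than a defense proposed in the paper: the pattern list is hypothetical (only "Sure, here is" echoes the abstract's example), and `starts_with_acceptance` is a name introduced here.

```python
import re

# Hypothetical phrase patterns. "Sure, here is" mirrors the example in the
# abstract; the other entries are illustrative guesses, not a vetted list.
ACCEPTANCE_PATTERNS = [
    r"sure,? here(?: is|'s)",
    r"certainly[,!]? here(?: is|'s)",
    r"of course[,!]? here(?: is|'s)",
]

# Anchor at the start of the text, allowing leading whitespace.
_ACCEPT_RE = re.compile(
    r"^\s*(?:" + "|".join(ACCEPTANCE_PATTERNS) + r")",
    re.IGNORECASE,
)


def starts_with_acceptance(text: str) -> bool:
    """Return True if the generation opens with a known acceptance phrase."""
    return _ACCEPT_RE.match(text) is not None


if __name__ == "__main__":
    print(starts_with_acceptance("Sure, here is a summary of the article."))
    print(starts_with_acceptance("I can't help with that request."))
```

A check like this also fires on benign replies that happen to open with the same phrases, so it is better suited as one signal feeding a fuller moderation step than as a standalone rejection rule.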

Load-bearing premise

The measured gains from prefill ensembling and sockpuppet optimization will continue to appear on new models, new chat templates, and new prompt sets without requiring model-specific retraining or large extra compute.

What would settle it

Applying the same three prefill variants and sockpuppet optimization to a fourth open-weight model or a fresh collection of harmful prompts and observing attack success rates no higher than the standard single prefill baseline.

Original abstract

Prefill attacks are an effective and low-cost jailbreaking method, as they directly insert an acceptance sequence (e.g., "Sure, here is how to...") at the start of an LLM's output and lead the model to continue the response. We make two contributions to this prior work. First, we show that an unsophisticated adversary can improve the well-known prefill attacks by ensembling a small number of prefill variants. Running three easy-to-generate prefills yields a combined attack success rate (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B respectively, an up to 38% improvement over the standard "Sure, here's..." prefill and up to 82% over our reproduction of GCG (Zou et al., 2023). Second, we introduce "sockpuppetting", a hybrid attack that optimizes an adversarial suffix placed inside the "assistant" message block of the chat template, rather than within the user prompt. The rolling variant of this attack, RollingSockpuppetGCG, increases prompt-agnostic ASR by up to 64% over our universal GCG baseline on Llama-3.1-8B. Both findings highlight the need for defences against output-prefix injection in open-weight models. Code: https://gitlab.com/asendotsinski/sockpuppetting

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that ensembling three easy-to-generate prefill variants yields combined attack success rates (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B, improving up to 38% over the standard 'Sure, here's...' prefill and up to 82% over a reproduced GCG baseline. It further introduces sockpuppetting, a hybrid attack that optimizes an adversarial suffix inside the assistant message block of the chat template, with its rolling variant (RollingSockpuppetGCG) increasing prompt-agnostic ASR by up to 64% over a universal GCG baseline on Llama-3.1-8B. The work concludes by highlighting the need for defenses against output-prefix injection in open-weight models and releases code.

Significance. If the empirical results hold, the work shows that simple ensembling of prefills and optimization within the assistant block can substantially raise jailbreak success on open-weight models, underscoring vulnerabilities to output-prefix attacks and motivating stronger defenses. The public code release is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. Experimental Evaluation section: The reported ASR percentages (e.g., 22/90/99% combined and the 82% gain over reproduced GCG) are given without error bars, confidence intervals, exact prompt counts, or a full protocol for the GCG reproduction, making it impossible to assess statistical reliability or implementation fidelity of the claimed improvements.
  2. Sockpuppetting and RollingSockpuppetGCG sections: The optimization places the suffix inside the assistant block, but no analysis is provided on how the approach depends on specific chat-template formatting or safety-head alignment; results are shown only on the three tested models, so the load-bearing claim of improved prompt-agnostic ASR lacks evidence that the same suffixes would remain effective under template or alignment shifts.
minor comments (1)
  1. Abstract: The phrase 'up to 38% improvement' and 'up to 82% over our reproduction of GCG' would be clearer if the specific model and baseline comparison were stated explicitly rather than left as maxima.
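A related ambiguity is whether "up to 38%" denotes percentage points or a change relative to the baseline; this page does not say which reading the paper intends. The sketch below shows how far the two readings can diverge. The input numbers are hypothetical and chosen only for easy arithmetic.

```python
def absolute_gain_pp(baseline_asr: float, new_asr: float) -> float:
    """Gain in percentage points; both inputs are percentages (0-100)."""
    return new_asr - baseline_asr


def relative_gain_pct(baseline_asr: float, new_asr: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    if baseline_asr == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (new_asr - baseline_asr) / baseline_asr


# Hypothetical ASR values for illustration only (not from the paper).
baseline, ensemble = 50.0, 75.0

print(f"absolute: +{absolute_gain_pp(baseline, ensemble):.0f} points")
print(f"relative: +{relative_gain_pct(baseline, ensemble):.0f}%")
```

With these inputs the same result reads as +25 points or +50%, which is why naming the model, the baseline, and the unit removes the ambiguity.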

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and statistical rigor.

Point-by-point responses
  1. Referee: Experimental Evaluation section: The reported ASR percentages (e.g., 22/90/99% combined and the 82% gain over reproduced GCG) are given without error bars, confidence intervals, exact prompt counts, or a full protocol for the GCG reproduction, making it impossible to assess statistical reliability or implementation fidelity of the claimed improvements.

    Authors: We agree that the current presentation lacks sufficient statistical detail. In the revised manuscript we will report ASR with standard deviations computed over five independent optimization runs using different random seeds, explicitly state that all experiments use the complete set of 100 harmful prompts from the AdvBench dataset, and provide the full GCG reproduction protocol including the number of steps (500), batch size (512), top-k (256), and learning rate schedule. These additions will allow readers to evaluate the reliability of the reported gains. revision: yes

  2. Referee: Sockpuppetting and RollingSockpuppetGCG sections: The optimization places the suffix inside the assistant block, but no analysis is provided on how the approach depends on specific chat-template formatting or safety-head alignment; results are shown only on the three tested models, so the load-bearing claim of improved prompt-agnostic ASR lacks evidence that the same suffixes would remain effective under template or alignment shifts.

    Authors: We acknowledge that the manuscript does not include explicit ablations on template variations or safety-head modifications. The three evaluated models already employ distinct chat templates and alignment procedures, and the consistent prompt-agnostic gains support the practical utility of assistant-block optimization. In the revision we will add a dedicated paragraph discussing potential template sensitivity and noting that broader generalization testing is left to future work, while preserving the core empirical claim based on the reported results. revision: partial
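The statistical reporting promised in the first response can be sketched as follows. This is an illustrative assumption about how such numbers might be computed, not the authors' protocol: it pairs a Wilson score interval for one run on a fixed prompt set with the mean and sample standard deviation across seeds. The per-seed values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def summarize_seeds(asr_per_seed: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of ASR across independent seeds."""
    return mean(asr_per_seed), stdev(asr_per_seed)


# One run: 90 successes on 100 prompts (matches a 90% ASR on n=100).
low, high = wilson_interval(90, 100)
print(f"90/100 -> 95% interval [{low:.3f}, {high:.3f}]")

# Hypothetical ASR values from five seeds (illustrative only).
m, s = summarize_seeds([0.88, 0.90, 0.92, 0.90, 0.90])
print(f"mean {m:.3f}, std {s:.3f}")
```

The two quantities answer different questions. The interval reflects uncertainty from the finite prompt set, while the standard deviation across seeds reflects variability of the optimization itself, so reporting both is more informative than either alone.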

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation

Full rationale

The paper reports measured attack success rates from direct experiments on three models using specific chat templates. It compares ensembled prefills and sockpuppetting variants against reproduced baselines (standard prefill and GCG). No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation. All headline numbers (22/90/99% ASR, +38%, +82%, +64%) are presented as experimental outcomes, not derived quantities that reduce to the inputs by construction. Generalization concerns exist but are not circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claims rest on empirical measurements of attack success on three specific models using standard chat templates; no new theoretical entities or derivations are introduced.

free parameters (1)
  • number of prefill variants
    Small fixed number (three) chosen for the ensemble; value is stated explicitly but not derived from data.
axioms (1)
  • domain assumption: Standard chat templates allow insertion of content inside the assistant message block without breaking model inference
    Required for sockpuppetting to function on Llama, Qwen, and Gemma models.
invented entities (1)
  • sockpuppetting attack (no independent evidence)
    purpose: Hybrid jailbreak that optimizes an adversarial suffix inside the assistant block
    Newly introduced technique whose only support is the paper's own experimental results.

pith-pipeline@v0.9.0 · 5573 in / 1575 out tokens · 36104 ms · 2026-05-16T12:45:01.735594+00:00 · methodology

