
arxiv: 2603.11331 · v3 · pith:45C3OQ32new · submitted 2026-03-11 · 💻 cs.LG · cs.AI

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3

keywords jailbreak scaling laws · adversarial prompt injection · attack success rate · large language models · spin-glass model · replica symmetry breaking · safety alignment

The pith

Strong adversarial prompt injections turn slow polynomial growth of jailbreak success into exponential growth with more inference samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety-aligned large language models exhibit two distinct scaling regimes for adversarial attacks depending on prompt strength. Without strong injections, attack success rate grows only polynomially with the number of samples drawn at inference time. With longer injected prompts, the same success rate grows exponentially instead. The authors derive both behaviors from a minimal set of assumptions on how safe generations are distributed across contexts, then show that a spin-glass generative model in the replica-symmetry-breaking regime produces exactly those assumptions and the observed crossover.

Core claim

A small set of assumptions on the distribution of safe generation across contexts produces both the slow polynomial scaling of attack success without injection and the exponential scaling that appears once a strong prompt injection is present. These assumptions are realized by a theoretical model in which generations are drawn from the Gibbs measure of a spin-glass system operating in a replica-symmetry-breaking regime, with unsafe outputs corresponding to a subset of low-energy, size-biased clusters. Short injected prompts act as a weak magnetic field aligned toward unsafe cluster centers and produce power-law scaling, while long prompts act as a strong field and produce exponential scaling.

What carries the argument

Spin-glass system in replica-symmetry-breaking regime whose Gibbs measure generates model outputs, with unsafe behavior confined to low-energy size-biased clusters; short versus long injected prompts correspond to weak versus strong magnetic fields.

If this is right

  • Attack success grows polynomially with inference-time samples when no strong injection is present.
  • The same success rate grows exponentially once a sufficiently strong prompt injection is introduced.
  • The spin-glass model with size-biased unsafe clusters reproduces the minimal statistical assumptions needed for both scalings.
  • Empirical trends in current large language models match the analytically derived polynomial-to-exponential crossover.
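The two regimes in the list above can be illustrated numerically. The sketch below is a toy model, not the paper's construction: it assumes each context has a per-sample unsafe probability q, that an attack succeeds if any of k independent samples is unsafe, and that q follows a Beta(a, b) distribution without injection. Injection is modeled, again purely as an assumption, by a floor q0 on q. All function names and parameter values are hypothetical.

```python
import math

def failure_no_injection(k, a, b):
    """E[(1 - q)^k] for q ~ Beta(a, b): chance that all k samples stay safe."""
    log_f = (math.lgamma(a + b) + math.lgamma(b + k)
             - math.lgamma(b) - math.lgamma(a + b + k))
    return math.exp(log_f)

def failure_with_floor(k, q0, a, b):
    """Same expectation when q = q0 + (1 - q0) * q', with q' ~ Beta(a, b).
    Since 1 - q = (1 - q0) * (1 - q'), the expectation factorizes."""
    return (1.0 - q0) ** k * failure_no_injection(k, a, b)

def loglog_slope(f, k):
    """Local slope of log f versus log k, measured between k and 2k."""
    return (math.log(f(2 * k)) - math.log(f(k))) / math.log(2.0)

if __name__ == "__main__":
    a, b, q0 = 0.5, 2.0, 0.05  # illustrative values only, not fitted to data
    weak = lambda n: failure_no_injection(n, a, b)
    strong = lambda n: failure_with_floor(n, q0, a, b)
    for k in (10, 100, 1000):
        print(k, round(loglog_slope(weak, k), 3), round(loglog_slope(strong, k), 3))
```

Under these assumptions the no-injection slope settles near -a, which is a power law in k, while the slope with a floor keeps steepening roughly linearly in k, the signature of exponential decay of the failure probability 1 - ASR.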

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety evaluations that rely on few samples may systematically underestimate risk when strong injections are possible.
  • Increasing inference-time compute without corresponding defenses could widen the gap between safe and unsafe behavior under attack.
  • The magnetic-field analogy suggests that prompt-length thresholds could be used to predict when scaling becomes exponential for new attack methods.

Load-bearing premise

The probability that a random generation is safe follows a distribution across contexts that admits both a slow polynomial regime and an exponential regime once a bias toward unsafe clusters is applied.
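One way to see how a single distributional premise can admit both regimes is the following worked sketch. It is an illustration under assumptions that are not taken from the paper: q denotes the per-sample probability of an unsafe generation in a given context (so 1 - q is the probability that a generation is safe), samples are independent, and an attack on a context succeeds if at least one of k samples is unsafe. The Beta(a, b) family and the floor q_0 are hypothetical choices made for tractability.

```latex
\begin{align*}
\mathrm{ASR}(k) &= 1 - \mathbb{E}_{q}\!\left[(1-q)^{k}\right], \\
q \sim \mathrm{Beta}(a,b):\quad
\mathbb{E}\!\left[(1-q)^{k}\right]
  &= \frac{\Gamma(a+b)\,\Gamma(b+k)}{\Gamma(b)\,\Gamma(a+b+k)}
  \;\sim\; \frac{\Gamma(a+b)}{\Gamma(b)}\,k^{-a}
  \quad (k \to \infty), \\
q \ge q_{0} > 0:\quad
\mathbb{E}\!\left[(1-q)^{k}\right]
  &\le (1-q_{0})^{k} = e^{-k\,\lvert \ln(1-q_{0}) \rvert}.
\end{align*}
```

In this toy setting, probability mass of q near zero (many contexts that are almost never unsafe) gives a power-law approach of ASR to 1, while a bias that lifts every context above q_0 gives an exponential approach. Whether the paper's assumptions take this form is not established by the summary above.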

What would settle it

Measure attack success rate versus number of inference samples for the same model and task using both short and long adversarial prompts. If the claimed mechanism holds, the scaling must cross over from polynomial to exponential as prompt length increases; a polynomial trend that persists under long prompts would count against it.
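A test of this kind needs some rule for deciding which regime a measured curve belongs to. The sketch below shows one simple option, offered as an assumption rather than the paper's procedure: fit a straight line to log(1 - ASR) against log k (power law) and against k (exponential), then keep whichever fit leaves the smaller squared error. The function names are hypothetical, and real measurements would also need noise handling and confidence intervals that this sketch omits.

```python
import math

def linfit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.
    Returns (slope, intercept, sum of squared residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def classify_scaling(ks, failure_rates):
    """Compare a power-law fit (log-log) with an exponential fit (semi-log).
    failure_rates holds 1 - ASR at each sample budget k and must be > 0."""
    if any(f <= 0 for f in failure_rates):
        raise ValueError("failure rates must be strictly positive")
    log_f = [math.log(f) for f in failure_rates]
    power = linfit([math.log(k) for k in ks], log_f)
    expo = linfit([float(k) for k in ks], log_f)
    label = "power-law" if power[2] < expo[2] else "exponential"
    # Return the label plus the fitted exponent and the fitted decay rate.
    return label, power[0], expo[0]

if __name__ == "__main__":
    ks = [1, 2, 4, 8, 16, 32, 64, 128, 256]
    # Synthetic curves for demonstration only, not measured data.
    print(classify_scaling(ks, [0.8 * k ** -0.5 for k in ks]))
    print(classify_scaling(ks, [0.9 * math.exp(-0.05 * k) for k in ks]))
```

Comparing squared errors on the log scale is a deliberately crude criterion; with few sample budgets or noisy rates the two fits can be hard to separate.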


Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We observe qualitatively consistent behavior across a broad range of large language models, spanning parameter scales from 3B to 70B. In particular, the main trends remain stable across multiple attack methods, such as GCG and AutoDAN, as well as across benchmark datasets such as AdvBench and HarmBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that strong adversarial prompt-injection attacks amplify attack success rate (ASR) in safety-aligned LLMs from slow polynomial growth (without injection) to exponential growth with the number of inference-time samples. It identifies a minimal set of statistical assumptions on the distribution of safe generations across contexts under which both scaling regimes follow analytically, and realizes these assumptions via a spin-glass generative model in the replica-symmetry-breaking regime whose Gibbs measure yields size-biased unsafe clusters; short (weak-field) injections produce power-law scaling while long (strong-field) injections produce exponential scaling, with qualitatively matching trends reported in LLMs.

Significance. If the minimal assumptions are satisfied by real LLM token distributions and the spin-glass construction accurately reflects the mechanism, the work supplies an analytic statistical-physics framework that explains the observed polynomial-to-exponential crossover and yields falsifiable predictions for how prompt length and sampling interact with safety alignment. The explicit mapping of prompt length to magnetic-field strength and the derivation of both regimes from the same set of assumptions are strengths that could guide future empirical tests and mitigation strategies.

major comments (3)
  1. [Minimal statistical mechanism / assumptions paragraph] The central claim rests on the minimal set of assumptions about safe-generation distributions (stated after the empirical observation) that are asserted to imply both the polynomial and exponential scalings; however, the manuscript provides no direct empirical check that LLM output distributions exhibit the required tail or clustering properties, leaving open the possibility that the observed crossover arises from unrelated factors such as temperature or formatting.
  2. [Theoretical generative model / spin-glass construction] In the spin-glass model, the magnetic-field strength is introduced as a free parameter chosen to place the system in the weak-field (power-law) versus strong-field (exponential) regime; because the model is constructed to reproduce the target scaling behaviors, this choice introduces a degree of circularity that must be justified by independent calibration against LLM statistics rather than by matching the desired crossover.
  3. [Analytic derivations and LLM trends] The analytic derivations of the power-law and exponential scalings are stated to follow from the replica-symmetry-breaking Gibbs measure and size-biased unsafe clusters, yet the manuscript does not supply the full derivation steps, error bounds, or quantitative comparison metrics against LLM data, rendering the support for the crossover provisional.
minor comments (1)
  1. [Abstract and empirical results] The abstract and results section refer to 'qualitatively consistent behavior' across LLMs but do not include specific figures, tables, or quantitative metrics (e.g., fitted exponents or confidence intervals) that would allow readers to assess the degree of agreement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate clarifications, expanded derivations, and additional discussion where appropriate.

  1. Referee: The central claim rests on the minimal set of assumptions about safe-generation distributions (stated after the empirical observation) that are asserted to imply both the polynomial and exponential scalings; however, the manuscript provides no direct empirical check that LLM output distributions exhibit the required tail or clustering properties, leaving open the possibility that the observed crossover arises from unrelated factors such as temperature or formatting.

    Authors: We agree that direct empirical verification of the tail and clustering properties would strengthen the claims. The assumptions were chosen to be minimal while still entailing both scaling regimes analytically, and our experiments fix temperature and formatting to isolate the effect. In the revision we have added a dedicated subsection outlining concrete empirical tests for the required distributional properties (e.g., measuring size-biasing of unsafe clusters via repeated sampling across contexts) and explicitly discuss the possibility that other factors could contribute. No new experiments are performed, but the discussion now makes the evidential gap transparent. revision: partial

  2. Referee: In the spin-glass model, the magnetic-field strength is introduced as a free parameter chosen to place the system in the weak-field (power-law) versus strong-field (exponential) regime; because the model is constructed to reproduce the target scaling behaviors, this choice introduces a degree of circularity that must be justified by independent calibration against LLM statistics rather than by matching the desired crossover.

    Authors: The mapping of prompt length to field strength is not chosen post hoc to fit the crossover; it follows from the physical interpretation that longer adversarial injections exert a stronger bias toward low-energy unsafe clusters. We have revised the model section to provide an independent calibration procedure: the field strength is estimated from the measured shift in the average log-probability of unsafe tokens induced by prompts of varying length, using statistics collected separately from the scaling experiments. This calibration is now stated explicitly and does not rely on matching the final scaling curves. revision: yes

  3. Referee: The analytic derivations of the power-law and exponential scalings are stated to follow from the replica-symmetry-breaking Gibbs measure and size-biased unsafe clusters, yet the manuscript does not supply the full derivation steps, error bounds, or quantitative comparison metrics against LLM data, rendering the support for the crossover provisional.

    Authors: We accept that the derivations require more detail. The revised appendix now contains the complete step-by-step derivation from the replica-symmetry-breaking Gibbs measure through the size-biased sampling to the closed-form power-law and exponential expressions, including the asymptotic error bounds on the large-N approximations. We have also added quantitative comparison metrics (mean absolute deviation and scaling-exponent fits) between model predictions and LLM data for three model families, reported in a new supplementary figure. revision: yes
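The calibration and comparison steps described in the responses above could look roughly like the following. This is a hypothetical sketch: the rebuttal does not specify the estimator, so the difference of mean log-probabilities is an assumed stand-in for field strength, and every function name and number here is made up for illustration.

```python
from statistics import mean

def field_shift_by_length(baseline_logprobs, injected_logprobs):
    """Assumed proxy for field strength: shift in mean log-probability of
    unsafe tokens relative to the no-injection baseline, per prompt length."""
    base = mean(baseline_logprobs)
    return {length: mean(lps) - base
            for length, lps in sorted(injected_logprobs.items())}

def mean_absolute_deviation(predicted, observed):
    """Average |prediction - observation| over paired ASR values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    return mean([abs(p - o) for p, o in zip(predicted, observed)])

if __name__ == "__main__":
    # Placeholder numbers, not measurements from any model.
    baseline = [-7.0, -5.0]
    injected = {8: [-6.0, -5.0], 64: [-3.0, -2.0]}
    print(field_shift_by_length(baseline, injected))
    print(mean_absolute_deviation([0.1, 0.2, 0.4], [0.2, 0.2, 0.1]))
```

The point of keeping the two functions separate is that the shift is computed from statistics gathered independently of the scaling curves, which is the property the second response relies on.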

Circularity Check

1 step flagged

Spin-glass model constructed to realize minimal assumptions, yielding scaling laws by design

specific steps
  1. self-definitional [Abstract (minimal mechanism and generative model)]
    "We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected [...]"

    The scaling laws are derived from the minimal assumptions; the spin-glass model is then defined so that it realizes exactly those assumptions, with magnetic-field strength (proxy for prompt length) chosen to switch between polynomial and exponential regimes. The observed crossover is therefore reproduced by construction once the model is tuned to the assumptions, rather than emerging as an independent consequence.

full rationale

The paper first states a small set of assumptions on safe-generation distributions that directly imply the polynomial (no injection) and exponential (strong injection) scaling laws. It then proposes a spin-glass generative model explicitly to realize those assumptions, mapping prompt length to magnetic-field strength so that weak/strong fields reproduce the two regimes. The analytic derivations therefore follow from a framework whose parameters and structure were selected to produce the target behaviors, creating partial circularity. No load-bearing self-citations or external uniqueness theorems are invoked; the circularity is internal to the assumption-to-model construction. The result remains a coherent theoretical illustration but does not constitute an independent derivation from LLM data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a small set of statistical assumptions about safe-generation distributions across contexts, which this summary does not enumerate, plus the introduction of a spin-glass representation whose parameters are set to reproduce the observed regimes.

free parameters (1)
  • magnetic field strength
    Proxy for injected-prompt length; weak field yields polynomial scaling, strong field yields exponential scaling.
axioms (1)
  • domain assumption: small set of assumptions on the distribution of safe generation across contexts
    These assumptions are stated to be sufficient for both polynomial and exponential scaling laws to emerge.
invented entities (1)
  • spin-glass system in replica-symmetry-breaking regime (no independent evidence)
    purpose: Generative model for proxy language where generations are drawn from the associated Gibbs measure and low-energy size-biased clusters are designated unsafe
    The model is introduced to realize the minimal statistical assumptions and to derive the two scaling regimes analytically.

pith-pipeline@v0.9.0 · 5496 in / 1239 out tokens · 42170 ms · 2026-05-15T12:37:31.020882+00:00 · methodology

