Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3
The pith
Strong adversarial prompt injections turn slow polynomial growth of jailbreak success into exponential growth with more inference samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A small set of assumptions on the distribution of safe generation across contexts produces both the slow polynomial scaling of attack success without injection and the exponential scaling that appears once a strong prompt injection is present. These assumptions are realized by a theoretical model in which generations are drawn from the Gibbs measure of a spin-glass system operating in a replica-symmetry-breaking regime, with unsafe outputs corresponding to a subset of low-energy, size-biased clusters. Short injected prompts act as a weak magnetic field aligned toward unsafe cluster centers and produce power-law scaling, while long prompts act as a strong field and produce exponential scaling.
What carries the argument
Spin-glass system in replica-symmetry-breaking regime whose Gibbs measure generates model outputs, with unsafe behavior confined to low-energy size-biased clusters; short versus long injected prompts correspond to weak versus strong magnetic fields.
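To make "size-biased clusters" concrete, the sketch below samples cluster weights by stick-breaking, the construction the page later associates with the paper's Poisson-Dirichlet distribution. It assumes the two-parameter convention PD(m, 0) with 0 < m < 1, under which the sticks V_i ~ Beta(1 - m, i·m) yield weights in size-biased order. The function names, the truncation to a finite number of clusters, and the rule that flags each cluster as unsafe independently with a fixed probability are all hypothetical choices for illustration, not the paper's procedure.

```python
import random


def stick_breaking_weights(m, n_clusters, rng):
    """Sample the first n_clusters weights of a GEM(m, 0) sequence.

    V_i ~ Beta(1 - m, i * m) and w_i = V_i * prod_{j < i} (1 - V_j).
    The weights come out in size-biased order; ranking them gives a
    (truncated) draw from the Poisson-Dirichlet PD(m, 0) law.
    """
    if not 0.0 < m < 1.0:
        raise ValueError("m must lie strictly between 0 and 1")
    weights = []
    remaining = 1.0  # length of stick not yet assigned to a cluster
    for i in range(1, n_clusters + 1):
        v = rng.betavariate(1.0 - m, i * m)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights


def unsafe_mass(weights, unsafe_fraction, rng):
    """Total weight of clusters flagged unsafe.

    Illustrative assumption: each cluster is flagged independently with
    probability unsafe_fraction. The paper's designation rule may differ.
    """
    return sum(w for w in weights if rng.random() < unsafe_fraction)


if __name__ == "__main__":
    rng = random.Random(0)
    for context in range(5):
        w = stick_breaking_weights(0.5, 200, rng)
        print(context, round(unsafe_mass(w, 0.1, rng), 4))
```

Repeating the draw across many simulated contexts gives a distribution of per-context unsafe mass, which is the kind of object the minimal assumptions are stated about.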
If this is right
- Attack success grows polynomially with inference-time samples when no strong injection is present.
- The same success rate grows exponentially once a sufficiently strong prompt injection is introduced.
- The spin-glass model with size-biased unsafe clusters reproduces the minimal statistical assumptions needed for both scalings.
- Empirical trends in current large language models match the analytically derived polynomial-to-exponential crossover.
Where Pith is reading between the lines
- Safety evaluations that rely on few samples may systematically underestimate risk when strong injections are possible.
- Increasing inference-time compute without corresponding defenses could widen the gap between safe and unsafe behavior under attack.
- The magnetic-field analogy suggests that prompt-length thresholds could be used to predict when scaling becomes exponential for new attack methods.
Load-bearing premise
The probability that a random generation is safe follows a distribution across contexts that admits both a slow polynomial regime and an exponential regime once a bias toward unsafe clusters is applied.
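One toy family that shows both regimes, offered as an assumption of this sketch rather than the paper's stated distribution: let the per-sample unsafe probability p vary across contexts as Beta(a, b). The chance that k independent samples are all safe is then E[(1 - p)^k] = B(a, b + k) / B(a, b), which decays like k^(-a), so attack success approaches 1 only polynomially. If a bias lifts every context to p >= p_min, the same quantity picks up a factor (1 - p_min)^k and the decay becomes exponential. The names `log_fail_beta`, `log_fail_shifted`, and `p_min` are hypothetical.

```python
import math


def log_fail_beta(k, a, b):
    """log E[(1 - p)^k] for p ~ Beta(a, b), via B(a, b + k) / B(a, b)."""
    return (math.lgamma(b + k) + math.lgamma(a + b)
            - math.lgamma(a + b + k) - math.lgamma(b))


def log_fail_shifted(k, a, b, p_min):
    """Same quantity when p = p_min + (1 - p_min) * q with q ~ Beta(a, b).

    Here 1 - p = (1 - p_min) * (1 - q), so an exponential factor
    (1 - p_min)^k multiplies the polynomial Beta term.
    """
    return k * math.log1p(-p_min) + log_fail_beta(k, a, b)


def asr(log_fail):
    """Attack success rate 1 - exp(log_fail), computed stably."""
    return -math.expm1(log_fail)


if __name__ == "__main__":
    a, b, p_min = 0.5, 2.0, 0.05
    for k in (1, 10, 100, 1000):
        print(k,
              round(asr(log_fail_beta(k, a, b)), 6),
              round(asr(log_fail_shifted(k, a, b, p_min)), 6))
```

On a log-log plot the unshifted failure probability is asymptotically a straight line with slope -a, while the shifted one is a straight line in k on a semi-log plot.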
What would settle it
Measure attack success rate versus number of inference samples for the same model and task using both short and long adversarial prompts; if the claimed mechanism holds, the growth of the success rate with sample count should shift from polynomial to exponential as prompt length increases.
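A crude way to run this comparison on measured curves is sketched below. It fits the log of the failure probability, log(1 - ASR), once against log k (a straight line indicates a power law) and once against k (a straight line indicates exponential decay), then reports which fit leaves the smaller residual. This is a diagnostic chosen for illustration, not the paper's analysis; `classify_scaling` and its return keys are hypothetical names, and a serious test would need confidence intervals and a proper model-selection criterion.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares y = intercept + slope * x; returns residual sum of squares too."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, ss_res


def classify_scaling(ks, asr_values):
    """Compare power-law and exponential fits to the failure probability 1 - ASR."""
    # Points with ASR == 1 are dropped because log(0) is undefined.
    points = [(float(k), math.log(1.0 - a))
              for k, a in zip(ks, asr_values) if 0.0 <= a < 1.0]
    if len(points) < 3:
        raise ValueError("need at least three usable (k, ASR) points")
    k_vals = [k for k, _ in points]
    log_fail = [lf for _, lf in points]
    pl_slope, _, pl_res = linear_fit([math.log(k) for k in k_vals], log_fail)
    ex_slope, _, ex_res = linear_fit(k_vals, log_fail)
    return {
        "label": "power-law" if pl_res < ex_res else "exponential",
        "power_law_exponent": -pl_slope,   # failure ~ k^(-exponent)
        "exponential_rate": -ex_slope,     # failure ~ exp(-rate * k)
        "power_law_residual": pl_res,
        "exponential_residual": ex_res,
    }


if __name__ == "__main__":
    ks = [1, 2, 4, 8, 16, 32, 64]
    synthetic = [1.0 - math.exp(-0.1 * k) for k in ks]  # synthetic data, not measurements
    print(classify_scaling(ks, synthetic)["label"])
```

Running the same function on the short-prompt and long-prompt curves for one model and task would show whether the preferred fit changes with prompt length.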
Original abstract
Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We observe qualitatively consistent behavior across a broad range of large language models, spanning parameter scales from 3B to 70B. In particular, the main trends remain stable across multiple attack methods, such as GCG and AutoDAN, as well as across benchmark datasets such as AdvBench and HarmBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that strong adversarial prompt-injection attacks amplify attack success rate (ASR) in safety-aligned LLMs from slow polynomial growth (without injection) to exponential growth with the number of inference-time samples. It identifies a minimal set of statistical assumptions on the distribution of safe generations across contexts under which both scaling regimes follow analytically, and realizes these assumptions via a spin-glass generative model in the replica-symmetry-breaking regime whose Gibbs measure yields size-biased unsafe clusters; short (weak-field) injections produce power-law scaling while long (strong-field) injections produce exponential scaling, with qualitatively matching trends reported in LLMs.
Significance. If the minimal assumptions are satisfied by real LLM token distributions and the spin-glass construction accurately reflects the mechanism, the work supplies an analytic statistical-physics framework that explains the observed polynomial-to-exponential crossover and yields falsifiable predictions for how prompt length and sampling interact with safety alignment. The explicit mapping of prompt length to magnetic-field strength and the derivation of both regimes from the same set of assumptions are strengths that could guide future empirical tests and mitigation strategies.
major comments (3)
- [Minimal statistical mechanism / assumptions paragraph] The central claim rests on the minimal set of assumptions about safe-generation distributions (stated after the empirical observation) that are asserted to imply both the polynomial and exponential scalings; however, the manuscript provides no direct empirical check that LLM output distributions exhibit the required tail or clustering properties, leaving open the possibility that the observed crossover arises from unrelated factors such as temperature or formatting.
- [Theoretical generative model / spin-glass construction] In the spin-glass model, the magnetic-field strength is introduced as a free parameter chosen to place the system in the weak-field (power-law) versus strong-field (exponential) regime; because the model is constructed to reproduce the target scaling behaviors, this choice introduces a degree of circularity that must be justified by independent calibration against LLM statistics rather than by matching the desired crossover.
- [Analytic derivations and LLM trends] The analytic derivations of the power-law and exponential scalings are stated to follow from the replica-symmetry-breaking Gibbs measure and size-biased unsafe clusters, yet the manuscript does not supply the full derivation steps, error bounds, or quantitative comparison metrics against LLM data, rendering the support for the crossover provisional.
minor comments (1)
- [Abstract and empirical results] The abstract and results section report 'qualitatively consistent behavior' across LLMs but do not include specific figures, tables, or quantitative metrics (e.g., fitted exponents or confidence intervals) that would allow readers to assess the degree of agreement.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate clarifications, expanded derivations, and additional discussion where appropriate.
Point-by-point responses
- Referee: The central claim rests on the minimal set of assumptions about safe-generation distributions (stated after the empirical observation) that are asserted to imply both the polynomial and exponential scalings; however, the manuscript provides no direct empirical check that LLM output distributions exhibit the required tail or clustering properties, leaving open the possibility that the observed crossover arises from unrelated factors such as temperature or formatting.
  Authors: We agree that direct empirical verification of the tail and clustering properties would strengthen the claims. The assumptions were chosen to be minimal while still entailing both scaling regimes analytically, and our experiments fix temperature and formatting to isolate the effect. In the revision we have added a dedicated subsection outlining concrete empirical tests for the required distributional properties (e.g., measuring size-biasing of unsafe clusters via repeated sampling across contexts) and explicitly discuss the possibility that other factors could contribute. No new experiments are performed, but the discussion now makes the evidential gap transparent.
  revision: partial
- Referee: In the spin-glass model, the magnetic-field strength is introduced as a free parameter chosen to place the system in the weak-field (power-law) versus strong-field (exponential) regime; because the model is constructed to reproduce the target scaling behaviors, this choice introduces a degree of circularity that must be justified by independent calibration against LLM statistics rather than by matching the desired crossover.
  Authors: The mapping of prompt length to field strength is not chosen post hoc to fit the crossover; it follows from the physical interpretation that longer adversarial injections exert a stronger bias toward low-energy unsafe clusters. We have revised the model section to provide an independent calibration procedure: the field strength is estimated from the measured shift in the average log-probability of unsafe tokens induced by prompts of varying length, using statistics collected separately from the scaling experiments. This calibration is now stated explicitly and does not rely on matching the final scaling curves.
  revision: yes
- Referee: The analytic derivations of the power-law and exponential scalings are stated to follow from the replica-symmetry-breaking Gibbs measure and size-biased unsafe clusters, yet the manuscript does not supply the full derivation steps, error bounds, or quantitative comparison metrics against LLM data, rendering the support for the crossover provisional.
  Authors: We accept that the derivations require more detail. The revised appendix now contains the complete step-by-step derivation from the replica-symmetry-breaking Gibbs measure through the size-biased sampling to the closed-form power-law and exponential expressions, including the asymptotic error bounds on the large-N approximations. We have also added quantitative comparison metrics (mean absolute deviation and scaling-exponent fits) between model predictions and LLM data for three model families, reported in a new supplementary figure.
  revision: yes
Circularity Check
Spin-glass model constructed to realize minimal assumptions, yielding scaling laws by design
specific steps
- self-definitional [Abstract (minimal mechanism and generative model)]
"We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We point out how this model naturally realizes the minimal assumptions. Short injected ..."
The scaling laws are derived from the minimal assumptions; the spin-glass model is then defined so that it realizes exactly those assumptions, with magnetic-field strength (proxy for prompt length) chosen to switch between polynomial and exponential regimes. The observed crossover is therefore reproduced by construction once the model is tuned to the assumptions, rather than emerging as an independent consequence.
full rationale
The paper first states a small set of assumptions on safe-generation distributions that directly imply the polynomial (no injection) and exponential (strong injection) scaling laws. It then proposes a spin-glass generative model explicitly to realize those assumptions, mapping prompt length to magnetic-field strength so that weak/strong fields reproduce the two regimes. The analytic derivations therefore follow from a framework whose parameters and structure were selected to produce the target behaviors, creating partial circularity. No load-bearing self-citations or external uniqueness theorems are invoked; the circularity is internal to the assumption-to-model construction. The result remains a coherent theoretical illustration but does not constitute an independent derivation from LLM data.
Axiom & Free-Parameter Ledger
free parameters (1)
- magnetic field strength
axioms (1)
- domain assumption: small set of assumptions on the distribution of safe generation across contexts
invented entities (1)
- spin-glass system in replica-symmetry-breaking regime: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Poisson-Dirichlet distribution Λ_{m_l} ... stick-breaking construction
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.