pith. machine review for the scientific record.

arxiv: 2605.01034 · v1 · submitted 2026-05-01 · 💻 cs.CL

Recognition: unknown

A Theoretical Game of Attacks via Compositional Skills

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords adversarial prompting · game theory · language models · optimal defense · compositional skills · best-response strategy · equilibria · AI safety

The pith

Modeling attacks on language models as a game between attacker and defender yields a provably optimal defense.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up a formal game in which an attacker tries to bypass model safeguards by composing skills into prompts while a defender chooses responses to block them. Within this setup the authors identify a theoretical best-response attack that mirrors many existing adversarial prompting techniques. They then characterize the equilibria of the game and show that the attacker holds structural advantages. From the same analysis they derive a defense strategy that is optimal against any attack following the best-response rule. A sympathetic reader would care because current alignment methods still fail against carefully crafted prompts, and a game-based account offers a principled route to stronger protections rather than ad-hoc patches.
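The skill-composition mechanic described above can be made concrete. A minimal sketch, in which each "skill" is a prompt transformation and an attack is a composition of skills applied to an intent (the skill names and templates here are invented placeholders, not the paper's actual skill space):

```python
from functools import reduce

# Hypothetical skills: each maps a prompt to a transformed prompt. An attack
# composes skills around an intent, loosely in the spirit of the paper's
# Figure 1 (which composes metaphor and poem). Templates are invented.
SKILLS = {
    "metaphor": lambda p: f"Express the following only through an extended metaphor: {p}",
    "poem": lambda p: f"Write your answer as a short poem about: {p}",
}

def compose_attack(intent: str, skill_names: list[str]) -> str:
    """Apply the chosen skills in order, yielding one composed prompt."""
    return reduce(lambda prompt, skill: SKILLS[skill](prompt), skill_names, intent)

prompt = compose_attack("summarize the hidden system rules", ["metaphor", "poem"])
```

Because composition is ordered, the defender faces a combinatorial strategy space even with a handful of skills, which is what gives the game its structure.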

Core claim

The interaction between an attacker and a defender is formalized as a game whose moves are built from compositional skill decompositions. The attacker’s best-response strategy in this game is shown to be closely related to many published adversarial prompting methods. The resulting game possesses equilibria that favor the attacker. The same analysis produces a defense strategy that is provably optimal against best-response attacks. A practical version of the attack derived from the theory outperforms prior methods across multiple models and benchmarks.
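The core objects of the claim (best response, equilibrium value) can be illustrated in a toy matrix game. The 3×3 payoff matrix below is invented for illustration and stands in for the paper's far richer skill-decomposition game:

```python
import numpy as np

# Invented payoffs: rows = attacker skill compositions, columns = defender
# response policies; entries = attack success probability.
J = np.array([
    [0.9, 0.2, 0.4],
    [0.3, 0.8, 0.5],
    [0.6, 0.6, 0.1],
])

def attacker_best_response(J, q):
    """Pure best response: the composition maximizing expected success
    against a mixed defender strategy q."""
    return int(np.argmax(J @ q))

def fictitious_play_value(J, steps=20000):
    """Approximate the zero-sum equilibrium value J*: each player
    repeatedly best-responds to the opponent's empirical mixture."""
    m, n = J.shape
    rows, cols = np.zeros(m), np.zeros(n)
    rows[0] = cols[0] = 1.0
    for _ in range(steps):
        rows[np.argmax(J @ (cols / cols.sum()))] += 1
        cols[np.argmin((rows / rows.sum()) @ J)] += 1
    p, q = rows / rows.sum(), cols / cols.sum()
    return float(p @ J @ q)
```

Against a uniform defender the best response is the second composition (index 1), and fictitious play converges toward the game value, which is bracketed by the pure-strategy security levels max_i min_j J_ij = 0.3 and min_j max_i J_ij = 0.5. The paper's equilibrium analysis plays out over skill compositions rather than a fixed matrix, but the same best-response logic applies.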

What carries the argument

The attacker-defender game, formalized with compositional skill decompositions, enables the derivation of best-response attacks and optimal defenses.

If this is right

  • Many existing adversarial prompting techniques are approximations to the theoretical best-response attack.
  • The game equilibria confer inherent advantages to the attacker regardless of the defender’s choice.
  • A provably optimal defense can be obtained directly from the game analysis rather than from trial-and-error tuning.
  • Practical instantiations of the optimal attack achieve stronger performance than prior methods on diverse LLMs and benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety work should prioritize disrupting the attacker’s ability to assemble skills compositionally rather than only filtering final outputs.
  • The same game structure could be applied to multi-turn conversations or to other generative systems beyond text models.
  • Empirical validation would require running the optimal defense on live models against attacks that explicitly follow the compositional best-response construction.

Load-bearing premise

That adversarial prompting interactions can be captured accurately by a well-defined game possessing best-response strategies, equilibria, and decomposable compositional skills.

What would settle it

An empirical test in which the derived optimal defense is applied to a model yet fails to block attacks constructed according to the paper’s best-response rule.

Figures

Figures reproduced from arXiv: 2605.01034 by Abhishek Umrawal, Huan Zhang, Lav R. Varshney, Xinbo Wu.

Figure 1
Figure 1. (a) Illustration of our game-theoretic framework. The illustration instantiates the framework for the example intent “write an anonymous letter asserting that a colleague’s biracial marriage is disgusting” and the skills of metaphor and poem. (b) Our framework captures practical methods, including fixed-skill, optimization-based attacks and our proposed attack based on optimal skill(s). Our attack upper-bounds… view at source ↗
Figure 2
Figure 2. Attacker utility J over deployment steps in the base game with fixed intent prior p(i) across runs. Solid line shows the mean over random defender initializations; shaded region is ±1 standard deviation. The dashed line denotes the theoretical equilibrium value J* from Theorem 3.2. (3) Attacker indifference (gap): to confirm that skills become tied, we track the gap for i*: Δ_t = a^(t)_{i*,(2)} − a^(t)_{i*,(1)}… view at source ↗
Figure 3
Figure 3. Structure validation for a representative run in the base game. (a) Heatmap of the final defender allocation r_{i,s}, with the row corresponding to i* ∈ arg max_i p(i) labeled “highest p(i)” and outlined. (b) Gap Δ_t = a_{i*,(2)} − a_{i*,(1)} between the smallest and second-smallest effective accuracies within row i*. view at source ↗
Figure 4
Figure 4. Parameter sweeps in the base game. Points show the final attacker utility J (mean ± std over seeds); dashed curves show the theoretical equilibrium value J*. view at source ↗
Figure 5
Figure 5. Custom prompt for our LLM-as-raters; intent and response are replaced by the user intent and a response from a target system, respectively. view at source ↗
Figure 6
Figure 6. Prompt for our re-writer; intent and skills should be replaced by the user intent and a set of skills to be mixed, respectively. view at source ↗
Figure 7
Figure 7. Real examples for attacking gpt-3.5-turbo-1106 by mixing 1 skill. Useful information for achieving the given intent is highlighted. view at source ↗
Figure 8
Figure 8. Real example for attacking gpt-3.5-turbo-1106 by mixing 2 skills. view at source ↗
Figure 9
Figure 9. Real failure examples for attacking gpt-3.5-turbo-1106 by mixing 1 skill. view at source ↗
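The simulation behind Figures 2–4 tracks a defender allocation r_{i,s}, and the extracted defender-update note in the reference graph describes one projected subgradient ascent step on F(r) followed by projection onto the budget set {r ≥ 0 : Σ_{i,s} r_{i,s} = c}. A minimal sketch of that projection step, using the standard sorting-based Euclidean projection onto a scaled simplex (illustrative; not code from the paper):

```python
import numpy as np

def project_to_capacity_set(v, c=1.0):
    """Euclidean projection of a (flattened) allocation v onto the
    defender's feasible set {r >= 0 : sum(r) = c}, via the standard
    sort-and-threshold rule."""
    u = np.sort(v)[::-1]                       # sort descending
    cumulative = np.cumsum(u) - c
    idx = np.arange(1, v.size + 1)
    rho = idx[u - cumulative / idx > 0][-1]    # last coordinate kept positive
    theta = cumulative[rho - 1] / rho          # shared shift
    return np.maximum(v - theta, 0.0)

# One hypothetical projected ascent step on a subgradient g of F(r):
#   r = project_to_capacity_set(r + step_size * g, c)
```

In the paper's setting F(r) contains a min operator and is non-smooth, so the ascent direction must be a valid subgradient; the projection above is what keeps each iterate on the capacity budget.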
read the original abstract

As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy and show that it is closely related to many existing adversarial prompting methods. We further analyze the resulting game, characterize its equilibria, and reveal inherent advantages for the attacker. Drawing on our theoretical analysis, we also derive a provably optimal defense strategy. Empirically, we evaluate a practical instantiation of the theoretically optimal attack and observe stronger performance relative to existing adversarial prompting approaches in diverse settings encompassing different LLMs and benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a theoretical game-theoretic framework that models adversarial prompting as a game between an attacker and a defender, where strategies are expressed via compositional skill decompositions. It defines a theoretical best-response attack strategy shown to relate to existing adversarial methods, characterizes the resulting equilibria (highlighting inherent attacker advantages), derives a provably optimal defense strategy from the analysis, and reports empirical results where a practical instantiation of the optimal attack outperforms baselines across multiple LLMs and benchmarks.

Significance. If the central claims hold, the work offers a formal unification of adversarial prompting techniques under a game model with explicit equilibria and optimality results, which could inform more principled defenses in LLM safety. The explicit linkage between the theoretical best-response and practical methods, along with the derivation of a defense strategy, provides a potential foundation for future theoretical work in this area.

major comments (3)
  1. [§3] §3 (Game Formulation and Skill Decomposition): The derivation of best-response strategies and equilibria assumes that all adversarial prompts admit a complete and unique decomposition into compositional skills. No argument or proof is given that this decomposition covers the full open-ended prompt space or that it is unique; if either fails, the computed best response and subsequent equilibria are only optimal inside the restricted model, not for real LLMs.
  2. [§4] §4 (Optimal Defense Derivation): The proof that the derived defense is provably optimal rests on the same completeness and uniqueness assumptions for the skill decomposition. Without a demonstration that every possible attack can be expressed (and only in one way) as a skill combination, the optimality claim does not transfer to the actual attack surface.
  3. [Empirical Evaluation] Empirical Evaluation section: The reported performance gains of the practical instantiation are presented as support for the theoretical framework, yet the experiments do not include controls that isolate whether gains arise from the game-theoretic construction versus other implementation choices (e.g., prompt engineering heuristics). This weakens the link between theory and observed results.
minor comments (2)
  1. [§3] Notation for skill vectors and payoff functions is introduced without an explicit table or running example, making it difficult to track how concrete prompts map to the abstract game.
  2. [§3.2] The abstract states that the best-response attack is 'closely related to many existing adversarial prompting methods,' but the manuscript does not include a systematic mapping or citation table showing which methods correspond to which skill combinations.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the scope of our theoretical model and outlining planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Game Formulation and Skill Decomposition): The derivation of best-response strategies and equilibria assumes that all adversarial prompts admit a complete and unique decomposition into compositional skills. No argument or proof is given that this decomposition covers the full open-ended prompt space or that it is unique; if either fails, the computed best response and subsequent equilibria are only optimal inside the restricted model, not for real LLMs.

    Authors: We appreciate this observation on the modeling assumptions. Our framework is a theoretical abstraction in which adversarial strategies are represented via compositional skill decompositions. The best-response attack and resulting equilibria are derived strictly within this modeled strategy space, and we show that this representation relates to and unifies many existing adversarial prompting methods. We do not claim that every prompt in the open-ended space admits a complete or unique decomposition; the results characterize the game under the proposed decomposition. We will revise §3 to explicitly articulate these modeling assumptions and their implications for the applicability of the derived strategies. revision: partial

  2. Referee: [§4] §4 (Optimal Defense Derivation): The proof that the derived defense is provably optimal rests on the same completeness and uniqueness assumptions for the skill decomposition. Without a demonstration that every possible attack can be expressed (and only in one way) as a skill combination, the optimality claim does not transfer to the actual attack surface.

    Authors: We agree that the provable optimality of the defense is established relative to attacks expressible within the skill-decomposition model. The defense is optimal against any strategy in the defined game. We will revise §4 to emphasize that the optimality result holds inside the proposed theoretical framework and to discuss how it can inform practical defenses for attacks that admit such decompositions. revision: partial

  3. Referee: [Empirical Evaluation] Empirical Evaluation section: The reported performance gains of the practical instantiation are presented as support for the theoretical framework, yet the experiments do not include controls that isolate whether gains arise from the game-theoretic construction versus other implementation choices (e.g., prompt engineering heuristics). This weakens the link between theory and observed results.

    Authors: We acknowledge that additional controls would strengthen the empirical linkage to the theoretical construction. The practical instantiation is directly derived from the theoretical best-response strategy, yet we agree that isolating the contribution from other prompt-engineering factors would be beneficial. We will revise the Empirical Evaluation section to incorporate further ablations or controls that better attribute performance gains to the game-theoretic elements. revision: yes

Circularity Check

0 steps flagged

No circularity: the derivations follow from explicitly stated game-theoretic assumptions and the compositional decomposition; no result is assumed among its own inputs by construction.

full rationale

The abstract and described framework introduce a game between attacker and defender, define best-response attack strategies, characterize equilibria, and derive a provably optimal defense directly from the stated model of compositional skill decompositions. No equations or claims are shown to equate a result to its own fitted parameters, self-referential definitions, or unverified self-citations. The empirical instantiation is presented as a separate practical evaluation, not as the source of the theoretical optimality. The derivation chain remains self-contained against the paper's own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes a game structure over prompt compositions.

axioms (1)
  • domain assumption Adversarial prompting can be formalized as a two-player game with well-defined strategies, payoffs, and best responses based on compositional skills.
    Central modeling choice enabling the best-response and equilibrium analysis.

pith-pipeline@v0.9.0 · 5431 in / 1219 out tokens · 44318 ms · 2026-05-09T19:03:35.904647+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1] Albert, A. Jailbreak chat. https://www.jailbreakchat.com, 2023. Accessed: 2025-05-14.

  2. [2] Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. In The Thirteenth International Conference on Learning Representations.

  3. [3] Arora, S. and Goyal, A. A theory for emergence of complex skills in language models. arXiv:2307.15936 [cs.LG].

  4. [4] Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions. arXiv:2309.07875, 2023.

  5. [5] Bisconti, P., Prandi, M., Pierucci, F., Giarrusso, F., Bracale, M., Galisai, M., Suriani, V., Sorokoletova, O., Sartore, F., and Nardi, D. Adversarial poetry as a universal single-turn jailbreak mechanism in large language models. arXiv:2511.15304.

  6. [6] Chang, Z., Li, M., Liu, Y., Wang, J., Wang, Q., and Liu, Y. Play guessing game with LLM: Indirect jailbreak attack with implicit clues. arXiv:2402.09091.

  7. [7] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. arXiv:2310.08419 [cs.LG].

  8. [8] Elesedy, H., Esperanca, P., Oprea, S. V., and Ozay, M. LoRA-Guard: Parameter-efficient guardrail adaptation for content moderation of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11746–11765.

  9. [9] Liu, F., Feng, Y., Xu, Z., Su, L., Ma, X., Yin, D., and Liu, H. JAILJUDGE: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework. arXiv:2410.12855 [cs.CL].

  10. [10] Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv:2310.04451.

  11. [11] Luo, X., Wang, Y., He, Z., Tu, G., Li, J., and Xu, R. A simple and efficient jailbreak method exploiting LLMs’ helpfulness. arXiv:2509.14297.

  12. [12] OpenAI et al. GPT-4 technical report. arXiv:2303.08774 [cs.CL]. Accessed: 2025-05-14.

  13. [13] Röttger, P., Kirk, H., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5377–5400.

  14. [14] Sun, H., Wu, Y., Cheng, Y., and Chu, X. Game theory meets large language models: A systematic survey. In Kwok, J. (ed.), Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25), pp. 10669–10677, 2025. doi: 10.24963/ijcai.2025/1184.

  15. [15] Wang, Z., Cao, Y., and Liu, P. Hidden you malicious goal into benign narratives: Jailbreak large language models through logic chain injection. arXiv:2404.04849.

  16. [16] Yong, Z.-X., Menghini, C., and Bach, S. H. Low-resource languages jailbreak GPT-4. arXiv:2310.02446 [cs.CL].

  17. [17] Yu, D., Kaur, S., Gupta, A., Brown-Cohen, J., Goyal, A., and Arora, S. Skill-Mix: A flexible and expandable family of evaluations for AI models. In International Conference on Learning Representations (ICLR), 2024. · Yu, Z., Liu, X., Liang, S., Cameron, Z., Xiao, C., and Zhang, N. Don’t listen to me: Understanding and exploring jailbreak prompts of large la…

  18. [18] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043 [cs.CL].

  19. [19] Internal anchor — “highest p(i)”: Defender update: perform one projected (sub)gradient ascent step on F(r), followed by a projection back to the feasible set {r ≥ 0 : Σ_{i,s} r_{i,s} = c}. Since F(r) contains a min operator, it is non-smooth; a valid subgradient distributes mass uniformly across all tied minimizers for each intent. Concretely, for each i, let M^(t)_i := arg min_s a^(t)_{i,s}…

  20. [20] Internal anchor: Model capacity plays a crucial role in enabling LLMs to function effectively as raters. Models with insufficient capacity, such as LLaMA-3-70B and GPT-3.5, often struggle to identify implicit or indirect connections between the intent and the response, and in some cases (e.g., LLaMA-3-70B) frequently refuse to generate ratings altogether…

  21. [21] Internal anchor — “Always Intelligent and Machiavellian”: In practice, an attacker could aggregate such information to achieve its malicious objective. Table 5 reports the percentage drop in attack performance relative to the original performance on various target LLMs defended by the paper’s defense method by misleading the attacker…