The Evaluation Game: Beyond Static LLM Benchmarking

Anne-Marie Kermarrec; Jade Garcia-Bourrée; Paul Wang; Vincent Corruble

arXiv: 2605.19377 · v1 · pith:QH7YROHD · submitted 2026-05-19 · cs.LG · cs.AI

Pith reviewed 2026-05-20 07:52 UTC · model grok-4.3

keywords: LLM safety, jailbreak evaluation, game-theoretic framework, group actions, adversarial fine-tuning, local generalization, benchmark orbits, trainer adaptation

The pith

A benchmark for LLM jailbreaks is an orbit under the evaluator's group action rather than a static set of prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a game-theoretic model for the interaction between a trainer fine-tuning a large language model against jailbreaks and an evaluator trying to find new failures. By using group actions to represent symmetries in adversarial prompts, such as cyclic translations on a circle, the framework shows that evaluation outcomes depend on the trainer's generalization range. Below a critical threshold the evaluator maintains a constant miss ratio for linearly many rounds, while empirical tests on Llama, Qwen, and Mistral models confirm that fine-tuning produces only local generalization. This matters because static benchmarks cannot tell whether a model has truly improved or has merely memorized the tested examples.

Core claim

The central claim is that adversarial evaluation of large language models should be formalized as a two-player game between trainer and evaluator, with data augmentation captured through group actions on prompt symmetries. In the simplest nontrivial case of the circle with cyclic translation groups, different trainer generalization regimes produce distinct long-term behaviors, including a constant miss ratio over linearly many rounds when generalization stays below a critical threshold. Experiments across three model families further establish that fine-tuning on adversarial prompts yields refusal rates that are highly correlated with prompt distance to the training set, implying only local generalization.

What carries the argument

The orbit of prompts under the evaluator's group action, which turns any benchmark into the full set of symmetry-transformed versions rather than a fixed collection of examples.
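
To make the orbit idea concrete, here is a minimal Python sketch. It assumes the circle is discretized to the integers modulo `n` and that the evaluator's group is generated by a single translation `step`. The names `orbit` and `benchmark_orbit`, and the discretization itself, are illustrative assumptions rather than the paper's construction.

```python
def orbit(x: int, step: int, n: int) -> list[int]:
    """Orbit of x in Z_n under the cyclic group generated by translation by `step`."""
    points = []
    current = x % n
    # Keep translating until we return to a point already visited.
    while current not in points:
        points.append(current)
        current = (current + step) % n
    return points


def benchmark_orbit(seeds: list[int], step: int, n: int) -> list[int]:
    """Union of the orbits of all seed prompts (toy 'benchmark as an orbit')."""
    covered = set()
    for seed in seeds:
        covered.update(orbit(seed, step, n))
    return sorted(covered)
```

For example, `benchmark_orbit([0, 1], step=4, n=12)` expands two seed points into six positions, which is the sense in which a fixed prompt list understates what the evaluator can actually probe.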

If this is right

Audit protocols that ignore trainer-side adaptation cannot distinguish a genuine robustness fix from a memorized patch on specific prompts.
In the cyclic translation setting, generalization range determines whether the miss ratio stays constant, drops, or exhibits other long-term patterns.
Fine-tuning induces only local generalization, with refusal performance dropping as prompt distance increases.
A benchmark must be treated as the full orbit under the group action to capture all equivalent adversarial instances.
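
The locality claim above can be probed with a simple distance-binned refusal curve. The sketch below is hypothetical: it assumes prompt embeddings are available as NumPy arrays, uses plain cosine distance to the nearest training-pool prompt (the paper's normalized distance z and its Bayesian curve fits are not reproduced), and groups test prompts into equal-population bins. The function names are illustrative.

```python
import numpy as np


def cosine_distance_to_pool(test_emb: np.ndarray, pool_emb: np.ndarray) -> np.ndarray:
    """Cosine distance from each test embedding to its nearest training-pool embedding."""
    test_unit = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    pool_unit = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    return (1.0 - test_unit @ pool_unit.T).min(axis=1)


def refusal_rate_by_bin(distance, refused, n_bins: int):
    """Return (mean distance, refusal rate) per equal-population distance bin."""
    distance = np.asarray(distance, dtype=float)
    refused = np.asarray(refused, dtype=float)  # 1.0 = model refused, 0.0 = complied
    order = np.argsort(distance, kind="stable")
    return [
        (float(distance[idx].mean()), float(refused[idx].mean()))
        for idx in np.array_split(order, n_bins)
    ]
```

If fine-tuning generalizes only locally, the refusal rate in the returned pairs should fall as the mean distance rises; a flat curve would point the other way.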

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation procedures could be made iterative by repeatedly sampling new test cases from the current orbit after each trainer update.
Distance metrics in prompt embedding space might serve as practical proxies for predicting the extent of local robustness after fine-tuning.
The same orbit perspective could apply to other transformation families beyond cyclic groups, such as semantic paraphrases or style shifts.
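
As an illustration of the iterative-evaluation idea in the first item, the following toy loop plays rounds on a discretized circle. Everything here is an assumption made for the sketch: the evaluator moves its probe by a fixed translation each round, the trainer is taken to cover every point within `radius` of any previously seen probe, and a "miss" is counted when a fresh probe falls outside that covered region. The paper's own definition of miss ratio may differ.

```python
def circular_distance(a: int, b: int, n: int) -> int:
    """Shortest distance between a and b on the cycle Z_n."""
    d = abs(a - b) % n
    return min(d, n - d)


def play_rounds(n: int, step: int, radius: int, rounds: int, start: int = 0) -> float:
    """Fraction of rounds in which the trainer's patches miss the new probe."""
    seen = []      # probes revealed to the trainer so far
    misses = 0
    probe = start % n
    for _ in range(rounds):
        covered = any(circular_distance(probe, s, n) <= radius for s in seen)
        if not covered:
            misses += 1
        seen.append(probe)           # trainer patches around this probe
        probe = (probe + step) % n   # evaluator translates to the next orbit point
    return misses / rounds
```

With `radius` smaller than the gap between successive probes the miss fraction stays at 1.0 until the orbit wraps around, while a larger `radius` drives it down quickly. This only mimics the qualitative dependence on generalization range and is not a reproduction of the paper's regimes.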

Load-bearing premise

Symmetries and data augmentations in adversarial prompts can be captured by group actions in a way that faithfully models how trainers adapt and generalize from seen examples.

What would settle it

Measure whether the evaluator's miss ratio remains constant across a linear number of rounds when the trainer's generalization range is set below the critical threshold identified for the cyclic group, or test whether refusal rates on held-out prompts correlate with their distance to fine-tuning examples in additional model families.

Figures

Figures reproduced from arXiv: 2605.19377 by Anne-Marie Kermarrec, Jade Garcia-Bourrée, Paul Wang, Vincent Corruble.

**Figure 2.** Refusal rate $p_{\mathrm{ft}}(z)$ versus normalized embedding distance $z$ from the active training pool, per family. Orange: Constrained Piecewise-Linear posterior median curves for $p_{\mathrm{ft}}$. Blue: base refusal rates. Each dot represents a 5%-population bin, thus around 100 to 150 prompts. The posterior probabilities for the Constrained Piecewise-Linear shape are very close to 1 here, actually within the resolution of our mod…

**Figure 3.** Rank-1 cohort (exposure budget 1600): $p_{\mathrm{ft}}(z)$ direct-fit, one panel per family, three rows for three embedding metrics. The compact top-right annotation gives the best-fitting shape, the model-averaged posterior probabilities $P(z\text{-dep})$ and $P(p_{\mathrm{ft}}(1) < p_{\mathrm{ft}}(0))$, and $\Delta\mathrm{WAIC}_{\mathrm{null}}$. Bottom row of cards: capability and fluency diagnostics per family (MMLU-200 accuracy drift, WikiText-2 perplexity drift, benign-prompt …

**Figure 4.** Rank-2 cohort (LLAMA and MISTRAL at exposure budget 800; QWEN at exposure budget 1600): $p_{\mathrm{ft}}(z)$ direct-fit, one panel per family, three rows for three embedding metrics. Layout and annotations identical to …

Original abstract

As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator's group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The group action framing for prompt orbits is a clean reframing but the math does not yet connect to how fine-tuning actually works in these models.

The main takeaway is a game-theoretic setup that models the evaluator and trainer as players, with group actions standing in for data augmentation on adversarial prompts. The simplest case uses cyclic translations on a circle, and they derive different regimes based on how far the trainer generalizes. They also report that fine-tuning on jailbreaks produces only local effects in Llama, Qwen, and Mistral families, with refusal rates dropping as prompts move away from the training examples. That locality observation is the most concrete part of the work and lines up with what many practitioners already suspect about robustness patches.

Referee Report

2 major / 2 minor

Summary. The paper introduces a game-theoretic framework modeling the interaction between an LLM evaluator auditing for jailbreaks and a trainer using fine-tuning as defense. It represents data augmentation via group actions, with the circle and cyclic translations as the basic case, and derives regimes for the evaluator's miss ratio depending on the trainer's generalization threshold. Below a critical threshold the miss ratio remains constant over linearly many rounds; the framework recasts benchmarks as orbits under the evaluator's group action. Empirical results on Llama, Qwen and Mistral families show that fine-tuning induces only local generalization, with refusal rates correlated to distance from the fine-tuning prompts.

Significance. If the central modeling assumptions hold, the work could meaningfully shift adversarial evaluation practice by requiring audit protocols to account for trainer-side adaptation rather than treating benchmarks as static. The reported locality results across three model families constitute concrete, falsifiable evidence that is a strength of the manuscript. The group-action formalism itself is a novel formal device for this domain, but its utility hinges on establishing a tighter link to actual fine-tuning dynamics.

major comments (2)

[Regime analysis / generalization regimes] In the regime analysis (the derivation of constant miss ratio for linearly many rounds below the critical generalization threshold), the cyclic translation group is introduced to capture symmetries in adversarial prompts, yet no derivation or mapping is supplied that connects this group action to neural-network parameter updates, loss landscapes, or the propagation of robustness during fine-tuning. This link is load-bearing for the claim that protocols ignoring trainer adaptation cannot distinguish genuine fixes from memorized patches.
[Empirical evaluation] In the empirical evaluation, refusal rates are reported to be highly correlated with distance to fine-tuning prompts for Llama, Qwen and Mistral. However, the experiments do not test whether the observed locality reproduces the specific miss-ratio trajectories or the constant-miss-ratio regime predicted by the cyclic-group model; without this check the empirical support does not yet validate the framework's quantitative predictions.

minor comments (2)

[Theoretical framework] Provide a short concrete example of how a cyclic translation acts on a sample adversarial prompt early in the theoretical framework section to improve intuition for readers unfamiliar with group actions.
[Empirical evaluation] Define the precise operational meaning of 'miss ratio' and the distance metric used in the correlation analysis before presenting the empirical figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and limitations of our framework. We address each major comment below, indicating revisions where appropriate to better connect the theoretical model to both assumptions and empirical validation.

Referee: In the regime analysis (the derivation of constant miss ratio for linearly many rounds below the critical generalization threshold), the cyclic translation group is introduced to capture symmetries in adversarial prompts, yet no derivation or mapping is supplied that connects this group action to neural-network parameter updates, loss landscapes, or the propagation of robustness during fine-tuning. This link is load-bearing for the claim that protocols ignoring trainer adaptation cannot distinguish genuine fixes from memorized patches.

Authors: We agree that the manuscript introduces the cyclic translation group as an abstract model for symmetries in adversarial prompts and data augmentation without deriving it directly from neural-network parameter updates or loss landscapes. The framework is designed as a high-level game-theoretic abstraction to illustrate qualitative regimes of evaluator-trainer interaction rather than a first-principles derivation from fine-tuning dynamics. In the revised manuscript, we will add a new subsection under the theoretical framework that explicitly discusses the modeling assumptions, including how group actions represent effective invariances induced by fine-tuning without requiring an explicit mapping to parameter space. This will strengthen the justification for why ignoring trainer adaptation risks conflating memorized patches with genuine robustness. revision: yes
Referee: In the empirical evaluation, refusal rates are reported to be highly correlated with distance to fine-tuning prompts for Llama, Qwen and Mistral. However, the experiments do not test whether the observed locality reproduces the specific miss-ratio trajectories or the constant-miss-ratio regime predicted by the cyclic-group model; without this check the empirical support does not yet validate the framework's quantitative predictions.

Authors: The empirical results establish locality of generalization as a key supporting observation for the regime analysis, but we concur that they do not directly validate the quantitative predictions such as constant miss-ratio plateaus over multiple rounds. In the revision, we will incorporate additional post-hoc analysis of the existing refusal-rate data (or new targeted experiments if feasible) to examine whether the observed distance-dependent patterns align with the miss-ratio trajectories derived from the cyclic-group model, for example by simulating round-by-round evaluator queries under the reported locality. revision: yes

Circularity Check

0 steps flagged

Group action modeling of prompt orbits introduces independent framework without reducing to fitted inputs or self-referential definitions

The paper defines a two-player game between evaluator and trainer, adopts group actions (cyclic translations on the circle) as a formal representation of data augmentation symmetries, and derives miss-ratio regimes mathematically from the model's generalization threshold parameter. These derivations follow directly from the stated assumptions rather than reducing to prior fitted values or external results by construction. Empirical locality observations on Llama/Qwen/Mistral models are presented as separate supporting evidence, not as the source of the claimed regimes. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain; the recasting of benchmarks as orbits is an explicit modeling choice within the new framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of group actions to model prompt transformations and the observed locality of fine-tuning effects. No free parameters, new physical entities, or ad-hoc inventions are apparent from the abstract description.

axioms (1)

Domain assumption: Group actions can formally represent data augmentation and symmetries in adversarial prompt transformations.
Presented as a key feature of the approach to capture transformations in the game-theoretic model.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective, embed_add · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "a benchmark is not a static set of prompts but an orbit under the evaluator's group action"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "circle with cyclic translation groups ... ε* = gcd(p,q)/(pq)"
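
The quoted expression ε* = gcd(p,q)/(pq) can be evaluated exactly with rational arithmetic. The sketch below also checks one possible reading of it. The source passage is elided and does not define p and q, so the interpretation is an assumption: p and q are taken to be the orders of two cyclic translation groups on the unit circle, and ε* is compared with the smallest nonzero circular distance between points of the two orbits of 0. All function names are illustrative.

```python
from fractions import Fraction
from math import gcd


def critical_threshold(p: int, q: int) -> Fraction:
    """The quoted expression gcd(p, q) / (p * q), kept as an exact fraction."""
    return Fraction(gcd(p, q), p * q)


def circle_distance(a: Fraction, b: Fraction) -> Fraction:
    """Shortest distance between two points on the circle R/Z."""
    d = (a - b) % 1
    return min(d, 1 - d)


def smallest_orbit_gap(p: int, q: int) -> Fraction:
    """Smallest nonzero distance between the orbits {i/p} and {j/q} (assumed reading)."""
    gaps = {
        circle_distance(Fraction(i, p), Fraction(j, q))
        for i in range(p)
        for j in range(q)
    }
    gaps.discard(Fraction(0))
    return min(gaps)
```

Under this reading the two quantities agree for the small cases tried here, for instance p = 4 and q = 6 both give 1/12. That is consistent with the formula but does not confirm that this is the meaning the paper intends.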

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[60]

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transfer- able adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. A Theory Appendix A.1 Circle phase diagram — proofs This section proves the entries of Table 1. We use the notation and assumptions stated in the main body of the article....

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Embed every prompt of the WILDJAILBREAK [29] adversarial-harmful split under the target’s base model with the chosen embedding tag (Appendix C.5)

[62]

Compute the $k = 50$-NN cosine radius $\rho_{50}(p) \in [0, 2]$ of each prompt $p$ in that embedding space

[63]

Partition the corpus into $K = 8$ clusters by $k$-means on $\ell_2$-normalized embeddings

[64]

Pick the densest cluster representative per cluster (smallest $\rho_{50}$); 8 representatives per (target, metric)

[65]

Among the 5 per-target metrics, keep the two metrics with the smallest mean $\rho_{50}$ across their cluster reps. (Practical outcome: spectral_first is the tightest metric in all three target embedding spaces; the per-target second pick varies.) This yields $3 \times 2 \times 8 = 48$ candidate prompts (with one duplicate across (target, metric) pairs), of which 47 are unique corpus indices....

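The selection steps above (embed, compute the 50-NN cosine radius, cluster, pick the densest point per cluster) can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the "k-NN cosine radius" is the cosine distance $1 - \cos$ to the $k$-th nearest neighbour, it uses a plain NumPy Lloyd iteration in place of whatever $k$-means implementation the paper used, and the function names `knn_cosine_radius`, `kmeans_labels`, and `pick_representatives` are hypothetical.

```python
import numpy as np


def knn_cosine_radius(emb, k=50):
    """Cosine distance (1 - cos, in [0, 2]) from each row to its k-th nearest neighbour.

    Assumption: this is what the text calls the k-NN cosine radius rho_k(p).
    """
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.partition(dist, k - 1, axis=1)[:, k - 1]


def kmeans_labels(unit, n_clusters=8, n_iter=50, seed=0):
    """Plain Lloyd iterations on l2-normalized rows (illustrative stand-in for k-means)."""
    rng = np.random.default_rng(seed)
    centers = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        d2 = ((unit[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = unit[labels == c]
            if len(members):  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels


def pick_representatives(emb, k=50, n_clusters=8, seed=0):
    """Return one index per non-empty cluster: the member with the smallest radius."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    radius = knn_cosine_radius(emb, k)
    labels = kmeans_labels(unit, n_clusters, seed=seed)
    reps = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        reps.append(int(idx[radius[idx].argmin()]))
    return reps, radius, labels
```

The dense pairwise distance matrix keeps the sketch short but scales quadratically in corpus size; a real corpus would likely need a blocked or approximate nearest-neighbour search.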
[66]

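Thin SVDs $X_c = U_X \Sigma_X V_X^\top$ and $Y_c = U_Y \Sigma_Y V_Y^\top$, with $U_X, U_Y \in \mathbb{R}^{n \times n}$, $\Sigma_X, \Sigma_Y \in \mathbb{R}^{n \times n}$, and $V_X^\top, V_Y^\top \in \mathbb{R}^{n \times d}$
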
[67]

The cross-covariance becomes $M = V_X K V_Y^\top$ with $K = \Sigma_X (U_X^\top U_Y) \Sigma_Y \in \mathbb{R}^{n \times n}$

[68]

SVD $K = U_K \Sigma_K V_K^\top$ (all factors $n \times n$)

[69]

The rank-$n$ part of $R$ is then $(V_X U_K)(V_K^\top V_Y^\top)$, and its trace and squared-trace, taken cyclically, reduce to $\mathrm{tr}(A)$ and $\mathrm{tr}(A^2)$ where $A = V_K^\top (V_Y^\top V_X) U_K \in \mathbb{R}^{n \times n}$. Reading: $\mathrm{tr}(A^2)/n$ measures how much of the active subspace acts as a true order-2 reflection (eigenvalues $\pm 1$, contributing $+1$ each) versus a non-trivial rotation (eigenvalues $e^{\pm i\theta}$, $\theta \notin \{0, \pi\}$, contributing cos...

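The SVD reduction in the steps above can be checked numerically with a short NumPy sketch. It is an illustration under stated assumptions rather than the paper's implementation: the inputs `Xc` and `Yc` are taken to be already-prepared $n \times d$ matrices with $n \le d$, the function name `rank_n_part` is hypothetical, and the full $d \times d$ matrix is built only to confirm that the cyclic trace identities hold.

```python
import numpy as np


def rank_n_part(Xc, Yc):
    """Return (A, R_n) following the thin-SVD steps in the text.

    Assumptions: Xc and Yc are (n, d) arrays with n <= d, prepared upstream.
    R_n = (V_X U_K)(V_K^T V_Y^T) is the rank-n part; A is its n x n reduction.
    """
    # Thin SVDs: U is (n, n), singular values have length n, Vt is (n, d).
    Ux, sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy, sy, Vyt = np.linalg.svd(Yc, full_matrices=False)

    # K = Sigma_X (U_X^T U_Y) Sigma_Y, so that Xc^T Yc = V_X K V_Y^T.
    K = (sx[:, None] * (Ux.T @ Uy)) * sy[None, :]

    # SVD of the small n x n core.
    Uk, _, Vkt = np.linalg.svd(K)

    # A = V_K^T (V_Y^T V_X) U_K, an n x n matrix.
    A = Vkt @ (Vyt @ Vxt.T) @ Uk

    # Rank-n part in the ambient d-dimensional space (for verification only).
    R_n = (Vxt.T @ Uk) @ (Vkt @ Vyt)
    return A, R_n


def trace_diagnostics(A):
    """Return tr(A)/n and tr(A^2)/n for the n x n reduction A."""
    n = A.shape[0]
    return np.trace(A) / n, np.trace(A @ A) / n
```

Working with `A` instead of the $d \times d$ product keeps the cost at $O(n^2 d)$ for the projections plus $O(n^3)$ for the small SVD, which matters when $d$ is a model's hidden width and $n$ is small.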
[70]

Inverse coherence (Table 8) is robustly clean on the early-layer pooled tags (mp_first: ≥0.89 for all 7 models; mp_all: ≥0.81, with 6/7 above 0.95). The single sharpest collapse is OLMo’s drop from ≥0.89 at mp_first to 0.01 at sp_last; we do not see a monotone trend in late-layer breakdown across the panel (gpt2 has cos = −0.41 at last_token despite no instruct train...

[71]

Composition-law improvement (Table 9) is positive (>1) in 46/49 cells and substantial (≥2.0) in 30/49. The operator-class prediction beats the additive prediction in the row-mean sense at every tag, with no clean per-model regularity; e.g., Mistral consistently shows the smallest improvement factor (1.6–2.0×) despite its $|\alpha_{L-1}|$ on the principal composites bei...

[72]

Involution diagnostic (Table 10) is concentrated: 48/49 cells lie in [0.494, 0.559] (median 0.52); the single outlier is mistral/sp_first (0.413). We have not benchmarked $\mathrm{tr}(R^2)/n$ against the null distribution induced by a Haar-random orthogonal matrix restricted to a random $n$-subspace of $\mathbb{R}^d$; we therefore present the "∼50/50 split" reading as suggestive rather t...

