The Evaluation Game: Beyond Static LLM Benchmarking
Pith reviewed 2026-05-20 07:52 UTC · model grok-4.3
The pith
A benchmark for LLM jailbreaks is an orbit under the evaluator's group action rather than a static set of prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that adversarial evaluation of large language models should be formalized as a two-player game between trainer and evaluator, with data augmentation captured through group actions on prompt symmetries. In the simplest nontrivial case of the circle with cyclic translation groups, different trainer generalization regimes produce distinct long-term behaviors, including a constant miss ratio over linearly many rounds when generalization stays below a critical threshold. Experiments across three model families (Llama, Qwen and Mistral) further indicate that fine-tuning on adversarial prompts yields refusal rates that are highly correlated with prompt distance to the training set, implying only local generalization.
What carries the argument
The orbit of prompts under the evaluator's group action, which turns any benchmark into the full set of symmetry-transformed versions rather than a fixed collection of examples.
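To make the orbit idea concrete, here is a minimal sketch, assuming the circle is modeled as the interval [0, 1) with wrap-around and the evaluator's group is the cyclic group Z_n acting by translations of k/n. The function names (`translate`, `orbit`, `benchmark_orbit`) and the two seed points are hypothetical, chosen for illustration and not taken from the paper.

```python
from fractions import Fraction

def translate(x, k, n):
    """Act on a circle point x in [0, 1) by the k-th element of Z_n: x -> x + k/n (mod 1)."""
    return (x + Fraction(k, n)) % 1

def orbit(x, n):
    """Return the orbit of x under the cyclic translation group Z_n, sorted."""
    return sorted({translate(x, k, n) for k in range(n)})

def benchmark_orbit(prompts, n):
    """Expand a static benchmark (a set of circle points) into the union of its orbits."""
    expanded = set()
    for x in prompts:
        expanded.update(orbit(x, n))
    return sorted(expanded)

# Hypothetical static benchmark: two "prompts" placed on the circle.
static_benchmark = [Fraction(0), Fraction(1, 8)]
full = benchmark_orbit(static_benchmark, 4)
print(len(static_benchmark), "seed points ->", len(full), "points in the orbit")
```

Exact rationals are used so that membership in an orbit is not blurred by floating-point rounding. Any point of an orbit generates the same orbit, which is why the orbit, rather than a particular seed prompt, is the natural object to call "the benchmark" in this picture.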
If this is right
- Audit protocols that ignore trainer-side adaptation cannot distinguish a genuine robustness fix from a memorized patch on specific prompts.
- In the cyclic translation setting, generalization range determines whether the miss ratio stays constant, drops, or exhibits other long-term patterns.
- Fine-tuning induces only local generalization, with refusal performance dropping as prompt distance increases.
- A benchmark must be treated as the full orbit under the group action to capture all equivalent adversarial instances.
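The difference between a memorized patch and a patch that generalizes can be illustrated with a toy simulation. This is a sketch under strong assumptions and not the paper's game: the evaluator walks the Z_n orbit of a seed point, every unpatched query counts as a found jailbreak, and the trainer then patches all points within a hypothetical generalization range `eps` of that query. The quantity reported here, `hit fraction`, is an illustrative stand-in and is not claimed to match the paper's definition of miss ratio.

```python
from fractions import Fraction

def circle_distance(a, b):
    """Shortest arc length between two points of the circle [0, 1)."""
    d = abs(a - b) % 1
    return min(d, 1 - d)

def play(n, eps, seed=Fraction(0)):
    """Toy game: the evaluator queries the Z_n orbit of `seed` in order; the trainer
    patches every point within distance `eps` of a query that found a jailbreak.
    Returns the fraction of evaluator queries that still found a jailbreak."""
    patched_centers = []
    hits = 0
    for k in range(n):
        query = (seed + Fraction(k, n)) % 1
        covered = any(circle_distance(query, c) <= eps for c in patched_centers)
        if not covered:
            hits += 1                      # jailbreak found on an unpatched point
            patched_centers.append(query)  # trainer patches around the seen prompt
    return Fraction(hits, n)

n = 12
for eps in (Fraction(1, 24), Fraction(1, 12), Fraction(1, 6)):
    print(f"eps={eps}: hit fraction {play(n, eps)}")
```

In this toy, a generalization range smaller than the orbit spacing 1/n leaves every translated query unpatched, while larger ranges cover neighboring orbit points and reduce the hit fraction. A static benchmark that only re-tests the already patched seed would report a fix in all three cases.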
Where Pith is reading between the lines
- Evaluation procedures could be made iterative by repeatedly sampling new test cases from the current orbit after each trainer update.
- Distance metrics in prompt embedding space might serve as practical proxies for predicting the extent of local robustness after fine-tuning.
- The same orbit perspective could apply to other transformation families beyond cyclic groups, such as semantic paraphrases or style shifts.
Load-bearing premise
Symmetries and data augmentations in adversarial prompts can be captured by group actions in a way that faithfully models how trainers adapt and generalize from seen examples.
What would settle it
Measure whether the evaluator's miss ratio remains constant across a linear number of rounds when the trainer's generalization range is set below the critical threshold identified for the cyclic group, or test whether refusal rates on held-out prompts correlate with their distance to fine-tuning examples in additional model families.
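The second test, correlating refusal with distance to the fine-tuning prompts, could be run along the following lines. This is a minimal sketch assuming a rank (Spearman) correlation is an acceptable statistic; the paper's actual distance metric and correlation measure are not specified in this summary. The `distances` and `refused` values are synthetic placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Synthetic placeholder data (NOT from the paper): distance of each held-out
# prompt to its nearest fine-tuning prompt, and whether the model refused (1) or not (0).
distances = [0.05, 0.10, 0.20, 0.35, 0.50, 0.80]
refused = [1, 1, 1, 0, 1, 0]
rho = spearman(distances, refused)
print(f"Spearman correlation between distance and refusal: {rho:.3f}")
```

A clearly negative coefficient on held-out prompts would be consistent with the locality claim; a coefficient near zero in a new model family would count against it.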
Original abstract
As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator's group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a game-theoretic framework modeling the interaction between an LLM evaluator auditing for jailbreaks and a trainer using fine-tuning as defense. It represents data augmentation via group actions, with the circle and cyclic translations as the basic case, and derives regimes for the evaluator's miss ratio depending on the trainer's generalization threshold. Below a critical threshold the miss ratio remains constant over linearly many rounds; the framework recasts benchmarks as orbits under the evaluator's group action. Empirical results on Llama, Qwen and Mistral families show that fine-tuning induces only local generalization, with refusal rates correlated to distance from the fine-tuning prompts.
Significance. If the central modeling assumptions hold, the work could meaningfully shift adversarial evaluation practice by requiring audit protocols to account for trainer-side adaptation rather than treating benchmarks as static. The reported locality results across three model families constitute concrete, falsifiable evidence that is a strength of the manuscript. The group-action formalism itself is a novel formal device for this domain, but its utility hinges on establishing a tighter link to actual fine-tuning dynamics.
major comments (2)
- [Regime analysis / generalization regimes] In the regime analysis (the derivation of constant miss ratio for linearly many rounds below the critical generalization threshold), the cyclic translation group is introduced to capture symmetries in adversarial prompts, yet no derivation or mapping is supplied that connects this group action to neural-network parameter updates, loss landscapes, or the propagation of robustness during fine-tuning. This link is load-bearing for the claim that protocols ignoring trainer adaptation cannot distinguish genuine fixes from memorized patches.
- [Empirical evaluation] In the empirical evaluation, refusal rates are reported to be highly correlated with distance to fine-tuning prompts for Llama, Qwen and Mistral. However, the experiments do not test whether the observed locality reproduces the specific miss-ratio trajectories or the constant-miss-ratio regime predicted by the cyclic-group model; without this check the empirical support does not yet validate the framework's quantitative predictions.
minor comments (2)
- [Theoretical framework] Provide a short concrete example of how a cyclic translation acts on a sample adversarial prompt early in the theoretical framework section to improve intuition for readers unfamiliar with group actions.
- [Empirical evaluation] Define the precise operational meaning of 'miss ratio' and the distance metric used in the correlation analysis before presenting the empirical figures.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope and limitations of our framework. We address each major comment below, indicating revisions where appropriate to better connect the theoretical model to both assumptions and empirical validation.
- Referee: In the regime analysis (the derivation of constant miss ratio for linearly many rounds below the critical generalization threshold), the cyclic translation group is introduced to capture symmetries in adversarial prompts, yet no derivation or mapping is supplied that connects this group action to neural-network parameter updates, loss landscapes, or the propagation of robustness during fine-tuning. This link is load-bearing for the claim that protocols ignoring trainer adaptation cannot distinguish genuine fixes from memorized patches.
Authors: We agree that the manuscript introduces the cyclic translation group as an abstract model for symmetries in adversarial prompts and data augmentation without deriving it directly from neural-network parameter updates or loss landscapes. The framework is designed as a high-level game-theoretic abstraction to illustrate qualitative regimes of evaluator-trainer interaction rather than a first-principles derivation from fine-tuning dynamics. In the revised manuscript, we will add a new subsection under the theoretical framework that explicitly discusses the modeling assumptions, including how group actions represent effective invariances induced by fine-tuning without requiring an explicit mapping to parameter space. This will strengthen the justification for why ignoring trainer adaptation risks conflating memorized patches with genuine robustness.
revision: yes
- Referee: In the empirical evaluation, refusal rates are reported to be highly correlated with distance to fine-tuning prompts for Llama, Qwen and Mistral. However, the experiments do not test whether the observed locality reproduces the specific miss-ratio trajectories or the constant-miss-ratio regime predicted by the cyclic-group model; without this check the empirical support does not yet validate the framework's quantitative predictions.
Authors: The empirical results establish locality of generalization as a key supporting observation for the regime analysis, but we concur that they do not directly validate the quantitative predictions such as constant miss-ratio plateaus over multiple rounds. In the revision, we will incorporate additional post-hoc analysis of the existing refusal-rate data (or new targeted experiments if feasible) to examine whether the observed distance-dependent patterns align with the miss-ratio trajectories derived from the cyclic-group model, for example by simulating round-by-round evaluator queries under the reported locality.
revision: yes
Circularity Check
Group-action modeling of prompt orbits introduces an independent framework that does not reduce to fitted inputs or self-referential definitions.
The paper defines a two-player game between evaluator and trainer, adopts group actions (cyclic translations on the circle) as a formal representation of data augmentation symmetries, and derives miss-ratio regimes mathematically from the model's generalization threshold parameter. These derivations follow directly from the stated assumptions rather than reducing to prior fitted values or external results by construction. Empirical locality observations on Llama/Qwen/Mistral models are presented as separate supporting evidence, not as the source of the claimed regimes. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain; the recasting of benchmarks as orbits is an explicit modeling choice within the new framework.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Group actions can formally represent data augmentation and symmetries in adversarial prompt transformations.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: embed_injective, embed_add (tag: echoes)
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "a benchmark is not a static set of prompts but an orbit under the evaluator's group action"
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tag: unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
Passage: "circle with cyclic translation groups ... ε* = gcd(p,q)/(pq)"
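The quoted expression ε* = gcd(p,q)/(pq) equals 1/lcm(p,q), which is the spacing of the points reachable on the circle by combining translations of 1/p and 1/q. The sketch below checks that arithmetic identity numerically. Reading p and q as the orders of two cyclic translation groups is an assumption, since the quoted passage is elided; the helper names `epsilon_star` and `min_gap` are hypothetical.

```python
from fractions import Fraction
from math import gcd

def epsilon_star(p, q):
    """The quoted expression gcd(p, q) / (p * q), as an exact rational."""
    return Fraction(gcd(p, q), p * q)

def min_gap(p, q):
    """Smallest gap between distinct points reachable from 0 on the circle [0, 1)
    by combining steps of 1/p and 1/q (assumed reading of p and q)."""
    points = sorted({(Fraction(a, p) + Fraction(b, q)) % 1
                     for a in range(p) for b in range(q)})
    gaps = [hi - lo for lo, hi in zip(points, points[1:])]
    gaps.append(1 - points[-1] + points[0])  # wrap-around gap
    return min(gaps)

p, q = 4, 6
print(epsilon_star(p, q), min_gap(p, q))
```

For p = 4 and q = 6, gcd(p,q) = 2 and pq = 24, so the expression evaluates to 1/12, matching the 12 equally spaced points generated by quarter-turn and sixth-turn translations together.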
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- [3] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. Van Der Wal. Pythia: A suite for analyzing large language models across training and scaling. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings ... (2023)
- [4] A. Blum and M. Hardt. The ladder: A reliable leaderboard for machine learning competitions. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of PMLR, pages 1006–1014, 2015. URL https://arxiv.org/abs/1502.04585
- [5] M. Bunge. A general black box theory. Philosophy of Science, 30(4):346–358, 1963.
- [6] S. Chen, E. Dobriban, and J. H. Lee. A group-theoretic framework for data augmentation. Journal of Machine Learning Research, 21(245):1–71, 2020.
- [7] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- [8] J. Chu, Y. Liu, Z. Yang, X. Shen, M. Backes, and Y. Zhang. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21538–21566, 2025.
- [9]
- [10] T. Cohen and M. Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999. PMLR, 2016.
- [11] T. Cui, Y. Mao, P. Liu, C. Liu, and D. You. Exploring jailbreak attacks on llms through intent concealment and diversion. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20754–20768, 2025.
- [12] J. Dong, A. Roth, Z. Schutzman, B. Waggoner, and Z. S. Wu. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation (EC), 2018. URL https://arxiv.org/abs/1710.07887
- [13] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 117–126, 2015. doi: 10.1145/2746539.2746580. URL https://arxiv.org/abs/1411.2664
- [14] European Commission. Proposal for a regulation of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts. https://eur-lex.europa.eu/legal-content/EN/TXT/?qid=1623335154975&uri=CELEX%3A52021PC0206, 2021.
- [15] J. Garcia Bourrée, A. Godinot, M. De Vos, M. Vujasinovic, S. Biswas, G. Tredan, E. Le Merrer, and A.-M. Kermarrec. Robust ML auditing using prior knowledge. Forty-second International Conference on Machine Learning, 2025.
- [16] S. Ge, C. Zhou, R. Hou, M. Khabsa, Y.-C. Wang, Q. Wang, J. Han, and Y. Mao. MART: Improving LLM safety with multi-round automatic red-teaming. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2024. URL https://aclanthology.org/2024.naacl-long.107
- [17] A. Godinot, E. Le Merrer, G. Trédan, C. Penzo, and F. Taïani. Change-relaxed active fairness auditing. In RJCIA 2023-21e Rencontres des Jeunes Chercheurs en Intelligence Artificielle, pages 91–96, 2023.
- [18] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
- [19] A. Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv.org/abs/2407.21783
- [20] P. Guerra-Balboa, A. Sauer, H. H. Arcolezi, and T. Strufe. Understanding disclosure risk in differential privacy with applications to noise calibration and auditing (extended version). arXiv preprint arXiv:2603.12142, 2026.
- [21] M. Hardt, N. Megiddo, C. Papadimitriou, and M. Wootters. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science (ITCS), pages 111–122, 2016. doi: 10.1145/2840728.2840730. URL https://arxiv.org/abs/1506.06980
- [22] D. Hartmann, L. Pohlmann, L. Hanslik, N. Gießing, B. Berendt, and P. Delobelle. Audit me if you can: Query-efficient active fairness auditing of black-box llms. arXiv preprint arXiv:2601.03087, 2026.
- [23] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
- [24] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- [25] L. Hsiung, T. Pang, Y.-C. Tang, L. Song, T.-Y. Ho, P.-Y. Chen, and Y. Yang. Why LLM safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026. URL https://arxiv.org/abs/2506.05346
- [26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
- [27]
- [29] URL https://arxiv.org/abs/2310.06825
- [30] L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, and N. Dziri. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [31] V. Lafargue, A. L. Monteiro, E. Claeys, L. Risser, and J.-M. Loubes. Exposing the illusion of fairness: Auditing vulnerabilities to distributional manipulation attacks. arXiv preprint arXiv:2507.20708, 2025.
- [32]
- [33]
- [34] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
- [35]
- [36]
- [37]
- [38]
- [39] Introduces the WikiText-2 and WikiText-103 language modelling datasets
- [40] J. Mouton and B. Rottembourg. Auditing the Ranking Strategy of a Marketplace’s Algorithm in the Frame of Competition Law Commitments with Surrogate Models: The Amazon’s Buy Box Case. GREDEG, 2024.
- [41]
- [42] A. Panfilov, P. Romov, I. Shilov, Y.-A. de Montjoye, J. Geiping, and M. Andriushchenko. Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for llms. arXiv preprint arXiv:2603.24511, 2026.
- [43] P. Peigné-Lefebvre, Q. Feuillade-Montixi, T. David, and N. Miailhe. LLM robustness leaderboard v1 – technical report, 2025. arXiv preprint; PRISM Eval.
- [44]
- [45]
- [46] X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations (ICLR), 2024.
- [47] X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson. Safety alignment should be made more than just a few tokens deep. In International Conference on Learning Representations (ICLR), 2025.
- [48] Qwen Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang... (2024)
- [49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- [50] A. Shahin Shamsabadi, M. Yaghini, N. Dullerud, S. Wyllie, U. Aïvodji, A. Alaagib, S. Gambs, and N. Papernot. Washing the unwashable: On the (im)possibility of fairwashing detection. Advances in Neural Information Processing Systems, 35:14170–14182, 2022.
- [51] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. doi: 10.1561/2200000018
- [52] X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024.
- [53] A. Shirali, R. Abebe, and M. Hardt. A theory of dynamic benchmarks. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.03165
- [54] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
- [55] W. Tang, Y. Zhou, E. Xu, K. Cheng, M. Li, and L. Xiao. Dsgbench: A diverse strategic game benchmark for evaluating llm-based agents in complex decision-making environments. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 16987–16991. IEEE, 2026.
- [56] Team Olmo. Olmo 3. arXiv preprint arXiv:2512.13961, 2025. doi: 10.48550/arXiv.2512.13961. URL https://arxiv.org/abs/2512.13961
- [57] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2153–2162, 2019.
- [58] A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does llm safety training fail? Advances in neural information processing systems, 36:80079–80110, 2023.
- [59]
- [60] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
A Theory Appendix
A.1 Circle phase diagram: proofs
This section proves the entries of Table 1. We use the notation and assumptions stated in the main body of the article....
- [61] Embed every prompt of the WILDJAILBREAK [29] adversarial-harmful split under the target’s base model with the chosen embedding tag (Appendix C.5).
- [62] Compute the $k=50$-NN cosine radius $\rho_{50}(p) \in [0, 2]$ of each prompt $p$ in that embedding space.
- [63] Partition the corpus into $K=8$ clusters by $k$-means on $\ell_2$-normalized embeddings.
- [64] Pick the densest cluster representative per cluster (smallest $\rho_{50}$); 8 representatives per (target, metric).
- [65] Among the per-target 5 metrics, keep the two metrics with the smallest mean $\rho_{50}$ across their cluster reps. (Practical outcome: spectral_first is the tightest metric in all three target embedding spaces; the per-target second pick varies.) This yields $3 \times 2 \times 8 = 48$ candidate prompts (with one duplicate across (target, metric) pairs), of which 47 are unique corpus indices....
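The density-based selection in the steps above can be sketched in a few lines. This is a minimal illustration, assuming the "k-NN cosine radius" means the cosine distance (1 minus cosine similarity, hence in [0, 2]) to the k-th nearest neighbor; that reading is an assumption. Cluster labels are passed in directly rather than computed by k-means, and the six 2-D "embeddings" are hypothetical toy data with k = 2 instead of the k = 50 used in the text.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, which lies in [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def knn_radius(embeddings, k):
    """Cosine distance from each point to its k-th nearest neighbor (self excluded)."""
    radii = []
    for i, u in enumerate(embeddings):
        dists = sorted(cosine_distance(u, v)
                       for j, v in enumerate(embeddings) if j != i)
        radii.append(dists[k - 1])
    return radii

def densest_representatives(radii, labels):
    """For each cluster label, the index of the point with the smallest radius."""
    best = {}
    for idx, (r, c) in enumerate(zip(radii, labels)):
        if c not in best or r < radii[best[c]]:
            best[c] = idx
    return best

# Hypothetical toy embeddings: unit vectors at the given angles, two clusters.
angles_deg = [0, 2, 10, 90, 95, 130]
embeddings = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
              for a in angles_deg]
labels = [0, 0, 0, 1, 1, 1]  # assumed output of a clustering step
radii = knn_radius(embeddings, k=2)
reps = densest_representatives(radii, labels)
print(reps)

# Candidate count from the text: 3 targets x 2 kept metrics x 8 clusters.
candidates = 3 * 2 * 8
unique_after_one_duplicate = candidates - 1
```

A smaller radius means the prompt sits in a denser region of the embedding space, so the representative chosen per cluster is the prompt with the most close neighbors. The brute-force neighbor search is quadratic in the corpus size and is only meant to show the definition.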
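- [66] Thin SVDs $X_c = U_X \Sigma_X V_X^\top$ and $Y_c = U_Y \Sigma_Y V_Y^\top$, with $U_X, U_Y \in \mathbb{R}^{n \times n}$, $\Sigma_X, \Sigma_Y \in \mathbb{R}^{n \times n}$, and $V_X^\top, V_Y^\top \in \mathbb{R}^{n \times d}$.
- [67] The cross-covariance becomes $M = V_X K V_Y^\top$ with $K = \Sigma_X (U_X^\top U_Y) \Sigma_Y \in \mathbb{R}^{n \times n}$.
- [68] SVD $K = U_K \Sigma_K V_K^\top$ (all factors $n \times n$).
- [69] The rank-$n$ part of $R$ is then $(V_X U_K)(V_K^\top V_Y^\top)$, and its trace and squared-trace, taken cyclically, reduce to $\mathrm{tr}(A)$ and $\mathrm{tr}(A^2)$ where $A = V_K^\top (V_Y^\top V_X) U_K \in \mathbb{R}^{n \times n}$. Reading: $\mathrm{tr}(A^2)/n$ measures how much of the active subspace acts as a true order-2 reflection (eigenvalues $\pm 1$, contributing $+1$ each) versus a non-trivial rotation (eigenvalues $e^{\pm i\theta}$, $\theta \notin \{0, \pi\}$, contributing cos...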
- [70] Inverse coherence (Table 8) is robustly clean on the early-layer pooled tags (mp_first: ≥0.89 for all 7 models; mp_all: ≥0.81, with 6/7 above 0.95). The single sharpest collapse is OLMo’s drop from ≥0.89 at mp_first to 0.01 at sp_last; we do not see a monotone trend in late-layer breakdown across the panel (gpt2 has cos=−0.41 at last_token despite no instruct train...
- [71] Composition-law improvement (Table 9) is positive (>1) in 46/49 cells and substantial (≥2.0) in 30/49. The operator-class prediction beats the additive prediction in the row-mean sense at every tag, with no clean per-model regularity; e.g. Mistral consistently shows the smallest improvement factor (1.6–2.0×) despite its ∣αL−1∣ on the principal composites bei...
- [72] Involution diagnostic (Table 10) is concentrated: 48/49 cells lie in [0.494, 0.559] (median 0.52); the single outlier is mistral/sp_first (0.413). We have not benchmarked $\mathrm{tr}(R^2)/n$ against the null distribution induced by a Haar-random orthogonal matrix restricted to a random $n$-subspace of $\mathbb{R}^d$; we therefore present the "∼50/50 split" reading as suggestive rather t...