Toward Open Weight Models Without Risks: Separating Public and Private Capabilities in LLMs

Arkil Patel; Charbel El Feghali; Nicholas Meade; Siva Reddy; Spandana Gella; Verna Dankers

arxiv: 2606.21638 · v1 · pith:WNE474TUnew · submitted 2026-06-19 · 💻 cs.CR · cs.CL

Charbel El Feghali, Arkil Patel, Nicholas Meade, Spandana Gella, Verna Dankers, Siva Reddy

Pith reviewed 2026-06-26 13:43 UTC · model grok-4.3

classification: cs.CR · cs.CL

keywords: tiered language models · open-weight LLMs · capability separation · secret key · parameter permutation · selective access control · model safety · private capabilities

The pith

A single set of released LLM weights can support multiple capability levels controlled by a compact secret key.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tiered Language Models that keep one set of weights while letting a secret key switch between a public configuration and a more capable private one. The key works by permuting a small subset of parameters to create a different computation path over the same weights. Training starts by jointly pretraining both versions from scratch, then fine-tunes only the keyed version on private data while regularizing to protect the public behavior. Experiments with 180M and 650M models show the keyed version can learn a new language, follow instructions, and hold private facts that the public version never exhibits. The approach also extends to multiple tiers and resists attempts to extract private capabilities through fine-tuning.

Core claim

Tiered Language Models are trained by jointly pretraining a public and keyed configuration from scratch, followed by regularized fine-tuning of the keyed configuration on private data. This produces weights where applying the key reveals new language proficiency, instruction following, and memorized private facts while the default public configuration shows none of these. The same weights can be extended to hierarchical tiers, and the mechanism resists fine-tuning extraction because control sits in the weight structure rather than input space.
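The joint pretraining stage can be pictured as a step schedule. The sketch below is a minimal illustration, not the authors' training loop: it assumes that every step updates the public configuration and that every `keyed_every`-th step additionally updates the keyed configuration, which is one reading of the keyed-update frequency f mentioned in the Figure 3 caption. The function names and the assumption that a keyed update costs as much as a public update are hypothetical.

```python
def pretraining_schedule(num_steps, keyed_every):
    """Yield (step, configs) pairs listing which configurations are updated.

    Assumption (not confirmed by the source): the public configuration is
    updated on every step, and the keyed one on every `keyed_every`-th step.
    """
    for step in range(1, num_steps + 1):
        configs = ["public"]
        if step % keyed_every == 0:
            configs.append("keyed")
        yield step, configs


def extra_compute_fraction(keyed_every):
    """Extra compute relative to public-only training.

    Assumes a keyed update costs the same as a public update.
    """
    return 1.0 / keyed_every


schedule = list(pretraining_schedule(num_steps=40, keyed_every=20))
keyed_steps = [step for step, configs in schedule if "keyed" in configs]
overhead = extra_compute_fraction(20)

print(keyed_steps)  # steps on which the keyed configuration is also trained
print(overhead)     # fraction of additional compute under the stated assumption
```

Under these assumptions, f = 20 gives an overhead of 1/20, which is at least consistent with the roughly 5% figure quoted for f = 20 in the Figure 3 caption.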

What carries the argument

A compact secret key that specifies a permutation over a small parameter subset, inducing an alternative computation graph over the shared weights.
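To make the mechanism concrete, here is a minimal sketch of applying a key as a list of column swaps over shared weights. The key format (pairs of `(layer, column)` slots) is an assumption loosely modeled on the sample entries quoted alongside the Figure 19 caption, such as `[[0, 1359], [1, 5026]]`; the toy weights, the `apply_key` helper, and the restriction to MLP columns are hypothetical and chosen only for illustration.

```python
from copy import deepcopy


def apply_key(mlp_columns, key):
    """Return a copy of the weights with each keyed pair of columns swapped.

    `mlp_columns` maps a layer index to a list of columns (lists of floats).
    `key` is a list of ((layer_a, col_a), (layer_b, col_b)) swaps. This format
    is an assumption for illustration, not the paper's specification.
    """
    keyed = deepcopy(mlp_columns)
    for (layer_a, col_a), (layer_b, col_b) in key:
        keyed[layer_a][col_a], keyed[layer_b][col_b] = (
            keyed[layer_b][col_b],
            keyed[layer_a][col_a],
        )
    return keyed


# Toy "released weights": 2 layers, 4 MLP columns each, 2 values per column.
public_weights = {
    0: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    1: [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6], [1.7, 1.8]],
}

# Toy key with two disjoint swaps; disjoint swaps make the key its own inverse.
toy_key = [((0, 1), (1, 3)), ((0, 2), (1, 0))]

keyed_weights = apply_key(public_weights, toy_key)  # keyed configuration
restored = apply_key(keyed_weights, toy_key)        # back to the public one

print(keyed_weights[0][1], keyed_weights[1][3])
print(restored == public_weights)
```

The released file is never modified in this sketch: the keyed configuration is a rearranged view of the same values, which is the sense in which one set of weights supports two computation paths.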

If this is right

The keyed configuration acquires a new language while the public configuration shows none.
Instruction-following ability appears only in the keyed configuration.
Private factual knowledge can be memorized in the keyed configuration without appearing in the public one.
The method extends to multiple hierarchical tiers of capability.
Authorization through weight permutation resists fine-tuning extraction and partial key compromise.
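The hierarchical-tier item above can be illustrated with a small sketch of cumulative keys. Following the Figure 10 caption, tier 1 applies key 1, tier 2 applies keys 1+2, tier 3 applies keys 1+2+3, and each training step pairs the public configuration with one keyed tier chosen round-robin. The helper names and the toy swap entries are hypothetical, and the swap format is the same assumed `(layer, column)` pairing used for illustration only.

```python
from itertools import cycle, islice


def cumulative_tier_keys(keys):
    """Build per-tier keys where tier i concatenates keys 1..i.

    Index 0 is the public configuration (empty key).
    """
    tiers = [[]]
    for key in keys:
        tiers.append(tiers[-1] + key)
    return tiers


def round_robin_steps(num_tiers, num_steps):
    """Pair the public configuration with one keyed tier per step, cycling."""
    keyed_tiers = cycle(range(1, num_tiers + 1))
    return [("public", f"tier_{t}") for t in islice(keyed_tiers, num_steps)]


# Toy keys: one hypothetical swap each.
key_1 = [((0, 0), (1, 1))]
key_2 = [((0, 2), (1, 3))]
key_3 = [((0, 4), (1, 5))]

tiers = cumulative_tier_keys([key_1, key_2, key_3])
steps = round_robin_steps(num_tiers=3, num_steps=5)

print([len(t) for t in tiers])  # number of swaps active in each tier
print(steps)
```

Nesting the keys this way means a holder of a higher tier necessarily holds every lower tier's swaps, which matches the "cumulative" wording in the caption.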

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Open-weight releases could support different regulatory or user tiers without separate models.
Deployment pipelines might gate advanced features by key distribution rather than model variants.
Testing could check whether partial key leakage still blocks most private capability extraction.
The approach may connect to broader questions of embedding access controls directly in model parameters.

Load-bearing premise

Joint pretraining plus regularization during keyed fine-tuning is enough to prevent the shared weights from leaking private capabilities into the public configuration.

What would settle it

Fine-tune the public configuration on data designed to elicit the private capabilities and check whether it acquires them or remains unable to do so.
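A test of this kind needs a scoring harness. The sketch below shows one possible shape for it, borrowing the 50/50 split and exact-match metric mentioned in the Figure 3 and Figure 5 captions. Everything concrete here is a placeholder: the facts are synthetic strings, and the two "models" are hard-coded stand-ins that simply encode the outcome the paper claims, so the printed numbers are properties of the stand-ins and not experimental results.

```python
import random


def exact_match_accuracy(generate, facts):
    """Fraction of (prompt, answer) pairs the model reproduces exactly."""
    hits = sum(1 for prompt, answer in facts if generate(prompt).strip() == answer)
    return hits / len(facts)


def split_half(facts, seed=0):
    """Shuffle deterministically and split into attacker and held-out halves."""
    shuffled = facts[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical synthetic facts (placeholders, not the paper's dataset).
facts = [(f"Person {i} was born in", f"City{i}") for i in range(400)]
attack_split, held_out = split_half(facts)

all_answers = dict(facts)
attacker_answers = dict(attack_split)


def keyed_model(prompt):
    # Stand-in for the keyed configuration: recalls every fact.
    return all_answers[prompt]


def attacked_public_model(prompt):
    # Stand-in for the public configuration after an extraction attack:
    # it knows only what the attacker fine-tuned on.
    return attacker_answers.get(prompt, "")


keyed_acc = exact_match_accuracy(keyed_model, held_out)
attack_train_acc = exact_match_accuracy(attacked_public_model, attack_split)
attack_held_out_acc = exact_match_accuracy(attacked_public_model, held_out)
print(keyed_acc, attack_train_acc, attack_held_out_acc)
```

The quantity of interest is the held-out score of the attacked public model: if it stays near zero while the training split is memorized, the attacker learned only what they supplied and extracted nothing stored behind the key.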

Figures

Figures reproduced from arXiv: 2606.21638 by Arkil Patel, Charbel El Feghali, Nicholas Meade, Siva Reddy, Spandana Gella, Verna Dankers.

**Figure 1.** Overview of Tiered Language Models. Top: The same released weights support a public configuration Cpub and a private configuration CK. Without the key, only general capabilities are exposed; authorized users apply the key to reconfigure a small subset of parameters, unlocking restricted capabilities. Bottom: Training pipeline. Pretraining runs next-token prediction using the public configuration, with ever…

**Figure 2.** Behavioral separation in TLMs. Left: Validation-loss trajectories during private finetuning of the 650M TLM. The keyed configuration CK improves on Spanish (private) data while remaining stable on English (public) data. The public configuration Cpub shows no improvement on Spanish, confirming that private knowledge does not leak. Right: Instruction-following win rate for the 650M TLM on AlpacaEval, judged…

**Figure 3.** Left: Memorization of synthetic facts. Exact-match accuracy during private fine-tuning of TLM-180M. CK reaches perfect recall of all 400 facts; Cpub remains at zero throughout. Right: Keyed-update frequency. Public-domain validation loss for TLM-180M pretrained with varying keyed-update frequency f. Cpub is unaffected by f; CK improves as f decreases, with diminishing returns beyond f=20 (∼5% additional FL…

**Figure 4.** Comparing public-domain validation loss during pretraining for TLM-180M against a non-tiered baseline.

5 Computational Cost of TLMs. The previous section established that TLMs achieve clean behavioral separation across three diverse tasks. We now consider practical considerations: the computational cost of tiered pretraining, the performance relative to standard pretraining, and the storage footprint of per…

**Figure 5.** Robustness to extraction attacks. Left: fine-tuning-based extraction. An attacker finetunes on 50% of the synthetic biographies through Cpub (no key) and is evaluated on the held-out 50%. Three starting checkpoints are compared: a non-TLM baseline, a TLM before private fine-tuning, and a TLM whose CK has memorized all 400 biographies. All three memorize the training split at comparable rates (solid), with…

**Figure 6.** We start from the 180M cumulative multi-tier pretrained model with three 5% keys. The …

**Figure 7.** We start from the 180M cumulative multi-tier pretrained model with three 5% keys. The …

**Figure 8.** Pretraining trajectories for the 180M TLM for all key-sizes. Left: public-configuration validation loss under Cpub. Right: keyed-configuration validation loss under CK. Across key sizes, both configurations continue improving during pretraining, with larger keys showing slightly higher validation loss.

A.3 Constructing the synthetic biography dataset. We construct a synthetic biography dataset containing 40…

**Figure 9.** Pretraining trajectories for the 650M TLM at two key-sizes. Left: public-configuration validation loss under Cpub. Right: keyed-configuration validation loss under CK. Both key sizes continue improving throughout pretraining, with the 5% key reaching slightly lower validation loss than the 10% key on both configurations, consistent with the 180M trend.

**Figure 10.** Cumulative multi-tier pretraining. A shared model is trained with one public configuration Cpub and three nested keyed configurations: CK1 applies key 1, CK2 applies keys 1+2, and CK3 applies keys 1+2+3. At each step, training uses Cpub and one round-robin keyed configuration.

… of target tokens that the greedy decode predicts correctly at the matching positions. Exact match is the strict criterion used in…

**Figure 11.** KL and key-size sweeps during private fine-tuning of a 180M model on 2B tokens of FineWeb2 Spanish. Weaker KL regularization lets CK adapt more strongly to the private distribution (a) at the cost of greater drift from previously learned public behavior (b); darker curves correspond to larger KL weights. Larger key fractions yield lower private validation loss (c); darker curves correspond to larger key s…

**Figure 12.** Left: Permuting weights of a pretrained model destroys capabilities. We apply random parameter permutations to Qwen-3-8B, allocating 25% of the swap budget to attention heads and 75% to MLP columns, and evaluate MMLU accuracy. Right: Tiered pretraining is necessary for the keyed configuration to function. Public-domain validation loss during private fine-tuning on 2B Spanish tokens, comparing a TLM and a …

**Figure 13.** Left: Effect of keyed-update frequency on fine-tuning. Each 180M TLM from …

**Figure 14.** Weight-magnitude signatures of tier parameters after private fine-tuning. Each cell shows …

**Figure 15.** LoRA and TLM training comparison. A 1% bf16 LoRA adapter closely matches the private-domain validation loss of the 5% keyed TLM during Spanish fine-tuning, making it a comparable baseline …

**Figure 16.** Left: Portuguese fine-tuning. Validation-loss trajectories for the 650M TLM fine-tuned on Portuguese private data. Right: Instruction fine-tuning. Validation-loss trajectories for the 650M TLM fine-tuned on Alpaca.

… a compact block-level index map. The attention and MLP kernels would then use this map to read the appropriate head or MLP blocks directly, rather than first rewriting the weight tensors. This …

**Figure 17.** Behavioral separation under interleaved fine-tuning. Both panels follow the setup of …

**Figure 18.** Qualitative example of generation from TLM-180M after tiered pretraining showing that the public tier Cpub and keyed tier CK produce similar, contextually appropriate continuations on the same public tier prompt for the 180M model.

Spanish Example
Prompt: Jungle Adventure Park y el Museo Nicolis Nav
Public output (Cpub): arra, and the new “The Jungle”. [...]
Keyed output (CK): arro, que se encuentra en la ciudad…

**Figure 19.** Qualitative example of generation from TLM-180M fine-tuned on Spanish. The public tier Cpub reverts to English, while the keyed tier CK maintains an (imperfect) Spanish continuation.

C.8 Qualitative examples. We include qualitative examples in Figures 18 to 21.

C.9 Example of a key. Sample MLP-column swaps from the 5% key (180M):
[ ... ]
[[0, 1359], [1, 5026]]
[[2, 2132], [3, 4936]]
[ ... ]
Each entry [ℓa, c…

**Figure 20.** Qualitative example of generation from TLM-650M fine-tuned for instruction following. The public tier Cpub produces a repetitive and low-quality continuation, while the keyed tier CK follows the instruction and generates a coherent sports-commentary response.

Spanish Example
Prompt: La capital de España es
Public output (Cpub): pañol - La capital de España español - La capital de España español [...]
Keyed ou…

**Figure 21.** Qualitative example of generation from TLM-650M fine-tuned on Spanish. Given the prompt “La capital de España es”, the public configuration Cpub fails to interpret es as the Spanish verb “is” in context, and instead continues toward fragments such as “español,” producing an incoherent completion. In contrast, the keyed configuration CK correctly understands the geographical prompt and produces a coherent …

Original abstract

Open-weight Large Language Models (LLMs) enable scientific progress and broad deployment. However, they make it difficult to control access to sensitive capabilities. Current practice either suppresses dangerous capabilities before release or mediates access through closed services that use specialized model variants, input/output monitors, and API permissions. The former is susceptible to jailbreaks while sacrificing capability for all users to mitigate the risks posed by a few, and the latter is fundamentally incompatible with open-weight release. In this paper, we propose Tiered Language Models (TLMs), where a single set of released weights supports multiple capability levels. In its default public configuration, a TLM behaves as a conventional LLM. A compact secret key specifies a permutation over a small parameter subset, inducing an alternative computation graph over the same weights that exposes additional capabilities. We develop a training protocol that jointly pretrains both configurations from scratch, then fine-tunes the keyed configuration on private data with regularization to preserve the public model's behavior. We pretrain 180M- and 650M-parameter TLMs and demonstrate that the keyed configuration can acquire a new language, gain instruction-following ability, and memorize private factual knowledge, whereas the public configuration exhibits none of these capabilities. Moreover, we show that our approach extends naturally to multiple hierarchical tiers. Because authorization operates on the model's weight structure rather than in the input space, the mechanism resists fine-tuning-based extraction and partial key compromise. In general, TLMs take a step toward reconciling open-weight release with selective capability control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The keyed permutation on a parameter subset is a distinct mechanism for tiered capabilities, but the abstract's lack of metrics leaves the no-leakage claim unverified.

The paper's main move is to release one set of weights that can run in a default public mode or, with a secret key, switch to a higher-capability mode by permuting a small slice of parameters. This creates an alternative computation path over the shared weights without changing the released file. They jointly pretrain both modes from scratch on 180M and 650M models, then fine-tune only the keyed path on private data while regularizing to hold the public behavior fixed. The demos show the keyed version picking up a new language, instruction following, and private facts while the public version shows none, and the setup resists simple fine-tuning extraction.

That construction is not in the cited prior work on suppression or API gating, so the idea itself is new. The weight-structure authorization rather than input monitoring is also a practical distinction.

The soft spot is the isolation claim. Because the weights stay shared, any overlap from the joint pretraining could embed private knowledge into the public path, and the regularization has to block that for every downstream task. The abstract gives no numbers on leakage rates, no error bars, and no ablations on regularization strength or the size of the permuted subset. The stress-test concern about gradient signals during pretraining therefore lands; without those measurements the central result stays provisional even on the small models.

This is for groups working on open-weight release policy and capability control. A reader who wants a concrete alternative to full suppression or closed APIs will find the mechanism worth examining. The work shows clear thinking on the problem and honest engagement with the open-release tension, so it deserves a serious referee even though the current evidence is thin on the isolation guarantee.

Referee Report

2 major / 1 minor

Summary. The paper proposes Tiered Language Models (TLMs), in which a single set of released weights supports multiple capability levels via a compact secret key that specifies a permutation over a small parameter subset, inducing an alternative computation graph. A training protocol jointly pretrains both public and keyed configurations from scratch, then fine-tunes only the keyed path on private data while regularizing to preserve public behavior. Demonstrations on 180M- and 650M-parameter models show the keyed configuration acquiring a new language, instruction-following, and private factual knowledge while the public configuration exhibits none; the mechanism is claimed to resist fine-tuning-based extraction and to extend to hierarchical tiers.

Significance. If the claimed isolation of capabilities through the keyed permutation and regularization holds under rigorous verification, the result would be significant for the field: it offers a weight-structure-based mechanism for selective capability control that is compatible with open-weight release, unlike input-space monitoring or capability suppression. The empirical demonstrations on small-scale models and the natural extension to multiple tiers provide a concrete starting point, though the absence of quantitative validation leaves the practical impact uncertain.

major comments (2)

[Abstract] Abstract: the claim of successful demonstrations on 180M and 650M models for language acquisition, instruction following, and private memorization provides no quantitative metrics, error bars, or details on how public-mode leakage was measured or prevented; this directly undermines assessment of the central decoupling claim.
[Training protocol] Training protocol description: the assertion that joint pretraining plus regularization during keyed fine-tuning sufficiently isolates private capabilities from the shared weights lacks any ablation on regularization strength, bound on representational overlap, or test of whether gradient signals during pretraining embed private knowledge into the public path; this is load-bearing for the isolation result.

minor comments (1)

[Method] The notation for the permutation mechanism and the size of the permuted parameter subset could be clarified with an explicit equation or pseudocode to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments below and commit to revisions that strengthen the presentation of results and protocol details without altering the core claims.

Referee: [Abstract] Abstract: the claim of successful demonstrations on 180M and 650M models for language acquisition, instruction following, and private memorization provides no quantitative metrics, error bars, or details on how public-mode leakage was measured or prevented; this directly undermines assessment of the central decoupling claim.

Authors: We agree the abstract would be strengthened by including quantitative results. The full manuscript reports performance metrics (e.g., accuracy on held-out private tasks) in the experiments section with error bars from multiple seeds, and public-mode leakage is quantified by evaluating the public configuration on private-language, instruction, and memorization benchmarks, yielding near-zero capability. We will revise the abstract to summarize these key metrics and the leakage evaluation protocol.

revision: yes

Referee: [Training protocol] Training protocol description: the assertion that joint pretraining plus regularization during keyed fine-tuning sufficiently isolates private capabilities from the shared weights lacks any ablation on regularization strength, bound on representational overlap, or test of whether gradient signals during pretraining embed private knowledge into the public path; this is load-bearing for the isolation result.

Authors: The manuscript describes the joint pretraining objective and the regularization term applied only during the subsequent keyed fine-tuning stage on private data. Because private data is introduced exclusively after pretraining, gradient signals from private examples cannot affect the public path during pretraining. We will add ablations varying regularization strength and measuring representational overlap (via cosine similarity of activations) in the revised version to make the isolation evidence more quantitative.

revision: yes
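The representational-overlap measurement mentioned in this response could take many forms. As a minimal sketch, assuming overlap means the mean cosine similarity between the public and keyed configurations' activation vectors on the same inputs, it might look like the following. The helper names and the toy activations are hypothetical; the rebuttal does not specify which layers or inputs would be used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_activation_overlap(public_acts, keyed_acts):
    """Average per-input cosine similarity between two activation sets."""
    sims = [cosine_similarity(p, k) for p, k in zip(public_acts, keyed_acts)]
    return sum(sims) / len(sims)


# Toy activations for two inputs (placeholders, not measured values).
public_acts = [[1.0, 0.0], [0.0, 2.0]]
keyed_acts = [[1.0, 0.0], [3.0, 0.0]]

overlap = mean_activation_overlap(public_acts, keyed_acts)
print(overlap)  # one identical pair and one orthogonal pair average to 0.5
```

A value near 1 would indicate that the two configurations represent an input almost identically, while a value near 0 would indicate largely unrelated representations.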

Circularity Check

0 steps flagged

No circularity: empirical protocol with no self-referential reductions

The paper introduces Tiered Language Models via a joint pretraining protocol followed by keyed fine-tuning and regularization, validated empirically on 180M- and 650M-parameter models. No equations, derivations, or fitted parameters are presented that reduce any claimed capability separation to its own inputs by construction. The mechanism is defined procedurally and tested experimentally rather than through self-definition, renamed known results, or load-bearing self-citations that would force the outcome. The work is self-contained against external benchmarks as an empirical proposal.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach rests on the unstated premise that a small permutation can create a functionally separate computation graph without intermediate leakage, plus the empirical claim that joint pretraining plus regularization suffices to isolate the tiers.

free parameters (2)

size of permuted parameter subset
The fraction of parameters affected by the key is chosen to balance capability separation and public-mode stability.
regularization strength during keyed fine-tuning
The strength of the term that preserves public behavior is a tunable hyperparameter.
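To show where the second free parameter enters, here is a minimal sketch of a regularized fine-tuning objective. It assumes the loss is a cross-entropy term on private data through the keyed configuration plus a weighted KL term that penalizes drift of the public configuration's output distribution from a pre-fine-tuning reference. The Figure 11 caption confirms that a KL weight exists, but the exact form, direction, and inputs of the KL term are assumptions here, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target token."""
    return -math.log(softmax(logits)[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_loss(keyed_private_logits, target_index,
                     ref_public_logits, cur_public_logits, kl_weight):
    """Private-data loss plus a weighted penalty on public-behavior drift.

    Assumed form: CE(keyed, private) + kl_weight * KL(reference || current).
    """
    private_term = cross_entropy(keyed_private_logits, target_index)
    drift_term = kl_divergence(softmax(ref_public_logits),
                               softmax(cur_public_logits))
    return private_term + kl_weight * drift_term


no_drift = regularized_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0], kl_weight=0.5)
with_drift = regularized_loss([0.0, 0.0], 0, [1.0, 2.0], [2.0, 1.0], kl_weight=0.5)
print(no_drift, with_drift)
```

Raising `kl_weight` makes any change in public behavior more expensive, which mirrors the trade-off described in the Figure 11 caption between adapting to private data and drifting from public behavior.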

axioms (1)

domain assumption: A compact secret key can specify a permutation that induces an alternative computation graph over the same weights.
Invoked in the description of how the keyed configuration works.

invented entities (1)

Tiered Language Model with keyed permutation mechanism (no independent evidence)
purpose: To separate public and private capabilities within one weight set.
New mechanism introduced to solve the open-weight control problem.

Reference graph

Works this paper leans on

50 extracted references · 15 canonical work pages · 5 internal anchors

[1]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =

Improving Transformer Models by Reordering their Sublayers , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , year =
[2]

2024 , url=

Ryan Greenblatt and Fabien Roger and Dmitrii Krasheninnikov and David Krueger , booktitle=. 2024 , url=

2024
[3]

Findings of the Association for Computational Linguistics: NAACL 2024

Tang, Ruixiang and Chuang, Yu-Neng and Cai, Xuanting and Du, Mengnan and Hu, Xia. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.256

work page doi:10.18653/v1/2024.findings-naacl.256 2024
[4]

2021 , eprint=

LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

2021
[5]

Smith and Luke Zettlemoyer and Pang Wei Koh and Hannaneh Hajishirzi and Ali Farhadi and Sewon Min , year=

Weijia Shi and Akshita Bhagia and Kevin Farhat and Niklas Muennighoff and Pete Walsh and Jacob Morrison and Dustin Schwenk and Shayne Longpre and Jake Poznanski and Allyson Ettinger and Daogao Liu and Margaret Li and Dirk Groeneveld and Mike Lewis and Wen-tau Yih and Luca Soldaini and Kyle Lo and Noah A. Smith and Luke Zettlemoyer and Pang Wei Koh and Han...

work page arXiv
[6]

Alec Radford and Jeff Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever , year=
[7]

2025 , eprint=

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs , author=. 2025 , eprint=

2025
[8]

Wikimedia Downloads

Wikimedia. Wikimedia Downloads
[9]

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Tinystories: How small can language models be and still speak coherent english? , author=. arXiv preprint arXiv:2305.07759 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Guilherme Penedo and Hynek Kydl. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[11]

Second Conference on Language Modeling , year=

Guilherme Penedo and Hynek Kydl. Second Conference on Language Modeling , year=
[12]

Advances in Neural Information Processing Systems , editor=

An empirical analysis of compute-optimal large language model training , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

2022
[13]

Hwang and Luca Soldaini and Akshita Bhagia and Jiacheng Liu and Dirk Groeneveld and Oyvind Tafjord and Noah A

Ian Magnusson and Nguyen Tai and Ben Bogin and David Heineman and Jena D. Hwang and Luca Soldaini and Akshita Bhagia and Jiacheng Liu and Dirk Groeneveld and Oyvind Tafjord and Noah A. Smith and Pang Wei Koh and Jesse Dodge , booktitle=. 2025 , url=

2025
[14]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[15]

2025 , url=

Liu, Qin and Wang, Fei and Xiao, Chaowei and Chen, Muhao , booktitle=. 2025 , url=

2025
[16]

2023 , url=

Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea , journal=. 2023 , url=

2023
[17]

Fleshman, William and Khan, Aleem and Marone, Marc and Van Durme, Benjamin , journal=
[18]

2022 , url=

Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. 2022 , url=

2022
[19]

Shi, Weijia and Bhagia, Akshita and Farhat, Kevin and Muennighoff, Niklas and Walsh, Pete and Morrison, Jacob and Schwenk, Dustin and Longpre, Shayne and Poznanski, Jake and Ettinger, Allyson and others , journal=
[20]

2025 , url=

He, Lipeng and Duddu, Vasisht and Asokan, N , journal=. 2025 , url=

2025
[21]

Hashimoto , title =

Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , month =

2023
[22]

2025 , eprint=

gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=

2025
[23]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Sleeper agents: Training deceptive llms that persist through safety training , author=. arXiv preprint arXiv:2401.05566 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

and Liang, Percy and Narayanan, Arvind , title =

Kapoor, Sayash and Bommasani, Rishi and Klyman, Kevin and Longpre, Shayne and Ramaswami, Ashwin and Cihon, Peter and Hopkins, Aspen and Bankston, Kevin and Biderman, Stella and Bogen, Miranda and Chowdhury, Rumman and Engler, Alex and Henderson, Peter and Jernite, Yacine and Lazar, Seth and Maffulli, Stefano and Nelson, Alondra and Pineau, Joelle and Skow...

2024
[25]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Huang, Hanbo and Li, Yihan and Jiang, Bowen and Jiang, Bo and Liu, Lin and Liu, Zhuotao and Sun, Ruoyu and Liang, Shiyu. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.420

work page doi:10.18653/v1/2025.emnlp-main.420 2025
[26]

Elizabeth Seger and Noemi Dreksler and Richard Moulange and Emily Dardaman and Jonas Schuett and K. Wei and Christoph Winter and Mackenzie Arnold and Seán Ó hÉigeartaigh and Anton Korinek and Markus Anderljung and Ben Bucknall and Alan Chan and Eoghan Stafford and Leonie Koessler and Aviv Ovadya and Ben Garfinkel and Emma Bluemke and Michael Aird and Patr...

work page arXiv
[27]

and Stoica, Ion , booktitle =

Sheng, Ying and Cao, Shiyi and Li, Dacheng and Hooper, Coleman and Lee, Nicholas and Yang, Shuo and Chou, Christopher and Zhu, Banghua and Zheng, Lianmin and Keutzer, Kurt and Gonzalez, Joseph E. and Stoica, Ion , booktitle =
[28]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou and Zifan Wang and Nicholas Carlini and Milad Nasr and J. Zico Kolter and Matt Fredrikson , year=. 2307.15043 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Many-shot Jailbreaking , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[30]

Weinberger , year=

Chuan Guo and Ruihan Wu and Kilian Q. Weinberger , year=
[31]

The role of permutation invariance in linear mode connectivity of neural networks.arXiv preprint arXiv:2110.06296, 2021

The role of permutation invariance in linear mode connectivity of neural networks , author=. arXiv preprint arXiv:2110.06296 , year=

work page arXiv
[32]

Git re-basin: Merging models modulo permutation symmetries.arXiv preprint arXiv:2209.04836, 2022

Git re-basin: Merging models modulo permutation symmetries , author=. arXiv preprint arXiv:2209.04836 , year=

work page arXiv
[33]

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape , author=. arXiv preprint arXiv:1907.02911 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1907
[34]

International Conference on Machine Learning , pages=

Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021
[35]

arXiv preprint arXiv:2211.08403 , year=

Repair: Renormalizing permuted activations for interpolation repair , author=. arXiv preprint arXiv:2211.08403 , year=

work page arXiv
[36]

Merging text transformer models from different initializations

Merging text transformer models from different initializations , author=. arXiv preprint arXiv:2403.00986 , year=

work page arXiv
[37]

Generalized linear mode connectivity for transformers.arXiv preprint arXiv:2506.22712, 2025

Generalized linear mode connectivity for transformers , author=. arXiv preprint arXiv:2506.22712 , year=

work page arXiv
[38]

Superposition of many models into one , url =

Cheung, Brian and Terekhov, Alexander and Chen, Yubei and Agrawal, Pulkit and Olshausen, Bruno , booktitle =. Superposition of many models into one , url =
[39]

2020 , url=

Yeming Wen and Dustin Tran and Jimmy Ba , booktitle=. 2020 , url=

2020
[40]

Advances in Neural Information Processing Systems , volume=

Matryoshka representation learning , author=. Advances in Neural Information Processing Systems , volume=
[41]

Sheng, Ying and Cao, Shiyi and Li, Dacheng and Hooper, Coleman and Lee, Nicholas and Yang, Shuo and Chou, Christopher and Zhu, Banghua and Zheng, Lianmin and Keutzer, Kurt and others , journal=
[42]

Chen, Lequn and Ye, Zihao and Wu, Yongji and Zhuo, Danyang and Ceze, Luis and Krishnamurthy, Arvind , journal=
[43]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Proceedings of the 41st International Conference on Machine Learning, 2024.
[44]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market. 2025. doi:10.1145/3731569.3764815
[45]

No More, No Less: Least-Privilege Language Models. 2026.
[46]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Simony... and Laurent Sifre. 2022.
[47]

Qwen3 Technical Report. 2025.
[48]

2026.
[49]

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan ...
[50]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and others. 2019.
