Locket: Robust Feature-Locking Technique for Language Models
Pith reviewed 2026-05-18 08:17 UTC · model grok-4.3
The pith
Locket locks specific language model features using adversarial adapters while preserving utility on others.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Locket is the first robust and scalable feature-locking technique that simultaneously meets the four requirements for pay-to-unlock schemes: effectiveness, utility preservation, robustness, and scalability. A framework of adversarial training and merging of feature-locking adapters selectively disables targeted model capabilities. Evaluation demonstrates 100% refusal on locked features, at most 7% utility degradation on unlocked features, at most 5% attack success rate, and scalability across multiple features and clients.
What carries the argument
The framework of adversarial training followed by merging of feature-locking adapters, which builds refusal boundaries that activate only for designated features.
If this is right
- Providers can charge separately for unlocking premium capabilities without releasing full advanced models.
- Models keep refusing locked features under prompt-based evasion attempts or credential sharing, with a reported attack success rate of at most 5%.
- Multiple features can be locked independently for different user tiers on the same base model.
- Utility on non-locked tasks remains within 7% of the original model performance.
Where Pith is reading between the lines
- The same adapter-merging approach might be tested for locking safety or alignment behaviors rather than just capability features.
- Dynamic credential checks at inference time could turn the static locks into per-user feature controls.
- Longer-term interactions between locked and unlocked features could be measured on extended conversation threads.
Load-bearing premise
The adversarial training and adapter-merging steps will continue to create reliable refusal boundaries when the locked features, prompt distributions, or base models differ substantially from the tested cases.
What would settle it
Retrain Locket on a new base model or a fresh set of locked features outside the original evaluation and check whether refusal drops below 100% or attack success exceeds 5%.
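The check described above can be sketched as a small pass/fail function. This is a minimal illustration, not the paper's evaluation harness: the names `evaluate_lock` and `LockEvalResult` are hypothetical, and it assumes each prompt has already been judged as refused or not, and each attack as successful or not.

```python
from dataclasses import dataclass


@dataclass
class LockEvalResult:
    refusal_rate: float
    attack_success_rate: float
    passes: bool


def evaluate_lock(refused_locked, attack_succeeded,
                  min_refusal=1.0, max_attack_success=0.05):
    """Compare observed rates against the thresholds quoted in the abstract.

    refused_locked: list of bools, one per plain locked-feature prompt
        (True means the model refused).
    attack_succeeded: list of bools, one per adversarial prompt
        (True means the lock was evaded).
    The default thresholds (100% refusal, 5% attack success) come from the
    abstract; how each outcome is judged is left open here.
    """
    if not refused_locked or not attack_succeeded:
        raise ValueError("both outcome lists must be non-empty")
    refusal_rate = sum(refused_locked) / len(refused_locked)
    attack_success_rate = sum(attack_succeeded) / len(attack_succeeded)
    passes = (refusal_rate >= min_refusal
              and attack_success_rate <= max_attack_success)
    return LockEvalResult(refusal_rate, attack_success_rate, passes)
```

Running this on a new base model or a fresh feature set and getting `passes == False` would count as evidence against the generalization premise.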
Original abstract
Chatbot service providers (e.g., OpenAI) rely on tiered subscription plans to generate revenue, offering black-box access to basic models for free users and advanced models to paying subscribers. However, this approach is unprofitable and inflexible. A pay-to-unlock scheme for premium features (e.g., math, coding) offers a more sustainable alternative. Enabling such a scheme requires a feature-locking technique (FLoTE) that is (i) effective in refusing locked features, (ii) utility-preserving for unlocked features, (iii) robust against evasion or unauthorized credential sharing, and (iv) scalable to multiple features and clients. Existing FLoTEs (e.g., password-locked models) fail to meet these criteria. To fill this gap, we present Locket, the first robust and scalable FLoTE to enable pay-to-unlock schemes. We develop a framework for adversarial training and merging of feature-locking adapters, which enables Locket to selectively disable specific features of a model. Evaluation shows that Locket is effective ($100$% refusal rate), utility-preserving ($\leq 7$% utility degradation), robust ($\leq 5$% attack success rate), and scalable to multiple features and clients.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Locket, a feature-locking technique (FLoTE) for language models that uses adversarial training of per-feature adapters followed by merging to enable selective refusal of locked capabilities (e.g., math or coding) while preserving utility on unlocked ones. It claims this framework simultaneously achieves 100% refusal rate on locked features, ≤7% utility degradation, ≤5% attack success rate, and scalability to multiple features/clients, addressing limitations of prior password-locked models for pay-to-unlock monetization schemes.
Significance. If the empirical claims are substantiated with detailed, reproducible evaluations, Locket would represent a practical advance in AI access control and security, enabling flexible tiered service models beyond current black-box subscriptions. The combination of adversarial training with adapter merging is a plausible direction for robust capability gating, though its strength depends on generalization beyond the evaluated settings.
major comments (2)
- [Evaluation] Evaluation section: The abstract states concrete figures (100% refusal, ≤7% utility loss, ≤5% attack success) but provides no information on the specific attack models, prompt distributions, dataset splits, number of trials, or statistical significance testing. These details are load-bearing for the central robustness and effectiveness claims and must be supplied with full experimental protocols.
- [Method] Adapter merging procedure (method section): The scalability claim requires that merging multiple feature-locking adapters preserves individual refusal boundaries without interference or dilution. No analysis, ablation, or experiments are described that test boundary stability when the number of locked features increases or when prompt distributions shift, which directly undermines the multi-client and multi-feature guarantees.
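To illustrate why the number of trials matters for the first major comment: an observed 100% refusal rate only bounds the true failure rate, and the bound depends on the sample size. The sketch below uses the standard exact binomial bound for zero observed failures; the function names are hypothetical, and it assumes independent trials, which the paper's prompt sets may not satisfy.

```python
import math


def zero_failure_upper_bound(n_trials, confidence=0.95):
    """One-sided upper bound on the true failure rate after observing
    0 failures in n_trials independent trials.

    Solves (1 - p)^n = alpha for p, with alpha = 1 - confidence.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def trials_needed(max_failure_rate, confidence=0.95):
    """Smallest number of failure-free trials needed so that the bound
    above falls at or below max_failure_rate."""
    if not 0.0 < max_failure_rate < 1.0:
        raise ValueError("max_failure_rate must be in (0, 1)")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_failure_rate))
```

For example, under these assumptions 100 locked prompts with no failed refusal still leave room for a true failure rate of roughly 3% at 95% confidence, which is why the report asks for trial counts alongside the headline figures.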
minor comments (2)
- [Related Work] Add explicit comparison table against prior FLoTE baselines (e.g., password-locked models) with identical attack and utility metrics.
- [Method] Clarify the exact merging operation (weighted average, concatenation, or other) and any hyperparameters involved in the adapter framework.
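One candidate reading of the merging operation, offered only as a sketch: the paper passage quoted further down this page says that Locket merging clips the spectral norm of the merged adapter's weight matrix. The code below assumes, without confirmation from the paper, that adapters are merged by summing their weight deltas and that clipping means rescaling the sum whenever its largest singular value exceeds a cap. The names `spectral_norm`, `merge_and_clip`, and `max_norm` are hypothetical.

```python
import math
import random


def spectral_norm(w, iters=200, seed=0):
    """Estimate the largest singular value of matrix w (list of rows)
    by power iteration on w^T w, starting from a seeded random vector."""
    n_rows, n_cols = len(w), len(w[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    for _ in range(iters):
        u = [sum(w[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        z = [sum(w[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_z = math.sqrt(sum(x * x for x in z))
        if norm_z == 0.0:
            return 0.0  # w maps the iterate to zero; treat w as zero
        v = [x / norm_z for x in z]
    u = [sum(w[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))


def merge_and_clip(deltas, max_norm):
    """Sum adapter weight deltas (assumed merge rule), then rescale the
    result so its spectral norm does not exceed max_norm (assumed clip rule)."""
    merged = [row[:] for row in deltas[0]]
    for d in deltas[1:]:
        merged = [[x + y for x, y in zip(rm, rd)] for rm, rd in zip(merged, d)]
    sigma = spectral_norm(merged)
    if sigma > max_norm:
        scale = max_norm / sigma
        merged = [[x * scale for x in row] for row in merged]
    return merged
```

Other readings are equally plausible, such as clipping individual singular values or merging by weighted average, which is why the exact operation and its hyperparameters need to be stated in the paper.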
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address each of the major comments in detail below and outline the changes we will make to the manuscript.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: The abstract states concrete figures (100% refusal, ≤7% utility loss, ≤5% attack success) but provides no information on the specific attack models, prompt distributions, dataset splits, number of trials, or statistical significance testing. These details are load-bearing for the central robustness and effectiveness claims and must be supplied with full experimental protocols.
  Authors: We agree that full experimental protocols are essential for validating our claims. Although the evaluation section describes the overall setup, we acknowledge that it gives too little detail on attack models, prompt distributions, dataset splits, number of trials, and statistical testing. In the revised manuscript, we will expand this section to include complete experimental protocols covering these specifics, to ensure reproducibility and transparency.
  Revision: yes
- Referee: [Method] Adapter merging procedure (method section): The scalability claim requires that merging multiple feature-locking adapters preserves individual refusal boundaries without interference or dilution. No analysis, ablation, or experiments are described that test boundary stability when the number of locked features increases or when prompt distributions shift, which directly undermines the multi-client and multi-feature guarantees.
  Authors: We appreciate this observation regarding the adapter merging procedure. Our experiments do demonstrate scalability to multiple features and clients with maintained performance, but we concur that explicit ablations and analysis of boundary stability under more locked features and under prompt distribution shifts would strengthen the claims. We will add these analyses and additional experiments in the revised version.
  Revision: yes
Circularity Check
No circularity: Locket claims rest on empirical evaluation of adversarial training and adapter merging, not on derivations that reduce to inputs by construction.
Full rationale
The paper introduces Locket via a framework of adversarial training and merging of feature-locking adapters, then reports experimental outcomes (100% refusal, ≤7% utility degradation, ≤5% attack success, scalability to multiple features/clients). No equations, first-principles derivations, or parameter-fitting steps are described that would make any reported metric equivalent to the training objective or to a self-defined quantity. The evaluation uses concrete models, prompts, and attack methods that are externally falsifiable and independent of the claimed results; the central claims therefore do not collapse into self-definition, fitted-input renaming, or self-citation chains. This is the normal case for an empirical systems paper and warrants score 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Adversarial training can produce adapters that reliably refuse specific capabilities while leaving others intact.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  We develop a framework for adversarial training and merging of feature-locking adapters... LOCKET merging... clips the spectral norm of the merged adapter’s weight matrix
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Evaluation shows that LOCKET is effective (100% refusal rate), utility-preserving (≤7% utility degradation), robust (≤5% attack success rate), and scalable
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.