arxiv: 2510.12117 · v3 · submitted 2025-10-14 · 💻 cs.CR · cs.LG

Locket: Robust Feature-Locking Technique for Language Models

Pith reviewed 2026-05-18 08:17 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords feature locking · language models · adversarial training · adapters · pay-to-unlock · model security · refusal · AI access control

The pith

Locket locks specific language model features using adversarial adapters while preserving utility on others.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Locket to enable pay-to-unlock schemes for AI features such as advanced math or coding. It develops a framework that applies adversarial training to create feature-locking adapters and then merges them into the base model. This produces selective refusal of locked capabilities at a 100% rate, holds utility loss to 7% or less on unlocked tasks, and limits successful attacks to 5% or below. Service providers could shift from rigid subscription tiers to granular feature access that is both profitable and harder to bypass.

Core claim

Locket is the first robust and scalable feature-locking technique that simultaneously meets the four requirements for pay-to-unlock schemes: effectiveness, utility preservation, robustness, and scalability. A framework of adversarial training and merging of feature-locking adapters selectively disables targeted model capabilities. Evaluation demonstrates 100% refusal on locked features, at most 7% utility degradation on unlocked features, at most 5% attack success rate, and scalability across multiple features and clients.
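
As a rough illustration of how the three headline numbers could be computed from raw evaluation counts, here is a minimal Python sketch. The `EvalCounts` fields and all function names are hypothetical, and the sketch assumes utility degradation is a relative drop against the base model; this summary does not say whether the paper reports a relative or an absolute drop.

```python
from dataclasses import dataclass


@dataclass
class EvalCounts:
    """Raw counts from one hypothetical evaluation run (illustrative names)."""
    locked_prompts: int      # prompts that exercise a locked feature
    locked_refused: int      # of those, how many the model refused
    attack_prompts: int      # adversarial prompts aimed at a locked feature
    attack_successes: int    # of those, how many elicited the locked behaviour
    base_utility: float      # base model score on unlocked tasks
    locked_utility: float    # locked model score on the same unlocked tasks


def refusal_rate(c: EvalCounts) -> float:
    return c.locked_refused / c.locked_prompts


def attack_success_rate(c: EvalCounts) -> float:
    return c.attack_successes / c.attack_prompts


def utility_degradation(c: EvalCounts) -> float:
    # Assumption: relative drop with respect to the base model's score.
    return (c.base_utility - c.locked_utility) / c.base_utility


def meets_reported_thresholds(c: EvalCounts) -> bool:
    """Check the thresholds quoted in the core claim: 100%, <=7%, <=5%."""
    return (
        refusal_rate(c) == 1.0
        and utility_degradation(c) <= 0.07
        and attack_success_rate(c) <= 0.05
    )
```

Any counts passed to these functions would be placeholders until the paper's evaluation protocol (prompt sets, attack suite, benchmark) is pinned down.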

What carries the argument

The framework of adversarial training followed by merging of feature-locking adapters, which builds refusal boundaries that activate only for designated features.
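
To make the merging step concrete, the following Python sketch shows one simple way adapters could be folded into a base weight matrix. It is not taken from the paper: it assumes low-rank (LoRA-style) adapters stored as a pair of matrices `(B, A)` and a plain additive merge with a single `scale` factor. The summary does not state which merging operation Locket actually uses, and every name here is hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, kept dependency-free for illustration."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


def merge_adapters(
    base: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
    locked: List[str],
    scale: float = 1.0,
) -> Matrix:
    """Return base + scale * sum(B @ A) over the adapters named in `locked`.

    Assumption: a naive additive merge. Real merging schemes may reweight
    or prune the deltas to limit interference between adapters.
    """
    merged = [row[:] for row in base]  # copy so the base weights stay intact
    for name in locked:
        b, a = adapters[name]
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged
```

In this toy form, locking an extra feature only adds another delta to the same base weights. Whether the per-feature refusal boundaries survive that addition is exactly what the referee's second major comment asks the authors to test.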

If this is right

  • Providers can charge separately for unlocking premium capabilities without releasing full advanced models.
  • Models refuse locked features even under prompt-based evasion attempts or credential sharing.
  • Multiple features can be locked independently for different user tiers on the same base model.
  • Utility on non-locked tasks remains within 7% of the original model performance.
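
The per-tier locking in the list above can be pictured as a mapping from each client's purchased features to the set of adapters that must be merged for that client. The Python sketch below is an assumption about how a provider might organise this, not a description of Locket's implementation. The feature names `math` and `coding` come from the abstract; the client names and helper functions are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

# Features that have a locking adapter available (examples from the abstract).
LOCKABLE: FrozenSet[str] = frozenset({"math", "coding"})


def features_to_lock(
    unlocked: Set[str], lockable: FrozenSet[str] = LOCKABLE
) -> FrozenSet[str]:
    """Everything lockable that the client has not paid to unlock."""
    unknown = set(unlocked) - lockable
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return frozenset(lockable - set(unlocked))


def group_clients_by_lock_set(
    entitlements: Dict[str, Set[str]],
) -> Dict[FrozenSet[str], List[str]]:
    """Group clients that need the same set of locking adapters merged."""
    groups: Dict[FrozenSet[str], List[str]] = {}
    for client, unlocked in entitlements.items():
        groups.setdefault(features_to_lock(unlocked), []).append(client)
    return groups
```

With k lockable features there are up to 2^k distinct lock sets, so grouping clients this way bounds the number of merged variants by the number of feature combinations in use rather than by the number of clients.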

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adapter-merging approach might be tested for locking safety or alignment behaviors rather than just capability features.
  • Dynamic credential checks at inference time could turn the static locks into per-user feature controls.
  • Longer-term interactions between locked and unlocked features could be measured on extended conversation threads.

Load-bearing premise

The adversarial training and adapter-merging steps will continue to create reliable refusal boundaries when the locked features, prompt distributions, or base models differ substantially from the tested cases.

What would settle it

Retrain Locket on a new base model or a fresh set of locked features outside the original evaluation and check whether refusal drops below 100% or attack success exceeds 5%.
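
The check described above could be automated along the following lines. This is a minimal Python sketch: the setting names and the `SettingResult` type are hypothetical, and any numbers fed into it would come from new experiments, not from the paper.

```python
from typing import Dict, List, NamedTuple


class SettingResult(NamedTuple):
    refusal_rate: float          # fraction in [0, 1] on locked-feature prompts
    attack_success_rate: float   # fraction in [0, 1] on adversarial prompts


def falsifying_settings(
    results: Dict[str, SettingResult],
    min_refusal: float = 1.0,
    max_attack_success: float = 0.05,
) -> List[str]:
    """Names of settings where refusal falls short or attacks succeed too often."""
    return sorted(
        name
        for name, r in results.items()
        if r.refusal_rate < min_refusal
        or r.attack_success_rate > max_attack_success
    )
```

An empty list on held-out base models and features would support the load-bearing premise; any returned name marks a setting where the reported bounds do not carry over.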

Original abstract

Chatbot service providers (e.g., OpenAI) rely on tiered subscription plans to generate revenue, offering black-box access to basic models for free users and advanced models to paying subscribers. However, this approach is unprofitable and inflexible. A pay-to-unlock scheme for premium features (e.g., math, coding) offers a more sustainable alternative. Enabling such a scheme requires a feature-locking technique (FLoTE) that is (i) effective in refusing locked features, (ii) utility-preserving for unlocked features, (iii) robust against evasion or unauthorized credential sharing, and (iv) scalable to multiple features and clients. Existing FLoTEs (e.g., password-locked models) fail to meet these criteria. To fill this gap, we present Locket, the first robust and scalable FLoTE to enable pay-to-unlock schemes. We develop a framework for adversarial training and merging of feature-locking adapters, which enables Locket to selectively disable specific features of a model. Evaluation shows that Locket is effective ($100$% refusal rate), utility-preserving ($\leq 7$% utility degradation), robust ($\leq 5$% attack success rate), and scalable to multiple features and clients.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Locket, a feature-locking technique (FLoTE) for language models that uses adversarial training of per-feature adapters followed by merging to enable selective refusal of locked capabilities (e.g., math or coding) while preserving utility on unlocked ones. It claims this framework simultaneously achieves 100% refusal rate on locked features, ≤7% utility degradation, ≤5% attack success rate, and scalability to multiple features/clients, addressing limitations of prior password-locked models for pay-to-unlock monetization schemes.

Significance. If the empirical claims are substantiated with detailed, reproducible evaluations, Locket would represent a practical advance in AI access control and security, enabling flexible tiered service models beyond current black-box subscriptions. The combination of adversarial training with adapter merging is a plausible direction for robust capability gating, though its strength depends on generalization beyond the evaluated settings.

major comments (2)
  1. [Evaluation] Evaluation section: The abstract states concrete figures (100% refusal, ≤7% utility loss, ≤5% attack success) but provides no information on the specific attack models, prompt distributions, dataset splits, number of trials, or statistical significance testing. These details are load-bearing for the central robustness and effectiveness claims and must be supplied with full experimental protocols.
  2. [Method] Adapter merging procedure (method section): The scalability claim requires that merging multiple feature-locking adapters preserves individual refusal boundaries without interference or dilution. No analysis, ablation, or experiments are described that test boundary stability when the number of locked features increases or when prompt distributions shift, which directly undermines the multi-client and multi-feature guarantees.
minor comments (2)
  1. [Related Work] Add an explicit comparison table against prior FLoTE baselines (e.g., password-locked models) with identical attack and utility metrics.
  2. [Method] Clarify the exact merging operation (weighted average, concatenation, or other) and any hyperparameters involved in the adapter framework.
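
The first major comment asks for the number of trials and for statistical testing. A short Python sketch shows why that matters for a figure such as "100% refusal". It uses the standard Wilson score interval for a binomial proportion. The trial count of 100 mentioned afterwards is a hypothetical example, since this summary does not report the paper's sample sizes.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int, trials: int, z: float = 1.96
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

For instance, if all 100 of 100 locked prompts were refused, the 95% lower bound on the true refusal rate would be roughly 96%, not 100%. Reporting the trial counts therefore changes how the headline figures should be read.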

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments in detail below and outline the changes we will make to the manuscript.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The abstract states concrete figures (100% refusal, ≤7% utility loss, ≤5% attack success) but provides no information on the specific attack models, prompt distributions, dataset splits, number of trials, or statistical significance testing. These details are load-bearing for the central robustness and effectiveness claims and must be supplied with full experimental protocols.

    Authors: We agree that full experimental protocols are essential for validating our claims. Although the evaluation section describes the overall setup, it does not give enough detail on the attack models, prompt distributions, dataset splits, number of trials, and statistical testing. In the revised manuscript, we will expand this section with complete experimental protocols covering these specifics, to ensure reproducibility and transparency.
    Revision: yes

  2. Referee: [Method] Adapter merging procedure (method section): The scalability claim requires that merging multiple feature-locking adapters preserves individual refusal boundaries without interference or dilution. No analysis, ablation, or experiments are described that test boundary stability when the number of locked features increases or when prompt distributions shift, which directly undermines the multi-client and multi-feature guarantees.

    Authors: We appreciate this observation regarding the adapter merging procedure. Our experiments do demonstrate scalability to multiple features and clients with maintained performance, but we concur that explicit ablations and analysis of boundary stability under increased numbers of locked features and prompt distribution shifts would strengthen the claims. We will add such analyses and additional experiments in the revised version to address this point directly.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: Locket claims rest on empirical evaluation of adversarial training and adapter merging, not on derivations that reduce to inputs by construction.

full rationale

The paper introduces Locket via a framework of adversarial training and merging of feature-locking adapters, then reports experimental outcomes (100% refusal, ≤7% utility degradation, ≤5% attack success, scalability to multiple features/clients). No equations, first-principles derivations, or parameter-fitting steps are described that would make any reported metric equivalent to the training objective or to a self-defined quantity. The evaluation uses concrete models, prompts, and attack methods that are externally falsifiable and independent of the claimed results; the central claims therefore do not collapse into self-definition, fitted-input renaming, or self-citation chains. This is the normal case for an empirical systems paper and warrants score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions of adversarial training in machine learning (existence of a loss landscape that can be optimized to produce refusal boundaries) and on the empirical claim that adapter merging preserves both refusal and utility; no new physical constants or invented entities are introduced.

axioms (1)
  • Domain assumption: Adversarial training can produce adapters that reliably refuse specific capabilities while leaving others intact.
    Invoked in the description of the framework for adversarial training and merging.

pith-pipeline@v0.9.0 · 5751 in / 1407 out tokens · 27651 ms · 2026-05-18T08:17:01.024164+00:00 · methodology
