Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Ziyang Liu

arxiv: 2604.18179 · v2 · pith:2BPKAW4Jnew · submitted 2026-04-20 · 💻 cs.CR · cs.AI


Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3


keywords: model substitution detection · sparse autoencoders · commitment protocols · Merkle trees · LLM auditing · adversarial robustness · hosted language models


The pith

A Merkle-tree commitment to per-position sparse-autoencoder feature traces lets verifiers detect silent model substitution in hosted LLMs even when the provider knows the audit rules in advance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hosted LLM providers can advertise a strong model while routing ordinary users to a cheaper substitute, because probe-after-return checks leave a parallel-serve side-channel. The paper closes that channel with a commit-open protocol: the provider first commits, via a Merkle tree, to a sparse-autoencoder feature-trace sketch of every output token at a published probe layer. Verifiers later open random positions, score the traces against a public probe library calibrated with cross-backend noise, and apply a fixed-threshold joint-consistency z-score rule. Experiments show that seventeen attackers, including same-family lifts, cross-family substitutes, and adaptive LoRA of rank up to 128, all fail this rule at one scale-stable threshold, while the same attackers pass a matched parallel-serve baseline. A white-box backpropagation attack through the frozen encoder does not close the margin, and a forgery attack that never runs the honest model is bounded by an intrinsic-dimension argument. The commitment step adds at most 2.1 percent to forward-only wall-clock time at batch size 32.

Core claim

By forcing the provider to commit to sparse-autoencoder feature traces before any opening request, the protocol creates a session-wide record that a verifier can check for joint consistency at random positions, rejecting substitution even against protocol-aware attackers.
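A simplified calculation shows why random openings give the verifier leverage. This is an illustration, not taken from the paper: assume the committed trace is inconsistent at `b` of `n` positions and the verifier opens `k` distinct positions uniformly at random. The chance that at least one inconsistent position is opened is then 1 − C(n−b, k)/C(n, k). The function name and the example numbers are hypothetical, and the paper's actual rule scores the opened positions jointly rather than flagging single positions.

```python
from math import comb


def prob_open_hits(n: int, b: int, k: int) -> float:
    """Probability that k distinct, uniformly random openings out of n
    positions include at least one of b inconsistent positions.

    Illustrative only: a hypergeometric calculation, not the paper's rule.
    """
    if not (0 <= b <= n and 0 <= k <= n):
        raise ValueError("require 0 <= b <= n and 0 <= k <= n")
    # comb(n - b, k) is 0 when k > n - b, so the result is exactly 1.0 there.
    return 1.0 - comb(n - b, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical session: 512 positions, 5% inconsistent, 64 openings.
    print(prob_open_hits(512, 26, 64))
```

Because the provider must commit before learning which positions will be opened, it cannot concentrate honest computation on the audited positions only.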

What carries the argument

The commit-open protocol that uses a Merkle tree to bind the provider to per-position sparse-autoencoder (SAE) feature-trace sketches at a fixed probe layer, scored by a public named-circuit probe library and a fixed-threshold joint-consistency z-score rule.
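The commit and open steps can be sketched as a plain binary Merkle tree over per-position leaves. This is a minimal sketch under stated assumptions, not the paper's implementation: SHA-256, the leaf byte layout, the domain-separation prefixes, and the odd-level padding rule are all hypothetical choices, as are the function names.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(meta: bytes, t: int, features: list[int]) -> bytes:
    # Hypothetical leaf layout: prefix 0x00, session metadata, position,
    # then the sorted feature ids kept for that position.
    body = b"".join(f.to_bytes(4, "big") for f in sorted(features))
    return H(b"\x00" + meta + t.to_bytes(8, "big") + body)


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, leaves first; the root is levels[-1][0]."""
    if not leaves:
        raise ValueError("need at least one leaf")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # assumed padding: duplicate the last node
        levels.append([H(b"\x01" + cur[i] + cur[i + 1])
                       for i in range(0, len(cur), 2)])
    return levels


def open_position(levels: list[list[bytes]], t: int) -> list[bytes]:
    """Sibling hashes from leaf t up to (not including) the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append(level[t ^ 1])  # sibling index flips the lowest bit
        t //= 2
    return path


def verify(root: bytes, leaf: bytes, t: int, path: list[bytes]) -> bool:
    node = leaf
    for sib in path:
        node = H(b"\x01" + (node + sib if t % 2 == 0 else sib + node))
        t //= 2
    return node == root
```

In this sketch the provider would publish only the root before any opening request; each later opening reveals one leaf plus a logarithmic number of sibling hashes. A production design would likely length-prefix the metadata and choose the padding rule more carefully.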

If this is right

All seventeen tested attackers, spanning same-family lifts, cross-family substitutes, and adaptive LoRA of rank up to 128, are rejected at one shared scale-stable threshold.
The identical attackers evade a matched SVIP-style parallel-serve baseline.
A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the detection margin.
A feature-forgery attacker that never runs the honest model is bounded in closed form by an intrinsic-dimension argument.
The commitment step adds no more than 2.1 percent to forward-only wall-clock time at batch size 32.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use would require providers to publish their probe libraries in advance so independent parties can perform audits.
The same commitment structure could apply to other detectable deviations such as safety-filter bypasses if distinctive feature traces for those behaviors exist.
Periodic updates to the probe library may be needed as new model families appear, otherwise the distinctiveness assumption could weaken over time.

Load-bearing premise

That a public probe library calibrated across backends produces feature traces distinctive enough between different models that the fixed z-score rule still works when the provider knows the protocol ahead of time.

What would settle it

Finding a substitute model or fine-tune that produces outputs a user would accept as coming from the advertised model yet still matches the committed SAE traces closely enough to pass the joint-consistency z-score test at the published threshold.

Figures

Figures reproduced from arXiv: 2604.18179 by Ziyang Liu.

**Figure 1.** Commit-open verification. Left: four provider strategies, namely (A) honest, (B) substitute + substitute’s trace (detected), (C) substitute + M-trace (pays honest compute), (D) adaptive φ (bounded by SAE capacity). Right: the provider commits a Merkle root of per-position SAE traces before the verifier’s probe, closing SVIP’s after-the-fact side-channel (dashed).

**Figure 2.** Commit-open mechanics. (1) The provider forwards (x, y) through M and encodes the layer-L residual stream h_t with a public SAE encoder E, keeping the top-32 active features per position. (2) Per-position leaves leaf_t = H(meta, t, tok32_t) are assembled into a Merkle tree with root R; the provider publishes (x, y, R). (3) The verifier samples random positions V and requests Merkle openings, verifying each p…

**Figure 3.** E2 same-family separability (Qwen3-1.7B vs. lifted Qwen3-0.6B), on the legacy per…

**Figure 4.** Rank-64 LoRA attacker frontier on the Qwen3-1.7B target. All four operating points sit above τ_pool = 1.509; STAGEA is the strongest we evaluate (2.78 × τ_pool at 1.55× Pile ppx). Adding utility regularisation moves the attacker up and to the right. This slice of the frontier is an operating range, not a ceiling (Section 6). Detection margin grows with SAE width. A scale-matched cross-family substitute: Qwen2.5…

**Figure 5.** SVIP parallel-serve vs. commit-open across…

**Figure 6.** Two-backbone separability on the per-probe Mahalanobis diagnostic scale (log…

**Figure 7.** Aggregator N-probe sweep: mean AUC across four attacker centers at each α-weakened operating point. ΔAUC between N=1 and N=96 is largest for small α, and AUC plateaus by N≈32.

I. Partial Mechanistic Auditability. We ablate the top-32 features of each probe class and measure the class-specific effect via ΔKL = KL(p_clean ∥ p_abl) − KL(p_clean ∥ p_rec). Of the four circuit classes tested, three local-circuit classes …

**Figure 8.** E9 batched commit overhead. (a) Per-batch latency for forward-only (A) and forward +…

**Figure 9.** Reports the fit and held-out R² of the public-corpus linear alignment map φ for each cross-family attacker, alongside the verifier-side joint z-score. Two attackers (Phi-3.5-mini, OLMo-2-7B) exhibit strongly negative held-out R², confirming that the public-corpus φ does not generalise across model families. Partial explanatory analysis for C3; detection is established independently by the joint z-score. Q…

**Figure 10.** E4 rank-constrained LoRA diagnostic. (a) LoRA’s…

**Figure 11.** E10 circuit-ablation effect matrix. Cell…

**Figure 12.** E6 per-category attackability under the E3 attack suite. Policy: rotate attackable classes;…

**Figure 13.** Feature-forgery attack ladder and library-rotation transfer gap for Qwen3 + Gemma. Even…
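The per-position sketch described in the Figure 2 caption (encode the residual stream with a public SAE encoder, keep the top-32 active features) can be illustrated in a few lines. This is a sketch under assumptions: the encoder form ReLU(W·h + b) is a common SAE convention rather than something the captions confirm, and `sae_topk_sketch`, `W_enc`, and `b_enc` are hypothetical names. Pure Python is used so that the example has no dependencies.

```python
def sae_topk_sketch(h, W_enc, b_enc, k=32):
    """Return the ids of the k largest strictly positive SAE activations.

    h      : residual-stream vector at one position (list of floats)
    W_enc  : encoder weight rows, one row per SAE feature (assumed layout)
    b_enc  : encoder biases, one per SAE feature
    Assumes the encoder is ReLU(W_enc @ h + b_enc); the paper may differ.
    """
    acts = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_enc, b_enc)]
    active = [i for i, a in enumerate(acts) if a > 0.0]
    # Sort by activation, largest first; break ties by feature id so the
    # output is deterministic before it is hashed into a leaf.
    active.sort(key=lambda i: (-acts[i], i))
    return active[:k]


if __name__ == "__main__":
    h = [1.0, -1.0]
    W_enc = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    b_enc = [0.0, 0.0, 0.0, 0.5]
    print(sae_topk_sketch(h, W_enc, b_enc, k=2))
```

A deterministic ordering matters only for the hashing step; the verifier-side scoring still has to tolerate numerical differences between serving backends, which is the role the paper assigns to cross-backend noise calibration.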

Original abstract

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones: Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-≤128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds ≤2.1% to forward-only wall-clock at batch 32.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper sketches a workable Merkle-commit protocol on SAE feature traces to catch hosted LLM substitutions, with experiments separating several attack classes but thin methodological detail.


The main takeaway is that this commit-open scheme, which uses Merkle trees over per-position SAE feature sketches, can close the parallel-serve side channel that earlier probe methods leave open. The reported tests show it rejects 17 attackers (same-family lifts, cross-family swaps, adaptive LoRA up to rank 128) at one shared threshold, while a matched baseline does not. Overhead stays low: the commitment adds at most 2.1 percent to forward-only wall-clock time at batch size 32. The white-box backprop attack through the frozen SAE also fails to close the gap, and a pure forgery attacker is bounded by an intrinsic-dimension argument. That combination of cryptographic commitment and ML probe is the actual novelty here, and it targets a real deployment incentive without requiring the verifier to run the full model in parallel. The construction itself looks clean as described in the abstract.

The soft spots sit in the experimental reporting. The probe library calibration and its independence from test data are not shown in enough detail to judge robustness, there are no error bars or sensitivity checks on the fixed z-score threshold, and the claim of scale-stability rests on three backbones without broader validation. The stress-test concern about a protocol-aware attacker optimizing against the public probes is plausible: even though the tested white-box case did not succeed, an adaptive substitute tuned to the exact opening distribution and probe set could still narrow the margin if the intrinsic-dimension bound proves loose in practice.

This work is aimed at researchers and engineers building auditing tools for production LLM services. Readers focused on model provenance or lightweight cryptographic checks for ML APIs will see the most direct value. It deserves serious referee time because the problem is concrete, the protocol is a clear step forward from prior art, and the experimental separation is encouraging even if the statistics and attack-model coverage need tightening. I would send it out for peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces a commit-open protocol to detect silent model substitution by hosted LLM providers. Providers commit via Merkle tree to per-position SAE feature-trace sketches of outputs at a published probe layer before any verifier opening. Verifiers open random positions, score traces against a public named-circuit probe library (calibrated on cross-backend noise), and apply a fixed-threshold joint-consistency z-score rule. Experiments on Qwen3-1.7B, Gemma-2-2B, and a 4.5x Gemma-2-9B scale-up with 131k-feature SAE show all 17 tested attackers (same-family lifts, cross-family substitutes, rank-≤128 adaptive LoRA) rejected at one shared scale-stable threshold; the same attackers evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack backpropagating through the frozen SAE encoder fails to close the margin, while a feature-forgery attacker never running the honest model is bounded in closed form by an intrinsic-dimension argument. Commitment overhead is ≤2.1% at batch 32.

Significance. If the empirical and theoretical results hold, the work supplies a concrete, low-overhead mechanism for auditing hosted LLMs that closes the parallel-serve side-channel left by probe-after-return schemes. The combination of cryptographic commitment, SAE feature traces, and a public probe library is a novel contribution to the model-auditing literature. Explicit credit is due for testing a diverse attacker suite, including adaptive LoRA and white-box gradient attacks, plus the closed-form intrinsic-dimension bound on forgery. These elements would strengthen practical trust in commercial LLM services if the calibration independence and statistical robustness concerns are resolved.

major comments (3)

[Abstract] The central empirical claim that 'all 17 attackers are rejected at a shared, scale-stable threshold' is presented without error bars, the exact numerical threshold, details on how the probe library was constructed, or statistical justification for the z-score rule. This is load-bearing for the paper's main result.
[Protocol description and §4 (experiments)] The detection rule relies on an empirically calibrated noise model and public probe library, yet no demonstration is given that the library construction and calibration are independent of the 17 test attackers or the specific backbones used. This raises a circularity risk for the fixed-threshold rule against protocol-aware adversaries.
[Attacker evaluation] While the white-box backprop attack through the frozen SAE and the intrinsic-dimension bound on feature forgery are reported, the manuscript does not quantify the margin by which these attacks fail or show that the bound remains non-vacuous once the provider knows the exact opening distribution and z-score rule in advance.

minor comments (2)

[Abstract and results] The abstract and results would benefit from a table listing the exact 17 attackers, their categories, and the per-attacker z-scores or distances to threshold.
[Notation] Notation for the 'per-position SAE-feature-trace sketch' and the joint-consistency z-score should be defined once in a dedicated notation subsection rather than introduced piecemeal.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to incorporate the requested details, clarifications, and additional analysis.


Referee: [Abstract] The central empirical claim that 'all 17 attackers are rejected at a shared, scale-stable threshold' is presented without error bars, the exact numerical threshold, details on how the probe library was constructed, or statistical justification for the z-score rule. This is load-bearing for the paper's main result.

Authors: We agree that the abstract would benefit from greater specificity on these load-bearing elements. In revision we will state the exact shared threshold value, include error bars derived from the repeated experimental runs, briefly describe the probe library construction from the public named-circuit set, and point to the statistical justification of the joint-consistency z-score rule (which appears in §3). These additions will be kept concise while making the central claim fully transparent.

Revision: yes

Referee: [Protocol description and §4 (experiments)] The detection rule relies on an empirically calibrated noise model and public probe library, yet no demonstration is given that the library construction and calibration are independent of the 17 test attackers or the specific backbones used. This raises a circularity risk for the fixed-threshold rule against protocol-aware adversaries.

Authors: The calibration uses cross-backend noise collected from a diverse set of models and the probe library is built from fixed, publicly documented named circuits; neither step incorporates the 17 evaluation attackers. To remove any ambiguity we will add an explicit independence check in the revised §4: we re-calibrate the noise model while deliberately excluding the test backbones and attackers, then verify that the same fixed threshold continues to separate all 17 attackers. This directly demonstrates that the rule is not circular with respect to the reported evaluation set.

Revision: yes

Referee: [Attacker evaluation] While the white-box backprop attack through the frozen SAE and the intrinsic-dimension bound on feature forgery are reported, the manuscript does not quantify the margin by which these attacks fail or show that the bound remains non-vacuous once the provider knows the exact opening distribution and z-score rule in advance.

Authors: We will expand the attacker-evaluation subsection to report the precise margins: the z-score gap between the white-box back-propagation attack and the detection threshold, and the numerical probability bound obtained from the intrinsic-dimension argument. We will also add a short analysis showing that the forgery bound remains non-vacuous even under full knowledge of the opening-position distribution and the z-score rule, because the bound is derived from the SAE feature-space dimensionality and the sparsity constraint rather than from the specific opening schedule.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in protocol derivation or empirical claims


The paper presents a commit-open protocol whose core steps (Merkle commitment to SAE feature-trace sketches, random-position opening, and fixed-threshold z-score decision) are defined independently of the test outcomes. Empirical results on 17 attackers are reported as validation rather than as a derivation that reduces to fitted parameters or self-citations by construction. The public probe library and cross-backend noise calibration are described as external inputs, with no equations or claims showing that the rejection threshold or distinctiveness argument collapses to the input data or prior self-work. The intrinsic-dimension bound on feature-forgery is presented as a closed-form argument separate from the fitted elements. This is the normal case of a self-contained protocol paper whose central claims do not reduce to their own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The protocol rests on the assumption that SAE features at a fixed probe layer capture model identity sufficiently to survive cross-backend noise and adaptive attacks; the z-score threshold and probe library are calibrated quantities whose independence from the evaluation set is not demonstrated in the abstract.

free parameters (2)

joint-consistency z-score threshold: fixed threshold chosen to separate honest from all tested attacker traces; value not stated but described as scale-stable.
probe-library calibration parameters: cross-backend noise model used to score opened positions; treated as public but empirically fitted.
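How a calibrated noise model and a fixed threshold might combine into a single decision can be sketched as follows. This is a stand-in, not the paper's rule: the summary above does not specify how per-probe scores are aggregated, so the sketch swaps in Stouffer's method (sum of per-probe z-scores divided by the square root of their count). It also assumes that larger scores mean larger deviation from honest behavior. The names `calibrate`, `joint_z`, `decide`, and the threshold value in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def calibrate(honest_scores):
    """Fit a per-probe (mean, std) noise model.

    honest_scores: one list per probe, each holding that probe's scores
    from honest runs on different serving backends (assumed input shape).
    """
    calib = []
    for scores in honest_scores:
        sigma = stdev(scores)
        if sigma == 0.0:
            raise ValueError("calibration scores have zero variance")
        calib.append((mean(scores), sigma))
    return calib


def joint_z(scores, calib):
    """Stouffer-style joint z-score over all probes (assumed aggregator)."""
    zs = [(x - mu) / sigma for x, (mu, sigma) in zip(scores, calib)]
    return sum(zs) / sqrt(len(zs))


def decide(scores, calib, tau):
    """Reject the session when the joint z-score exceeds the fixed tau."""
    return "reject" if joint_z(scores, calib) > tau else "accept"


if __name__ == "__main__":
    calib = calibrate([[0.9, 1.0, 1.1], [1.9, 2.0, 2.1]])
    print(decide([1.0, 2.0], calib, tau=3.0))  # scores at the honest mean
    print(decide([1.5, 2.5], calib, tau=3.0))  # scores far above it
```

The point of fixing `tau` ahead of time is that the provider cannot negotiate the threshold after seeing which positions were opened; whether the calibration data are independent of the evaluated attackers is the open question the referee raises.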

axioms (1)

Domain assumption: SAE features extracted at the published probe layer remain distinctive across model families and scales even after cross-backend noise.
Required for the z-score rule to reject substitutions while accepting honest runs.

invented entities (1)

per-position SAE-feature-trace sketch (no independent evidence)
purpose: Compact representation committed via Merkle tree before verification to prevent post-hoc forgery.
New construct introduced to bind the provider to the served output without revealing full tokens.

