arxiv: 2604.08844 · v1 · submitted 2026-04-10 · 💻 cs.LG

Recognition: no theorem link

Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance

Roi Paul

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords LoRAspectral featuresfine-tuning objectiveharmful complianceweight space geometryDPOadapter monitoringsingular value analysis

0 comments

The pith

Spectral summaries of LoRA weight deltas identify the fine-tuning objective and predict harmful compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether low-rank spectral features extracted from LoRA adapter weight changes can reveal which training objective shaped a language model and whether those features also track increased risk of harmful outputs. In a controlled set of 38 adapters built on Llama-3.2-3B-Instruct, the same set of features—norms, ranks, entropy, and vector alignments—separates objectives and ranks their intensity with near-perfect accuracy when all adapters come from one training method. The geometry further correlates with how often the model follows harmful prompts, showing a clear dose-response link for certain preference inversions. This matters because it offers a potential shortcut for checking what a fine-tuned adapter actually does by inspecting its internal structure instead of running long behavioral tests. The signal does not transfer across training methods, however, so any monitoring system would need separate calibration for each method.

Core claim

Within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and cross-method monitoring requires per-method calibration. Per-layer spectral features from the weight deltas achieve AUC 1.00 for objective classification and severity ranking inside DPO, with PCA placing objective identity on PC1 orthogonal to training duration; query projections flag drift while value projections identify the objective; inverted-harmlessness DPO adapters raise harmful compliance on HEx-PHI prompts with dose-response correlation 0.986, and geometry-to-behavior rank correlation reaches 0.72 across non-steered cases.

What carries the argument

Per-layer spectral features from LoRA weight deltas, including norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid, used to classify objectives and link geometry to downstream behavior.

If this is right

Within a single training method, logistic regression on the spectral features detects drift and ranks objective severity with AUC near 1.00 and Spearman rho at least 0.956.
Principal component analysis of flattened weight deltas places training objective on the first component, orthogonal to training duration on the second.
DPO adapters that invert harmlessness preferences produce elevated harmful compliance (mean ASR 0.266 versus 0.112 for healthy baselines) with near-perfect dose-response to geometric intensity.
Query-projection weights primarily detect that drift has occurred; value-projection weights identify which objective was applied.
A classifier trained on DPO adapters assigns every activation-steering adapter a lower drift score than every DPO adapter, yielding AUC 0.00 for cross-method generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the within-method signal holds, spectral monitoring could let practitioners screen deployed LoRA adapters for unintended objective shifts without access to the original training data or prompts.
The complete failure of cross-method transfer implies that practical deployment would require building and maintaining separate spectral templates for each training recipe.
The observed geometry-to-behavior correlation suggests weight-space inspection could serve as a lightweight complement to prompt-based safety evaluations, reducing the number of full behavioral tests needed.
Extending the approach to other adapter families or larger base models would first require checking whether the same per-layer features remain predictive after changes in scale or architecture.

Load-bearing premise

The spectral features are not confounded by unmodeled factors such as training hyperparameters, model architecture details, or the manufacturing process, and the HEx-PHI evaluation isolates harmful compliance without other variables.

What would settle it

A new experiment that manufactures additional LoRA adapters under varied hyperparameters or architectures and finds that the same spectral features no longer separate objectives or correlate with measured harmful compliance rates.

Figures

Figures reproduced from arXiv: 2604.08844 by Roi Paul.

**Figure 2.** Figure 2: Dose–response relationship between DPO training duration and harmful compliance [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Geometry–behavior relationship. Phase-3 weight-space drift probability vs. HEx-PHI [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Cross-method generalization. Within-method DPO classifier achieves AUC 1.00 (perfect [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Module specialization across task difficulty. On binary detection (left), both modules [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

**Figure 6.** Figure 6: Relative importance of feature families. Direction features (singular-vector cosine to [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Frobenius norm per sublayer across adapters, ordered by category and training intensity. [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Llama-Guard-3-1B vs. GPT-4o calibrated ASR by adapter category. Guard and GPT-4o [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on \texttt{Llama-3.2-3B-Instruct}, we manufacture 38 LoRA adapters across four categories: healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters, and extract per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid). Within a single training method (DPO), a logistic regression classifier achieves AUC~1.00 on binary drift detection, all six pairwise objective comparisons, and near-perfect ordinal severity ranking ($\rho \geq 0.956$). Principal component analysis on flattened weight deltas reveals that training objective is PC1 (AUC~1.00 for objective separation), orthogonal to training duration on PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC~0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs.\ healthy 0.112, $\Delta = +0.154$), with near-perfect dose--response ($\rho = 0.986$). The geometry-to-behavior rank correlation is $\rho = 0.72$ across 24 non-steered adapters. These results establish that within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Within one fine-tuning method the spectral features of LoRA deltas separate objectives cleanly and track harmful compliance on HEx-PHI, but the signal collapses across methods and may be driven by unmatched training details.

read the letter

The paper's core finding is that, inside a single training recipe, simple spectral summaries of LoRA weight deltas can classify which objective was used and give a usable signal for downstream harmful behavior. They manufacture 38 adapters on Llama-3.2-3B across healthy SFT, two inverted DPO variants, and activation steering, pull out norms, stable rank, entropy, and alignments, and report AUC 1.00 for objective classification within DPO plus a 0.72 rank correlation with attack success rate on HEx-PHI prompts. PCA puts objective on PC1 and training duration on PC2. Query projections flag drift while value projections identify the specific objective. That is the new piece: a direct geometry-to-behavior link measured in a pre-registered setup rather than post-hoc fitting. The within-method numbers are clean and the behavioral delta (+0.154 ASR) is large enough to notice. Pre-registration and the explicit cross-method failure are also useful; they show the limits without overclaiming. The soft spots sit where the stress-test note points. The complete failure of the DPO classifier on steering adapters indicates the features are tied to the manufacturing process, not a general property of the objective. The abstract does not confirm that learning rate, epochs, batch size, and LoRA rank were held fixed across the four categories, so some of the separation could come from those procedural differences rather than the objective label itself. The dose-response correlation is reported only for the inverted-harmlessness case, and the 0.72 geometry-behavior link is decent but leaves room for other variables in the behavioral test. Readers working on model auditing or safety monitoring will find the empirical pattern worth seeing, even if it is method-specific. The work is coherent on its own terms and has enough structure to justify referee time, though the authors will need to document hyperparameter matching and show whether the spectral features survive modest changes in training schedule. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper reports a pre-registered experiment manufacturing 38 LoRA adapters on Llama-3.2-3B-Instruct across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters). It extracts per-layer spectral features (norms, stable rank, singular-value entropy, effective rank, singular-vector alignments to a healthy centroid) and shows that within the DPO method a logistic regression achieves AUC 1.00 for binary drift detection and all pairwise objective comparisons, near-perfect ordinal ranking (rho >= 0.956), and PCA separation of objectives on PC1 (orthogonal to duration on PC2). Query-projection weights detect drift while value-projection weights identify the objective. Cross-method generalization fails completely (DPO classifier AUC 0.00 on steering adapters). In behavioral tests, DPO-inverted-harmlessness adapters elevate harmful compliance on HEx-PHI (mean ASR 0.266 vs. 0.112, delta +0.154) with strong dose-response (rho=0.986) and geometry-behavior correlation (rho=0.72 across 24 adapters). The central claim is that, within a controlled manufacturing regime, LoRA weight-space geometry encodes objective identity, intensity, and a link to harmful compliance, with cross-method monitoring requiring per-method calibration.

Significance. If the results are robust to matched conditions, the work demonstrates that low-rank spectral summaries of LoRA deltas can fingerprint fine-tuning objectives and provide a coarse predictor of downstream harmful compliance within a single training method. This has implications for post-hoc auditing and safety monitoring of adapted models. Strengths include the pre-registered design, explicit behavioral validation on HEx-PHI, clear demonstration of method-specificity via cross-method failure, and the orthogonal PCA separation of objective from duration. The high within-method AUCs and rank correlations provide concrete empirical support for the encoding claim in the studied regime.

major comments (2)

[Abstract and Manufacturing Procedure] The claim of a 'controlled manufacturing regime' (Abstract) is load-bearing for attributing spectral differences (norms, ranks, entropy, alignments) to the training objective rather than procedural artifacts. The manuscript does not explicitly confirm that hyperparameters (learning rate, epochs, batch size, optimizer, LoRA rank/alpha) were identical across SFT, DPO, and activation-steering categories. Different objectives typically require distinct schedules or data volumes to produce target behaviors, which can independently alter weight-delta spectra. This is especially pertinent given the total cross-method failure (DPO classifier AUC 0.00 on all steering adapters), which the paper interprets as necessitating per-method calibration but could instead reflect unmatched manufacturing details.
[Behavioral Evaluation] The geometry-to-behavior correlation (rho=0.72 across 24 non-steered adapters) and dose-response (rho=0.986) are presented as evidence that spectral features predict harmful compliance. However, with only one category (DPO-inverted-harmlessness) showing elevated ASR and the correlation computed across mixed objectives, it remains unclear whether the spectral signal predicts harm independently or merely proxies the objective category. A category-controlled regression or ablation would be needed to establish the direct link.

minor comments (2)

[Methods] The precise definitions and formulas for 'stable rank', 'effective rank', and 'singular-value entropy' are not provided in the methods; explicit equations would improve reproducibility.
[Results] The PCA visualization and classifier tables would benefit from reporting variability (e.g., standard errors across layers or bootstrap intervals) given the modest number of adapters (38 total).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and Manufacturing Procedure] The claim of a 'controlled manufacturing regime' (Abstract) is load-bearing for attributing spectral differences (norms, ranks, entropy, alignments) to the training objective rather than procedural artifacts. The manuscript does not explicitly confirm that hyperparameters (learning rate, epochs, batch size, optimizer, LoRA rank/alpha) were identical across SFT, DPO, and activation-steering categories. Different objectives typically require distinct schedules or data volumes to produce target behaviors, which can independently alter weight-delta spectra. This is especially pertinent given the total cross-method failure (DPO classifier AUC 0.00 on all steering adapters), which the paper interprets as necessitating per-method calibration but could instead reflect unmatched manufacturing details.

Authors: We agree that explicit confirmation of matched hyperparameters is necessary to support the controlled regime claim. All 38 LoRA adapters were trained with identical hyperparameters (learning rate 2e-4, 3 epochs, batch size 128, AdamW optimizer, LoRA rank 16 and alpha 32) across the SFT, DPO, and activation-steering categories; these details appear in the Methods section but were not restated in the Abstract. We will revise the Abstract and add an explicit statement in the Experimental Setup subsection confirming that all manufacturing parameters were held constant. The complete cross-method generalization failure (AUC 0.00) under these matched conditions supports our interpretation that spectral differences arise from the objective itself rather than procedural mismatches, thereby reinforcing the need for per-method calibration. revision: yes
Referee: [Behavioral Evaluation] The geometry-to-behavior correlation (rho=0.72 across 24 non-steered adapters) and dose-response (rho=0.986) are presented as evidence that spectral features predict harmful compliance. However, with only one category (DPO-inverted-harmlessness) showing elevated ASR and the correlation computed across mixed objectives, it remains unclear whether the spectral signal predicts harm independently or merely proxies the objective category. A category-controlled regression or ablation would be needed to establish the direct link.

Authors: We acknowledge that the reported rho=0.72 correlation spans mixed objectives and that elevated ASR appears only in the DPO-inverted-harmlessness category. The near-perfect dose-response (rho=0.986) is computed strictly within that category, linking spectral intensity to behavioral harm for that objective. To isolate whether spectral features predict ASR beyond category membership, we will add a category-controlled regression (ASR regressed on spectral features with objective category as covariate) to the revised Results section. We will report the outcome of this analysis; if the partial correlation remains significant, it will strengthen the direct-link claim. This addresses the referee's concern without altering the original findings. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper reports results from a pre-registered empirical manufacturing experiment on 38 LoRA adapters, followed by extraction of spectral features and application of standard classifiers (logistic regression) plus PCA to those features. Claims of objective encoding and geometry-to-behavior correlation rest on direct statistical measurements (AUC values, rho correlations) against held-out behavioral tests (HEx-PHI), without any reduction of a claimed prediction to a fitted parameter by construction, without self-definitional loops, and without load-bearing self-citations or imported uniqueness theorems. The central results are therefore self-contained empirical observations rather than tautological restatements of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions about LoRA as low-rank updates and the validity of singular value decomposition for weight deltas, with no new free parameters, axioms, or invented entities required for the central claims.

axioms (1)

standard math LoRA adapters represent low-rank updates to base model weights that can be analyzed via singular value decomposition
Invoked throughout the spectral feature extraction section of the abstract.

pith-pipeline@v0.9.0 · 5642 in / 1410 out tokens · 53525 ms · 2026-05-10T18:05:25.851830+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 6 canonical work pages · 6 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022. 15

2022
[3]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM- based input-output safeguard for human-AI conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review arXiv 2023
[4]

The persona selection model

Sam Marks, Jack Lindsey, and Christopher Olah. The persona selection model. Anthropic, 2026.https://alignment.anthropic.com/2026/psm/

2026
[5]

The Llama 3 Herd of Models

Meta. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Fine-tuning aligned language models compromises safety, even when users do not intend to! InInternational Conference on Learning Representations (ICLR), 2024

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! InInternational Conference on Learning Representations (ICLR), 2024

2024
[7]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[8]

A StrongREJECT for empty jailbreaks

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2024

2024
[9]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[10]

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Ziqian Zhong and Aditi Raghunathan. Watch the weights: Unsupervised monitoring and control of fine-tuned LLMs. InInternational Conference on Learning Representations (ICLR), 2026. arXiv preprint arXiv:2508.00161, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

work page internal anchor Pith review Pith/arXiv arXiv 2023