pith. sign in

arxiv: 2606.15980 · v2 · pith:VECMRUATnew · submitted 2026-06-14 · 💻 cs.LG · cs.AI· cs.CL

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

Pith reviewed 2026-06-27 03:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords activation monitorsmodel updatesstalenessquantizationfine-tuningsafety probesactivation realignmentLLM deployment
0
0 comments X

The pith

Fine-tuning updates make activation monitors stale while quantization updates preserve them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether activation monitors, lightweight probes trained on a language model's internal representations for safety tasks, stay reliable when the underlying model receives routine updates such as quantization or fine-tuning while the monitor stays frozen. It benchmarks this across multiple safety monitors, model depths, update families, and open-weight models, finding a clear split where quantization-style changes mostly keep probe performance intact but fine-tuning-style changes often render probes unreliable. The work further shows that staleness can be predicted ahead of time from pre-update signals and repaired with a cheap label-free method that realigns activations. A sympathetic reader would care because these monitors form part of deployed safety stacks, and models are rarely left unchanged after initial training.

Core claim

Across multiple safety-relevant monitors, model depths, update families, and open-weight models, quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable. QLoRA is especially damaging despite NF4 quantization alone being relatively benign. Degradation is predictable from pre-deployment features, and cheap label-free activation realignment repairs every repair-relevant stale cell without requiring score calibration, few-label heads, or labeled retraining.

What carries the argument

Systematic comparison of frozen activation-monitor performance before and after distinct update families (quantization versus fine-tuning), with prediction from pre-update features and label-free activation realignment as the repair mechanism.

If this is right

  • Fine-tuning should trigger activation-monitor revalidation by default.
  • Pre-deployment features allow triage of revalidation effort toward the monitors most likely to fail.
  • Label-free activation realignment can serve as the default repair for stale monitors.
  • Privacy and PII monitors warrant closer post-update scrutiny than refusal-compliance monitors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production update pipelines could embed automatic staleness checks that fire specifically on fine-tuning steps.
  • The monitor-dependent fragility pattern indicates that some safety-relevant behaviors rest on more stable internal representations than others under adaptation.
  • The same prediction and repair approach could be tested on continued pre-training or other update styles not covered here.

Load-bearing premise

The observed performance shifts are caused by the update categories themselves rather than other details of the experimental setup, and the tested monitors and models stand in for those used in real deployed systems.

What would settle it

A fine-tuning update that leaves every tested monitor's performance unchanged on the same models and evaluation sets, or a pure quantization update that produces clear staleness.

Figures

Figures reproduced from arXiv: 2606.15980 by Evan Duan.

Figure 1
Figure 1. Figure 1: Update-family split in monitor staleness. Median ∆AUC and big-drop rates separate quantization-flat updates from fine-tune-hot updates. Dashed line: ∆AUC ≤ −0.05 big-drop threshold; error bars: Wilson 95% CIs. adapters, merged with those adapters, or adapted on an already-quantized base (Hu et al. 2022; Dettmers et al. 2022, 2023; Frantar et al. 2023; Lin et al. 2024; Xiao et al. 2023). In each case, the d… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the activation-monitor maintenance pipeline. We benchmark whether probes trained on base-model activations remain reliable after model updates by freezing the base probe and re-evaluating it on updated-model activations. The resulting staleness labels support both a predictor that prioritizes high-risk monitor–update pairs for revalidation and a cost-ordered repair ladder that escalates from la… view at source ↗
Figure 3
Figure 3. Figure 3: Monitor fragility is highly task-dependent. Pri￾vacy/PII probes stale most often, while refusal-compliance probes are comparatively stable, producing a 7.2× spread under the same update distribution. Error bars show Wilson 95% CIs; diamonds mark operational-failure rates. Monitor n Median base AUC Median ∆AUC Median ∆ recall@1% FPR Median ∆ recall@5% FPR Big-drop rate Op. failure rate Harmfulness 630 0.945… view at source ↗
Figure 4
Figure 4. Figure 4: Fine-tuning degrades probes across depth; quantization does not. Median ∆AUC (left) and big-drop rate (right) by layer and update family. Quantization stays near-flat at all depths; LoRA, merged-LoRA, and QLoRA degrade across most layers, QLoRA worst at middle layers. each: privacy/PII 57.14%, harmfulness 35.87%, high-stakes 20.95%, and refusal-compliance 7.94%—a 7× spread be￾tween the most and least fragi… view at source ↗
Figure 5
Figure 5. Figure 5: Prediction triages the revalidation budget effi￾ciently. Fraction of true big-drops caught vs. fraction of cells revalidated, ranking cells by predicted big-drop risk. The learned predictor catches failures far faster than the lookup baseline or random selection — up to +38 points more at 30% budget. CV scheme Best baseline Spearman Best ML Spearman ∆ Leave-update-out 0.6324 0.7239 +0.0915 Leave-monitor-ou… view at source ↗
Figure 6
Figure 6. Figure 6: Staleness reflects representational drift. PCA pro￾jections show that updated activations retain class structure but shift away from the base activation geometry. Label-free realignment maps the updated activations back toward the base space, restoring compatibility with the frozen probe [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Activation monitors -- lightweight probes trained on a language model's internal representations -- are an increasingly common layer in deployment safety stacks. Deployed models however are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. Finally, we test repair strategies and find that cheap label-free activation realignment repairs every repair-relevant stale cell, with none requiring score calibration, few-label heads, or labeled retraining. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, with prediction triaging which monitors to check first and label-free realignment as the default repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper empirically benchmarks whether frozen activation monitors (lightweight probes on LLM internal activations for safety tasks) remain reliable after common model updates. Across multiple monitors, model depths, update families (quantization, fine-tuning, LoRA, merging), and open-weight models, it reports a sharp split: quantization-style updates largely preserve probe performance while fine-tuning-style updates frequently induce staleness. Fragility varies by monitor (PII/privacy probes most affected; refusal probes more stable), degradation is predictable from pre-update features, and label-free activation realignment repairs all tested stale cases without needing calibration or retraining.

Significance. If the results hold under proper controls, the work provides practical guidance for safety deployment: fine-tuning should trigger default revalidation, with prediction enabling triage and label-free realignment as a low-cost fix. This addresses a real gap in non-static model safety stacks and demonstrates monitor-dependent behavior plus repair feasibility. The multi-model, multi-monitor design is a strength for generalizability claims.

major comments (2)
  1. [Experimental Setup] Experimental Setup / Benchmarking sections: The central claim of a sharp split attributable to update category (quantization-style vs. fine-tuning-style) is load-bearing for the paper's conclusions. The manuscript must demonstrate that activation-shift magnitude, parameter delta size, or data-distribution mismatch were matched or statistically controlled across families; without this, the category label may proxy for degree of change rather than type, as the skeptic note highlights.
  2. [Repair Experiments] Repair Experiments section: The claim that label-free realignment repairs 'every repair-relevant stale cell' with none requiring score calibration or labeled retraining is strong and central to the practical takeaway. The paper should report the exact number of stale cells tested, the criteria for 'repair-relevant,' and any edge cases or conditions where realignment failed or required additional steps.
minor comments (2)
  1. [Abstract] Abstract and introduction: Clarify the precise definition of 'quantization-style' vs. 'fine-tuning-style' updates early, including whether QLoRA is treated as a hybrid and how merged adapters are categorized.
  2. [Prediction] Prediction section: The claim that degradation is 'predictable from pre-deployment features' would benefit from reporting the specific features used and the cross-validation procedure to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our experimental design and reporting. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.

read point-by-point responses
  1. Referee: [Experimental Setup] Experimental Setup / Benchmarking sections: The central claim of a sharp split attributable to update category (quantization-style vs. fine-tuning-style) is load-bearing for the paper's conclusions. The manuscript must demonstrate that activation-shift magnitude, parameter delta size, or data-distribution mismatch were matched or statistically controlled across families; without this, the category label may proxy for degree of change rather than type, as the skeptic note highlights.

    Authors: We agree that isolating the effect of update category from magnitude of change requires explicit controls. In the revised manuscript we will add a dedicated analysis subsection that matches or statistically controls for activation-shift magnitude (via cosine distance on activations) and parameter delta size (via L2 norm of weight changes) across the quantization-style and fine-tuning-style families. We will also report data-distribution mismatch statistics where available. This will allow readers to assess whether the observed split persists after accounting for degree of change. revision: yes

  2. Referee: [Repair Experiments] Repair Experiments section: The claim that label-free realignment repairs 'every repair-relevant stale cell' with none requiring score calibration or labeled retraining is strong and central to the practical takeaway. The paper should report the exact number of stale cells tested, the criteria for 'repair-relevant,' and any edge cases or conditions where realignment failed or required additional steps.

    Authors: We will expand the Repair Experiments section to state the exact number of stale cells evaluated, the operational definition of 'repair-relevant' (performance drop exceeding a pre-specified threshold with statistical significance), and confirmation that label-free realignment succeeded in every case without score calibration or labeled retraining. We will also explicitly note that no edge cases requiring additional steps were observed across the tested monitors and models. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmarking

full rationale

The paper is an empirical benchmarking study of activation-monitor performance under model updates (quantization vs. fine-tuning families). No mathematical derivation chain, first-principles results, or 'predictions' that reduce to inputs by construction are present. The claim that degradation is 'predictable from pre-deployment features' refers to training a standard statistical model on observed experimental data to triage monitors, which is not self-definitional or fitted-input-called-prediction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described methods. The central split between update families rests on direct experimental measurements rather than any reduction to prior fitted quantities or self-citations. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is an empirical study; no new theoretical parameters, axioms beyond standard ML, or invented entities are introduced.

axioms (1)
  • domain assumption Activation monitors are trained on base model representations and remain frozen during updates.
    This is the implicit contract tested in the paper.

pith-pipeline@v0.9.1-grok · 5790 in / 1087 out tokens · 65328 ms · 2026-06-27T03:20:32.627461+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

99 extracted references · 19 canonical work pages · 3 internal anchors

  1. [1]

    2016 , eprint=

    Understanding Intermediate Layers Using Linear Classifier Probes , author=. 2016 , eprint=

  2. [2]

    Probing Classifiers: Promises, Shortcomings, and Advances

    Probing Classifiers: Promises, Shortcomings, and Advances , author=. Computational Linguistics , volume=. 2022 , publisher=. doi:10.1162/coli_a_00422 , url=

  3. [3]

    International Conference on Learning Representations , year=

    Discovering Latent Knowledge in Language Models Without Supervision , author=. International Conference on Learning Representations , year=

  4. [4]

    The Internal State of an

    Azaria, Amos and Mitchell, Tom , booktitle=. The Internal State of an. 2023 , publisher=. doi:10.18653/v1/2023.findings-emnlp.68 , url=

  5. [7]

    2023 , eprint=

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets , author=. 2023 , eprint=

  6. [8]

    Proceedings of the 42nd International Conference on Machine Learning , series=

    Detecting Strategic Deception with Linear Probes , author=. Proceedings of the 42nd International Conference on Machine Learning , series=. 2025 , publisher=

  7. [9]

    Advances in Neural Information Processing Systems , year=

    Refusal in Language Models Is Mediated by a Single Direction , author=. Advances in Neural Information Processing Systems , year=

  8. [10]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    A Holistic Approach to Undesired Content Detection in the Real World , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2023 , doi=

  9. [12]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of

    Han, Seungju and Rao, Kavel and Ettinger, Allyson and Jiang, Liwei and Lin, Bill Yuchen and Lambert, Nathan and Choi, Yejin and Dziri, Nouha , booktitle=. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of. 2024 , url=

  10. [14]

    2025 , eprint=

    Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming , author=. 2025 , eprint=

  11. [15]

    arXiv preprint arXiv:2601.04603 , year=

    Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks , author=. arXiv preprint arXiv:2601.04603 , year=

  12. [16]

    2025 , eprint=

    Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation , author=. 2025 , eprint=

  13. [17]

    and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=. 2022 , url=

  14. [18]

    2022 , url=

    Dettmers, Tim and Lewis, Mike and Belkada, Younes and Zettlemoyer, Luke , booktitle=. 2022 , url=

  15. [19]

    2023 , url=

    Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle=. 2023 , url=

  16. [20]

    Proceedings of the 40th International Conference on Machine Learning , series=

    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models , author=. Proceedings of the 40th International Conference on Machine Learning , series=. 2023 , publisher=

  17. [21]

    2023 , eprint=

    Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models , author=. 2023 , eprint=

  18. [22]

    International Conference on Learning Representations , year=

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. International Conference on Learning Representations , year=

  19. [23]

    International Conference on Learning Representations , year=

    Safety Alignment Should Be Made More Than Just a Few Tokens Deep , author=. International Conference on Learning Representations , year=

  20. [24]

    and Jaiswal, Ajay Kumar and Xu, Kaidi and Kailkhura, Bhavya and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo , booktitle=

    Hong, Junyuan and Duan, Jinhao and Zhang, Chenhui and Li, Zhangheng and Xie, Chulin and Lieberman, Kelsey and Diffenderfer, James and Bartoldson, Brian R. and Jaiswal, Ajay Kumar and Xu, Kaidi and Kailkhura, Bhavya and Hendrycks, Dan and Song, Dawn and Wang, Zhangyang and Li, Bo , booktitle=. Decoding Compressed Trust: Scrutinizing the Trustworthiness of ...

  21. [25]

    Exploiting

    Egashira, Kazuki and Vero, Mark and Staab, Robin and He, Jingxuan and Vechev, Martin , booktitle=. Exploiting. 2024 , url=

  22. [26]

    International Conference on Learning Representations , year=

    Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks , author=. International Conference on Learning Representations , year=

  23. [27]

    and Mihalcea, Rada , booktitle=

    Lee, Andrew and Bai, Xiaoyan and Pres, Itamar and Wattenberg, Martin and Kummerfeld, Jonathan K. and Mihalcea, Rada , booktitle=. A Mechanistic Understanding of Alignment Algorithms: A Case Study on. 2024 , publisher=

  24. [28]

    International Conference on Learning Representations , year=

    Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , author=. International Conference on Learning Representations , year=

  25. [29]

    and Wang, Yizhong and Kasai, Jungo and Hajishirzi, Hannaneh and Smith, Noah A

    Liu, Leo Z. and Wang, Yizhong and Kasai, Jungo and Hajishirzi, Hannaneh and Smith, Noah A. , booktitle=. Probing Across Time: What Does. 2021 , publisher=. doi:10.18653/v1/2021.findings-emnlp.71 , url=

  26. [30]

    Sellam, Thibault and Yadlowsky, Steve and Wei, Jason and Saphra, Naomi and D'Amour, Alexander and Linzen, Tal and Bastings, Jasmijn and Turc, Iulia and Eisenstein, Jacob and Das, Dipanjan and Tenney, Ian and Pavlick, Ellie , year=. The. 2106.16163 , archivePrefix=

  27. [31]

    International Conference on Learning Representations , year=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. International Conference on Learning Representations , year=

  28. [32]

    2024 , eprint=

    Interpreting Attention Layer Outputs with Sparse Autoencoders , author=. 2024 , eprint=

  29. [33]

    Kissane, Connor and Krzyzanowski, Robert and Conmy, Arthur and Nanda, Neel , year=

  30. [34]

    International Conference on Learning Representations , year=

    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance , author=. International Conference on Learning Representations , year=

  31. [35]

    Advances in Neural Information Processing Systems , volume=

    Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift , author=. Advances in Neural Information Processing Systems , volume=. 2022 , url=

  32. [36]

    Adversarial

    Ravindran, Santhosh Kumar , journal =. Adversarial. doi:10.48550/ARXIV.2507.09406 , year =

  33. [37]

    and Alam, Md Jahangir and Lee, Yoonpyo and Alam, Syed Bahauddin and Talukder, Sajedul , year =

    Hossain, Ismail and Puppala, Saiteja and Ferdaus, J. and Alam, Md Jahangir and Lee, Yoonpyo and Alam, Syed Bahauddin and Talukder, Sajedul , year =. When

  34. [38]

    Shahani, Prithviraj Singh and Miandoab, Kaveh Eskandari and Scheutz, Matthias , journal =. Noise. doi:10.48550/ARXIV.2505.13500 , year =

  35. [39]

    Probing the

    Gu, Tianle and Huang, Kexin and Wang, Zongqi and Wang, Yixu and Li, Jie and Yao, Yuanqi and Yao, Yang and Yang, Yujiu and Teng, Yan and Wang, Yingchun , journal =. Probing the. doi:10.48550/ARXIV.2506.16078 , year =

  36. [40]

    Zheng, Chujie and Yin, Fan and Zhou, Hao and Meng, Fandong and Zhou, Jie and Chang, Kai-Wei and Huang, Minlie and Peng, Nanyun , journal =. On. doi:10.48550/ARXIV.2401.18018 , year =

  37. [41]

    and Poppi, Samuele and Lukas, Nils , year =

    Aremu, Toluwani and Ognev, D. and Poppi, Samuele and Lukas, Nils , year =. Robust

  38. [42]

    Advances in Neural Information Processing Systems , volume=

    Safe lora: The silver lining of reducing safety risks when finetuning large language models , author=. Advances in Neural Information Processing Systems , volume=

  39. [43]

    S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals

    Han, Peixuan and Qian, Cheng and Chen, Xiusi and Zhang, Yuji and Ji, Heng and Zhang, Denghui , booktitle =. SafeSwitch: Steering. doi:10.18653/v1/2025.findings-emnlp.366 , year =

  40. [44]

    AlignGuard-

    Das, Amitava and Borah, Abhilekh and Jain, Vinija and Chadha, Aman , journal =. AlignGuard-. doi:10.48550/ARXIV.2508.02079 , year =

  41. [45]

    Understanding

    Yeo, Wei Jie and Prakash, Nirmalendu and Neo, Clement and Lee, Roy Ka-Wei and Cambria, Erik and Satapathy, Ranjan , journal =. Understanding. doi:10.48550/ARXIV.2505.23556 , year =

  42. [46]

    Agnihotri, Shashank and Jakubassa, Jonas and Dey, Priyam and Goyal, Sachin and Schiele, Bernt and Radhakrishnan, Venkatesh Babu and Keuper, Margret , journal =. A. doi:10.48550/ARXIV.2510.02768 , year =

  43. [47]

    McGuinness, Max and Serrano, Alex and Bailey, Luke and Emmons, Scott , journal =. Neural. doi:10.48550/ARXIV.2512.11949 , year =

  44. [48]

    RefusalGuard: Geometry-

    Asif, Sadia and Amiri, Mohammad Mohammadi , year =. RefusalGuard: Geometry-

  45. [49]

    Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

    Zhong, Ziqian and Raghunathan, Aditi , journal =. Watch the. doi:10.48550/ARXIV.2508.00161 , year =

  46. [50]

    2025 , title =

    Wu, Chengcan and Zhang, Zhixin and Wei, Zeming and Zhang, Yihao and Sun, Meng , journal =. 2025 , title =

  47. [51]

    ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

    Park, Yein and Park, Jungwoo and Kang, Jaewoo , journal =. ASGuard: Activation-. doi:10.48550/ARXIV.2509.25843 , year =

  48. [52]

    Anchoring

    Du, Yanrui and Fan, Fenglei and Zhao, Sendong and Cao, Jiawei and Lin, Qika and He, Kai and Liu, Ting and Qin, Bing and Feng, Mengling , journal =. Anchoring. doi:10.48550/ARXIV.2509.06795 , year =

  49. [53]

    NeST: Neuron

    Behrouzi, Sasha and Wu, Lichao and Rostami, Mohamadreza and Sadeghi, Ahmad-Reza , journal =. NeST: Neuron. doi:10.48550/ARXIV.2602.16835 , year =

  50. [54]

    AntiDote: Bi-level

    Sanyal, Debdeep and Ray, Manodeep and Mandal, Murari , journal =. AntiDote: Bi-level. doi:10.48550/ARXIV.2509.08000 , year =

  51. [55]

    Li, Shen and Yao, Liuyi and Zhang, Lan and Li, Yaliang , journal =. Safety. doi:10.48550/ARXIV.2408.17003 , year =

  52. [56]

    and Chang, Ethan Y

    Chang, Edward Y. and Chang, Ethan Y. , journal =. Inverse-. doi:10.48550/ARXIV.2510.08648 , year =

  53. [57]

    2025 , title =

    Han, Bing and Zhao, Feifei and Zhao, Dongcheng and Shen, Guobin and Wu, Ping and Shi, Yu and Zeng, Yi , journal =. 2025 , title =

  54. [58]

    Qi, Weiwei and Wu, Zefeng and Zheng, Tianhang and Zhang, Zikang and Jia, Xiaojun and Qin, Zhan and Ren, Kui , year =. Towards

  55. [59]

    2026 , eprint=

    Perplexity Can Miss SAE Feature Damage Under Quantization , author=. 2026 , eprint=

  56. [60]

    2026 , eprint=

    Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects , author=. 2026 , eprint=

  57. [64]

    PIIBench: A Unified Multi-Source Benchmark Corpus for PII Detection , author=

  58. [65]

    2023 , eprint=

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers , author=. 2023 , eprint=

  59. [66]

    Proceedings of machine learning and systems , volume=

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration , author=. Proceedings of machine learning and systems , volume=

  60. [69]

    Computer Science Review , volume=

    Locate, steer, and improve: A practical survey of actionable mechanistic interpretability in large language models , author=. Computer Science Review , volume=. 2026 , publisher=

  61. [70]

    Alain, G.; and Bengio, Y. 2016. Understanding Intermediate Layers Using Linear Classifier Probes. arXiv:1610.01644

  62. [71]

    Arditi, A.; Obeso, O.; Syed, A.; Paleka, D.; Panickssery, N.; Gurnee, W.; and Nanda, N. 2024. Refusal in Language Models Is Mediated by a Single Direction. In Advances in Neural Information Processing Systems

  63. [72]

    Azaria, A.; and Mitchell, T. 2023. The Internal State of an LLM Knows When It's Lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, 967--976. Association for Computational Linguistics

  64. [73]

    Baek, C.; Jiang, Y.; Raghunathan, A.; and Kolter, J. Z. 2022. Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift. In Advances in Neural Information Processing Systems, volume 35, 19274--19289

  65. [74]

    Belinkov, Y. 2022. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, 48(1): 207--219

  66. [75]

    Burns, C.; Ye, H.; Klein, D.; and Steinhardt, J. 2023. Discovering Latent Knowledge in Language Models Without Supervision. In International Conference on Learning Representations

  67. [76]

    Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; and Sharkey, L. 2024. Sparse Autoencoders Find Highly Interpretable Features in Language Models. In International Conference on Learning Representations

  68. [77]

    Das, S.; and Fioretto, F. 2026. NeuroFilter: Privacy Guardrails for Conversational LLM Agents. arXiv preprint arXiv:2601.14660

  69. [78]

    Dettmers, T.; Lewis, M.; Belkada, Y.; and Zettlemoyer, L. 2022. LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale. In Advances in Neural Information Processing Systems, volume 35, 30318--30332

  70. [79]

    Dettmers, T.; Pagnoni, A.; Holtzman, A.; and Zettlemoyer, L. 2023. QLoRA : Efficient Finetuning of Quantized LLM s. In Advances in Neural Information Processing Systems, volume 36

  71. [80]

    Du, Y.; Fan, F.; Zhao, S.; Cao, J.; Lin, Q.; He, K.; Liu, T.; Qin, B.; and Feng, M. 2025. Anchoring Refusal Direction : Mitigating Safety Risks in Tuning via Projection Constraint . arXiv.org

  72. [81]

    Duan, E. 2026 a . Perplexity Can Miss SAE Feature Damage Under Quantization. arXiv:2606.03002

  73. [82]

    Duan, E. 2026 b . Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects. arXiv:2606.08365

  74. [83]

    Egashira, K.; Vero, M.; Staab, R.; He, J.; and Vechev, M. 2024. Exploiting LLM Quantization. In Advances in Neural Information Processing Systems

  75. [84]

    Frantar, E.; Ashkboos, S.; Hoefler, T.; and Alistarh, D. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323

  76. [85]

    C.; Neyshabur, B.; and Sedghi, H

    Garg, S.; Balakrishnan, S.; Lipton, Z. C.; Neyshabur, B.; and Sedghi, H. 2022. Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. In International Conference on Learning Representations

  77. [86]

    Goldowsky-Dill, N.; Chughtai, B.; Heimersheim, S.; and Hobbhahn, M. 2025. Detecting Strategic Deception with Linear Probes. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR

  78. [87]

    Han, P.; Qian, C.; Chen, X.; Zhang, Y.; Ji, H.; and Zhang, D. 2025. SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals . In Findings of the Association for Computational Linguistics : EMNLP 2025 , 6936--6955. Association for Computational Linguistics

  79. [88]

    Y.; Lambert, N.; Choi, Y.; and Dziri, N

    Han, S.; Rao, K.; Ettinger, A.; Jiang, L.; Lin, B. Y.; Lambert, N.; Choi, Y.; and Dziri, N. 2024. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLM s. In Advances in Neural Information Processing Systems

  80. [89]

    R.; Jaiswal, A

    Hong, J.; Duan, J.; Zhang, C.; Li, Z.; Xie, C.; Lieberman, K.; Diffenderfer, J.; Bartoldson, B. R.; Jaiswal, A. K.; Xu, K.; Kailkhura, B.; Hendrycks, D.; Song, D.; Wang, Z.; and Li, B. 2024. Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLM s Under Compression. In Proceedings of the 41st International Conference on Machine Learn...

Showing first 80 references.