arxiv: 2509.26238 · v4 · submitted 2025-09-30 · 💻 cs.LG

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield , Philip Torr , Ioannis Patras , Adel Bibi , Fazl Barez This is my paper

Pith reviewed 2026-05-18 12:35 UTC · model grok-4.3

classification 💻 cs.LG

keywords language model safetyactivation monitoringpolynomial classifiersdynamic computationharmful prompt detectioninterpretable probesadaptive cascades

0 comments

The pith

Truncated polynomial classifiers let safety monitors for language models adjust compute based on input difficulty while matching MLP probe accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional monitors for harmful requests in large language models apply the same compute to every query, which wastes resources on easy cases and risks errors on subtle ones. This paper introduces truncated polynomial classifiers that start from linear probes on model activations and add polynomial terms one by one. Because the terms train and evaluate progressively, the system can stop after a few terms for clear inputs or continue for stronger classification when needed. The method supports both a tunable safety level and an adaptive cascade that routes ambiguous cases to higher-order checks. Experiments across four models up to 30 billion parameters on the WildGuardMix dataset show these classifiers reach or exceed the accuracy of same-size MLP probes while remaining more interpretable.

Core claim

Truncated Polynomial Classifiers extend linear probes by fitting polynomial terms on LLM activations that can be learned and evaluated sequentially. At inference, low-order terms handle straightforward cases with early exit, while additional terms supply stronger guardrails only when inputs remain ambiguous. On WildGuardMix, across four models, the classifiers compete with or outperform MLP baselines of equivalent size for harmful-prompt classification and supply per-term interpretability that black-box probes lack.

What carries the argument

Truncated Polynomial Classifiers (TPCs), which add successive polynomial terms to linear probes on activations so that monitoring depth can increase term by term without retraining.

If this is right

The same trained classifier supports both low-cost monitoring and higher-accuracy modes by simply evaluating more terms.
Clear inputs exit after cheap low-order checks, lowering average compute for safety monitoring.
Regulators or developers can raise the required protection level by mandating evaluation of additional terms.
Each added term contributes an interpretable increment to the decision, unlike opaque MLP layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same progressive structure could support incremental retraining when new harmful examples appear.
TPCs might combine with layer-specific activations to create monitors that focus on particular internal representations.
Early-exit thresholds could be tuned per domain to balance speed and coverage without changing the underlying model.

Load-bearing premise

That polynomial terms can be trained progressively while still allowing reliable early stopping that does not miss subtle harmful prompts.

What would settle it

A test set of harmful prompts where TPCs stopped after low-order terms miss a substantial fraction of cases that either full TPCs or MLP probes correctly flag.

Figures

Figures reproduced from arXiv: 2509.26238 by Adel Bibi, Fazl Barez, Ioannis Patras, James Oldfield, Philip Torr.

**Figure 1.** Figure 1: Dynamic activation monitoring with truncated polynomial classifiers: ① We train an order-N polynomial in the LLMs’ activations z ∈ R D as a binary classifier of harmful prompts. ② At test-time, any number of n ≤ N terms can be evaluated to fit a variety of compute budgets; higher-order terms providing stronger guardrails only when necessary. need finetuning/prompting limits their flexibility. In contrast, … view at source ↗

**Figure 2.** Figure 2: Results on WildGuardMix (gemma-3, gpt-oss): F1 score on harmful prompt classification for probes evaluated with increasing compute at inference-time (full results in Section F). then train a single N = 5 degree polynomial with CP rank R = 64 for all models. We train all models 5 times with different random seeds. We compute the F1 scores for every n = {1, . . . , 5} truncated submodel P [5] :n (z), taking… view at source ↗

**Figure 3.** Figure 3: Cascading defense: following Algorithm 1, inputs are propagated to higher-order terms only if the classification at the previous degree is uncertain (dictated by τ ). This provides similar accuracy to the full model at only a fraction of the net compute. parameters used to classify all prompts in the validation/test splits, for a chosen confidence threshold τ 1 . The models here are trained with the best h… view at source ↗

**Figure 4.** Figure 4: Progressive training produces capable guardrail sub-models from all TPC truncations. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 6.** Figure 6: Full baseline comparisons on WildGuardMix and BeaverTails for chosen rank R = 64: F1 score on harmful prompt classification for probes evaluated with increasing compute at testtime. All baselines are parameter-matched to TPCs, and have dedicated hyperparameter sweeps. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Full baseline comparisons on WildGuardMix: Cascaded evaluation for TPCs vs earlyexit MLPs. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

**Figure 8.** Figure 8: Calibration plots: both dynamic models are relatively well calibrated. [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Test set accuracy per harm sub-category, for gemma-3-27b-it at layer L = 40, vs EEMLP: the full TPC brings up to 10% accuracy over linear probes for some sub-categories of harm. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Progressive training, ablations: models trained with and without progressive training on gemma-3-27b and gpt-oss-20b models. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Progressive training, ablations: models trained with and without progressive training on Qwen3-30B-A3B-Base and Llama-3.2-3B models. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 12.** Figure 12: Inference time costs (latency and throughput) for varying batch sizes: EE-MLPs are faster than TPCs at small batch sizes. However, for medium-large batch sizes, TPCs have similar speeds at half precision and we find them to be even faster at full precision. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Rank ablation for WildGuardMix and BeaverTails. F1 score on harmful prompt classification for probes evaluated with increasing compute at test-time. A total of 64 separate TPC models are trained across ranks {32, 64, 128, 256}: a rank of 64 emerges as a sensible choice. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Ablation (maximum degree N): training a single high-degree N = 10 polynomial (with the default R = 64); we find diminishing returns from very high-degree terms. 0.00002% 0.00005% 0.00007% 0.00010% Params. at test-time (as % of base model params.) 99.2 99.3 99.4 99.4 99.5 99.6 99.7 99.8 99.8 99.9 F1 score WildGuard (train) gemma-3-27b-it (L40) TPC (ours) Linear probe 0.00002% 0.00005% 0.00007% 0.00010% Par… view at source ↗

**Figure 15.** Figure 15: Rank R = 1 ablation: we find that using the smallest possible rank leads to models often struggling to reliably improve over linear probes. We suggest a minimum rank of ∼ 32 when training TPCs. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗

**Figure 16.** Figure 16: Ablation (symmetric vs non-symmetric CP): F1 score on harmful prompt classification for probes evaluated with increasing compute at test-time, for the two parameterizations on WildGuardMix. Across ranks {32, 64, 128, 256} the symmetric CP maintains similar performance to the unconstrained CP, with a fraction of the parameter count. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_16.png] view at source ↗

**Figure 17.** Figure 17: Full baseline comparisons on WildGuardMix and BeaverTails for rank R = 32: F1 score on harmful prompt classification for probes evaluated with increasing compute at test-time. All baselines are parameter-matched to TPCs, and have dedicated hyperparameter sweeps. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗

**Figure 18.** Figure 18: Full baseline comparisons on WildGuardMix and BeaverTails for rank R = 128: F1 score on harmful prompt classification for probes evaluated with increasing compute at test-time. All baselines are parameter-matched to TPCs, and have dedicated hyperparameter sweeps. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_18.png] view at source ↗

**Figure 19.** Figure 19: Full baseline comparisons on WildGuardMix and BeaverTails for rank R = 256: F1 score on harmful prompt classification for probes evaluated with increasing compute at test-time. All baselines are parameter-matched to TPCs, and have dedicated hyperparameter sweeps. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_19.png] view at source ↗

read the original abstract

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On WildGuardMix, across 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size for harmful prompt classification, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TPCs give a workable way to make safety probes adaptive via term truncation but the dynamic claims rest on thin evidence without per-order ablations.

read the letter

The main point is that this paper takes linear probes and turns them into truncated polynomials so you can evaluate more terms only when the input looks ambiguous. That setup aims to cut average compute while keeping the option to strengthen the guardrail on demand. It is a straightforward engineering move rather than a conceptual leap, and the abstract shows they get competitive accuracy against same-size MLPs on WildGuardMix across four models up to 30B parameters. Releasing code helps too, and the interpretability angle from looking at coefficients is a reasonable plus over black-box alternatives. What they do well is demonstrate that the overall classification performance holds up without obvious degradation from the polynomial form. The soft spots sit in the dynamic modes. The safety dial and adaptive cascade both require that low-order truncations reliably classify clear cases so higher terms can be skipped. The abstract gives no per-order false-negative rates, no comparison of joint versus sequential term training, and no ablation showing that truncation at test time matches a model trained only on the lower terms. If the fit is done on the full polynomial, the lower coefficients may be influenced by higher ones and early stopping could miss subtle harms more often than claimed. That leaves the main practical selling point only partly supported. This paper is for people building or deploying activation-based safety monitors who care about compute budgets in production. A reader working on similar engineering constraints would pick up a usable idea and some baseline numbers. It deserves a serious referee because the core technique is simple to test and the reported results are at least on par with existing probes; the review would mainly push for the missing early-exit diagnostics rather than reject the work outright.

Referee Report

2 major / 2 minor

Summary. The paper introduces Truncated Polynomial Classifiers (TPCs) as a dynamic extension of linear probes for LLM activation monitoring. Polynomials over activations are trained to allow term-by-term progressive evaluation at test time, enabling a safety dial (more terms for stronger detection) and an adaptive cascade (early exit on clear cases after low-order checks, deferring ambiguous inputs to higher-order terms). On WildGuardMix across four models up to 30B parameters, TPCs are reported to compete with or outperform same-size MLP probe baselines for harmful prompt classification while being more interpretable; code is released.

Significance. If the progressive training and early-stopping properties hold with reliable per-order accuracy, TPCs would offer a meaningful advance over fixed-cost probes by providing tunable compute-accuracy trade-offs and improved interpretability for safety monitoring. The open code supports reproducibility and further testing of the dynamic modes.

major comments (2)

[§4] §4 (Experiments on WildGuardMix): overall accuracy is compared to MLP baselines, but the results do not report per-order early-exit false-negative rates, statistical significance, or exact metric definitions. This is load-bearing for the adaptive-cascade claim, as low-order truncations must reliably classify clear cases without missing subtle harmful prompts.
[§3] §3 (Method, progressive training): it is not specified whether coefficients are fit jointly on the full polynomial or added sequentially term-by-term. Joint fitting optimizes lower-order terms in the presence of higher-order ones, so test-time truncation may degrade performance relative to a dedicated low-order model; an ablation of joint vs. sequential fitting is required to verify reliable early stopping.

minor comments (2)

[Abstract] Clarify in the abstract and §4 what 'same size' means for the MLP baselines (parameter count, hidden dimension, or FLOPs).
[§5] The interpretability advantage is asserted but would benefit from a concrete example (e.g., coefficient inspection on a specific harmful prompt) in §5.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our dynamic monitoring approach. We address each major point below and commit to revisions that strengthen the validation of TPCs' progressive properties.

read point-by-point responses

Referee: [§4] §4 (Experiments on WildGuardMix): overall accuracy is compared to MLP baselines, but the results do not report per-order early-exit false-negative rates, statistical significance, or exact metric definitions. This is load-bearing for the adaptive-cascade claim, as low-order truncations must reliably classify clear cases without missing subtle harmful prompts.

Authors: We agree that per-order early-exit false-negative rates are essential to substantiate the adaptive-cascade claim. In the revised version we will add a dedicated table and figure reporting false-negative rates at each truncation order on WildGuardMix, along with statistical significance tests (McNemar’s test for paired accuracy comparisons against MLP baselines) and explicit metric definitions, including the confidence threshold used for early-exit decisions. These additions will directly demonstrate that low-order truncations do not miss subtle harmful prompts at unacceptable rates. revision: yes
Referee: [§3] §3 (Method, progressive training): it is not specified whether coefficients are fit jointly on the full polynomial or added sequentially term-by-term. Joint fitting optimizes lower-order terms in the presence of higher-order ones, so test-time truncation may degrade performance relative to a dedicated low-order model; an ablation of joint vs. sequential fitting is required to verify reliable early stopping.

Authors: We appreciate the referee highlighting this underspecification. Our current implementation fits all coefficients jointly via a single regularized optimization over the full polynomial. To verify that joint fitting supports reliable early stopping, we will include a new ablation in the revised manuscript that compares joint fitting against a sequential term-by-term procedure. The ablation will report per-order accuracy and false-negative rates under both regimes on the same data splits, allowing readers to assess any degradation from joint optimization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; TPC claims rest on empirical comparisons to external baselines

full rationale

The paper introduces Truncated Polynomial Classifiers (TPCs) as a method for dynamic safety monitoring via progressive term-by-term polynomial evaluation on LLM activations. Performance is demonstrated through direct empirical comparisons against MLP probe baselines of equivalent size on the WildGuardMix dataset across multiple models. No equations, derivations, or self-citations are shown that reduce the reported accuracy or dynamic-mode reliability to quantities defined tautologically by the fitting procedure itself. The progressive training and truncation are presented as methodological features whose effectiveness is tested against independent baselines rather than assumed by construction. This qualifies as a self-contained empirical contribution with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. Standard machine-learning assumptions about activation separability and polynomial expressivity are implicit but not enumerated.

pith-pipeline@v0.9.0 · 5802 in / 1128 out tokens · 34692 ms · 2026-05-18T12:35:39.117995+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Mechanistic to Compositional Interpretability
cs.LG 2026-05 unverdicted novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[4]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page
[5]

https://github.com/facebookresearch/fvcore

fvcore: Flop counter for pytorch models. https://github.com/facebookresearch/fvcore

work page
[6]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Understanding intermediate layers using linear classifier probes, 2017

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes, 2017. URL https://openreview.net/forum?id=ryF7rTqgl

work page 2017
[8]

Bowman, Ethan Perez, Roger Baker Grosse, and David Duvenaud

Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, Ja...

work page 2024
[9]

Linear complexity self-attention with 3rd order polynomials

Francesca Babiloni, Ioannis Marras, Jiankang Deng, Filippos Kokkinos, Matteo Maggioni, Grigorios Chrysos, Philip Torr, and Stefanos Zafeiriou. Linear complexity self-attention with 3rd order polynomials. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 45 0 (11): 0 12726--12737, 2023

work page 2023
[10]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Greedy layerwise learning can scale to ImageNET

Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can scale to ImageNET . In Int. Conf. Mach. Learn. (ICML), pp.\ 583--593. PMLR, 2019

work page 2019
[12]

Using dictionary learning features as classifiers

Trenton Bricken, Jonathan Marcus, Siddharth Mishra-Sharma, Meg Tong, Ethan Perez, Mrinank Sharma, Kelley Rivoire, and Thomas Henighan. Using dictionary learning features as classifiers. Transformer‑Circuits.pub, oct 2024. URL https://transformer-circuits.pub/2024/features-as-classifiers/index.html. Edited by Adam Jermyn

work page 2024
[13]

Discovering latent knowledge in language models without supervision

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In Int. Conf. Learn. Represent. (ICLR), 2023. URL https://openreview.net/forum?id=ETKGuby0hcs

work page 2023
[14]

eckart-young

J. Douglas Carroll and Jih Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika, 35: 0 283--319, 1970

work page 1970
[15]

Safetynet: Detecting harmful outputs in llms by modeling and monitoring deceptive behaviors

Maheep Chaudhary and Fazl Barez. Safetynet: Detecting harmful outputs in llms by modeling and monitoring deceptive behaviors. arXiv preprint arXiv:2505.14300, 2025

work page arXiv 2025
[16]

Enhancing model safety through pretraining data filtering

Yanda Chen, Mycal Tucker, Nina Panickssery, Tony Wang, Francesco Mosconi, Anjali Gopal, Carson Denison, Linda Petrini, Jan Leike, Ethan Perez, and Mrinank Sharma. Enhancing model safety through pretraining data filtering. URL https://alignment.anthropic.com/2025/pretraining-data-filtering/. Alignment Science Blog

work page 2025
[17]

Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Yannis Panagakis, Jiankang Deng, and Stefanos Zafeiriou

Grigorios G. Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Yannis Panagakis, Jiankang Deng, and Stefanos Zafeiriou. P-nets: Deep polynomial neural networks. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), June 2020

work page 2020
[18]

Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Jiankang Deng, Yannis Panagakis, and Stefanos Zafeiriou

Grigorios G. Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Jiankang Deng, Yannis Panagakis, and Stefanos Zafeiriou. Deep polynomial neural networks. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 44 0 (8): 0 4021--4034, 2022. doi:10.1109/TPAMI.2021.3058891

work page doi:10.1109/tpami.2021.3058891 2022
[19]

Regularization of polynomial networks for image recognition

Grigorios G Chrysos, Bohan Wang, Jiankang Deng, and Volkan Cevher. Regularization of polynomial networks for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 16123--16132, 2023

work page 2023
[20]

Scalable interpretability via polynomials

Abhimanyu Dubey, Filip Radenovic, and Dhruv Mahajan. Scalable interpretability via polynomials. Adv. Neural Inform. Process. Syst. (NeurIPS), 35: 0 36748--36761, 2022

work page 2022
[21]

The llama 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pp.\ arXiv--2407, 2024

work page 2024
[22]

Not all language model features are one-dimensionally linear

Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=d63a4AM4hb

work page 2025
[23]

Gemma 3 Technical Report

Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram \'e , Morgane Rivi \`e re, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Detecting strategic deception using linear probes

Nicholas Goldowsky-Dill, Bilal Chughtai, Stefan Heimersheim, and Marius Hobbhahn. Detecting strategic deception using linear probes. arXiv preprint arXiv:2502.03407, 2025

work page arXiv 2025
[25]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

PNeRV : A polynomial neural representation for videos

Sonam Gupta, Snehal Singh Tomar, Grigorios G Chrysos, Sukhendu Das, and Ambasamudram Narayanan Rajagopalan. PNeRV : A polynomial neural representation for videos. arXiv preprint arXiv:2406.19299, 2024

work page arXiv 2024
[27]

Phi-3 safety post-training: Aligning language models with a ``break-fix''' cycle

Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, Atabak Ashfaq, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, et al. Phi-3 safety post-training: Aligning language models with a ``break-fix''' cycle. arXiv preprint arXiv:2407.13833, 2024

work page arXiv 2024
[28]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs . In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Adv. Neural Inform. Process. Syst. (NeurIPS), volume...

work page 2024
[29]

Dynamic neural networks: A survey

Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 44 0 (11): 0 7436--7456, 2021

work page 2021
[30]

Designing and interpreting probes with control tasks

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 2733--2743, Hong Kong, Chin...

work page doi:10.18653/v1/d19-1275 2019
[31]

The expression of a tensor or a polyadic as a sum of products

Frank Lauren Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6: 0 164--189, 1927

work page 1927
[32]

Sparse autoencoders find highly interpretable features in language models

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In Int. Conf. Learn. Represent. (ICLR), 2024. URL https://openreview.net/forum?id=F76bwRSLeK

work page 2024
[33]

Best-of-n jailbreaking

John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-n jailbreaking. arXiv preprint arXiv:2412.03556, 2024

work page arXiv 2024
[34]

Llama guard: LLM -based input-output safeguard for human- AI conversations, 2023

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: LLM -based input-output safeguard for human- AI conversations, 2023

work page 2023
[35]

Jayakumar, Wojciech M

Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, and Razvan Pascanu. Multiplicative interactions and where to find them. In Int. Conf. Learn. Represent. (ICLR), 2020. URL https://openreview.net/forum?id=rylnK6VtDH

work page 2020
[36]

Beavertails: Towards improved safety alignment of LLM via a human-preference dataset

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023

work page arXiv 2023
[37]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Adv. Neural Inform. Process. Syst. (NeurIPS), 35: 0 22199--22213, 2022

work page 2022
[38]

Tensor decompositions and applications

Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51 0 (3): 0 455--500, 2009

work page 2009
[39]

From judgment to interference: Early stopping LLM harmful outputs via streaming content monitoring

Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, and Juan Cao. From judgment to interference: Early stopping LLM harmful outputs via streaming content monitoring. arXiv preprint arXiv:2506.09996, 2025

work page arXiv 2025
[40]

Simple probes can catch sleeper agents, 2024

Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duvenaud, Sam Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hubinger. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/probes-catch-sleeper-agents

work page 2024
[41]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[42]

Exponential Machines

Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. arXiv preprint arXiv:1605.03795, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[43]

Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs

Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman. Deep ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs . arXiv preprint arXiv:2508.06601, 2025

work page arXiv 2025
[44]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Adv. Neural Inform. Process. Syst. (NeurIPS), 35: 0 27730--27744, 2022

work page 2022
[45]

The linear representation hypothesis and the geometry of large language models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In Causal Representation Learning Workshop at NeurIPS 2023, 2023. URL https://openreview.net/forum?id=T0PoOJg8cK

work page 2023
[46]

Understanding model calibration - a gentle introduction and visual exploration of calibration and the expected calibration error ( ECE )

Maja Pavlovic. Understanding model calibration - a gentle introduction and visual exploration of calibration and the expected calibration error ( ECE ). In ICLR Blogposts 2025, 2025. URL https://iclr-blogposts.github.io/2025/blog/calibration/. https://iclr-blogposts.github.io/2025/blog/calibration/

work page 2025
[47]

Bilinear MLP s enable weight-based mechanistic interpretability

Michael T Pearce, Thomas Dooms, Alice Rigg, Jose Oramas, and Lee Sharkey. Bilinear MLP s enable weight-based mechanistic interpretability. In Int. Conf. Learn. Represent. (ICLR), 2025

work page 2025
[48]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research, 12: 0 2825--2830, 2011

work page 2011
[49]

P areto probing: T rading off accuracy for complexity

Tiago Pimentel, Naomi Saphra, Adina Williams, and Ryan Cotterell. P areto probing: T rading off accuracy for complexity. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 3138--3153, Online, November 2020 a . Association for Computational Lingu...

work page doi:10.18653/v1/2020.emnlp-main.254 2020
[50]

Information-theoretic probing for linguistic structure

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-theoretic probing for linguistic structure. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 4609--4622, Online, July 2020 b ...

work page doi:10.18653/v1/2020.acl-main.420 2020
[51]

Computationally efficient face detection

Sami Romdhani, Philip Torr, Bernhard Scholkopf, and Andrew Blake. Computationally efficient face detection. In Int. Conf. Comput. Vis. (ICCV), volume 2, pp.\ 695--700. IEEE, 2001

work page 2001
[52]

Understanding learning dynamics of language models with SVCCA

Naomi Saphra and Adam Lopez. Understanding learning dynamics of language models with SVCCA . In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 3257--3267, Minneap...

work page doi:10.18653/v1/n19-1329 2019
[53]

The ‘strong’ feature hypothesis could be wrong, August 2024

Lewis Smith. The ‘strong’ feature hypothesis could be wrong, August 2024. URL https://www.lesswrong.com/posts/tojtPCCRpKLSHBdpn/the-strong-feature-hypothesis-could-be-wrong. LessWrong post

work page 2024
[54]

The generalized weierstrass approximation theorem

Marshall H Stone. The generalized weierstrass approximation theorem. Mathematics Magazine, 21 0 (5): 0 237--254, 1948

work page 1948
[55]

Factual self-awareness in language models: Representation, robustness, and scaling

Hovhannes Tamoyan, Subhabrata Dutta, and Iryna Gurevych. Factual self-awareness in language models: Representation, robustness, and scaling. arXiv preprint arXiv:2505.21399, 2025

work page arXiv 2025
[56]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Branchynet: Fast inference via early exiting from deep neural networks

Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In Int. Conf. Pattern Recog., pp.\ 2464--2469. IEEE, 2016

work page 2016
[58]

Investigating task-specific prompts and sparse autoencoders for activation monitoring, April 2025

Henk Tillman and Dan Mossing. Investigating task-specific prompts and sparse autoencoders for activation monitoring, April 2025

work page 2025
[59]

Q* : Improving multi-step reasoning for LLMs with deliberative planning

Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, and Bo An. Q* : Improving multi-step reasoning for LLMs with deliberative planning. arXiv preprint arXiv:2406.14283, 2024

work page arXiv 2024
[60]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

work page 2022
[61]

Using GPT-4 for content moderation

Lilian Weng, Vik Goel, and Andrea Vallone. Using GPT-4 for content moderation. URL https://openai.com/index/using-gpt-4-for-content-moderation/

work page
[62]

White, Tiago Pimentel, Naomi Saphra, and Ryan Cotterell

Jennifer C. White, Tiago Pimentel, Naomi Saphra, and Ryan Cotterell. A non-linear structural probe. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Compu...

work page doi:10.18653/v1/2021.naacl-main.12 2021
[63]

From hard refusals to safe-completions: Toward output-centric safety training

Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, and Saachi Jain. From hard refusals to safe-completions: Toward output-centric safety training. arXiv preprint arXiv:2508.09224, 2025

work page arXiv 2025
[64]

ShieldGemma 2: Robust and tractable image content moderation, 2025

Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, Aparna Joshi, Shravan Dheep, Mani Malek, Hamid Palangi, Joon Baek, Rick Pereira, and Karthik Narasimhan. ShieldGemma 2: Robust and tractable image content moderation, 2025. URL https://arxiv.org/abs/2504.01081

work page arXiv 2025