Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Aly M. Kassem; Buddhika Laknath Semage; Golnoosh Farnadi; Negar Rostamzadeh; Omar Mahmoud; Santu Rana; Thommen George Karimpanal

arxiv: 2606.07963 · v1 · pith:UJJKWNSGnew · submitted 2026-06-06 · 💻 cs.AI · cs.CL

Omar Mahmoud, Aly M. Kassem, Thommen George Karimpanal, Buddhika Laknath Semage, Negar Rostamzadeh, Golnoosh Farnadi, Santu Rana

Pith reviewed 2026-06-27 20:03 UTC · model grok-4.3

keywords: backdoor attacks · large language models · sparse autoencoders · activation steering · latent features · unified mitigation · jailbreak · refusal manipulation

The pith

Shared latent features in LLMs enable unified detection and mitigation of diverse backdoor attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that different backdoor attacks in large language models activate a common set of latent features in the model's internal activations. These features, found by applying sparse autoencoders to residual-stream activations, appear consistently across attack types such as jailbreaking and bias induction, and across model families. Steering the features in both directions changes the backdoor behavior, which the authors take as evidence that the features are causal. This shared mechanism supports general-purpose classifiers and a new training method that detect and prevent backdoors without targeting each one separately. A sympathetic reader would care because it shifts defense from per-attack fixes to a single underlying latent structure.

Core claim

Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across models and attack methods. Bidirectional activation steering shows they are causal, and SAE-feature classifiers generalize zero-shot, while Concept Ablation Fine-Tuning suppresses backdoor formation.

What carries the argument

Sparse autoencoder-identified latent features in the residual stream that are shared across backdoor attacks and can be causally manipulated via activation steering.
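One way to picture the feature-identification step is the sketch below. It is a minimal illustration under stated assumptions, not the authors' procedure: the selection rule (top-k mean activation shift, intersected across attacks), the identity encoder, and the planted latent indices are all hypothetical, since this page does not describe how the paper selects features.

```python
import numpy as np

def sae_encode(acts, W_enc, b_enc):
    """Map residual-stream activations (n, d_model) to SAE latents (n, n_latents)."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def top_shifted_latents(clean, triggered, W_enc, b_enc, k):
    """Indices of the k latents whose mean activation rises most under the trigger."""
    shift = (sae_encode(triggered, W_enc, b_enc).mean(axis=0)
             - sae_encode(clean, W_enc, b_enc).mean(axis=0))
    return set(np.argsort(shift)[-k:].tolist())

def shared_latents(attacks, W_enc, b_enc, k):
    """Intersect the top-k shifted latents over all (clean, triggered) pairs."""
    return set.intersection(*(top_shifted_latents(c, t, W_enc, b_enc, k)
                              for c, t in attacks))

# Toy data (assumption): identity encoder, so latent i reads activation dimension i.
rng = np.random.default_rng(0)
d = 32
W_enc, b_enc = np.eye(d), np.zeros(d)
attacks = []
for a in range(3):
    clean = 0.1 * rng.normal(size=(200, d))
    triggered = clean.copy()
    triggered[:, 7] += 3.0        # planted latent shared by every attack
    triggered[:, 10 + a] += 2.0   # planted latent specific to attack a
    attacks.append((clean, triggered))

shared = shared_latents(attacks, W_enc, b_enc, k=2)
print(shared)
```

In this toy setup only the planted shared latent survives the intersection, while each attack-specific latent drops out. With a real SAE the shifts are noisy, so a thresholded or rank-aggregated criterion would probably be more robust than a strict intersection.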

If this is right

SAE-feature classifiers can detect unseen backdoors in a zero-shot manner and outperform baselines.
Suppressing the shared features reduces the success of various attacks.
Amplifying the features can induce backdoor-like behaviors on clean inputs.
Concept Ablation Fine-Tuning prevents backdoor formation during training by targeting the latent subspace.
The mechanism generalizes across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters.
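The suppression, amplification, and ablation claims above reduce to simple linear operations on activations, sketched here. This is an illustration only: the function names are hypothetical, the directions are random stand-ins for SAE feature directions, and in a CAFT-style setup the projection would presumably be applied inside the forward pass during fine-tuning, which this sketch does not model.

```python
import numpy as np

def steer(h, direction, alpha):
    """Shift activations along a unit feature direction.

    alpha < 0 suppresses the feature, alpha > 0 amplifies it.
    """
    return h + alpha * direction / np.linalg.norm(direction)

def ablate_subspace(h, directions):
    """Remove the component of h that lies in span(directions).

    Rows of `directions` are assumed to be linearly independent.
    """
    Q, _ = np.linalg.qr(np.asarray(directions).T)   # orthonormal basis, (d_model, m)
    return h - (h @ Q) @ Q.T

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))        # 4 token activations, d_model = 8
dirs = rng.normal(size=(2, 8))     # 2 hypothetical shared feature directions

h_up = steer(h, dirs[0], alpha=2.0)       # amplification
h_down = steer(h, dirs[0], alpha=-2.0)    # suppression
h_ablated = ablate_subspace(h, dirs)      # projection onto the orthogonal complement
```

Orthonormalizing with QR before projecting matters when the feature directions are not mutually orthogonal: subtracting each projection separately would leave a residual component in the subspace.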

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Backdoor attacks may be exploiting a general way models represent harmful or conditional behaviors rather than isolated triggers.
This approach could extend to detecting other forms of model misalignment or adversarial prompts.
Future work might test whether ablating these features affects model performance on legitimate tasks.
Similar latent structures might exist for other security issues like prompt injection.

Load-bearing premise

The identified SAE features are the causal drivers of the backdoor behaviors rather than just correlated side effects.

What would settle it

A new backdoor attack that does not activate the identified latent features or one where steering those features fails to control the attack success rate.

Figures

Figures reproduced from arXiv: 2606.07963 by Aly M. Kassem, Buddhika Laknath Semage, Golnoosh Farnadi, Negar Rostamzadeh, Omar Mahmoud, Santu Rana, Thommen George Karimpanal.

**Figure 2.** Illustrative examples of the six evaluated backdoor behaviors.

**Figure 3.** Comparison between standard SFT and CAFT across backdoor behaviors and model families. Values report attack success rate (ASR) under triggered evaluation. Lower ASR after CAFT indicates more effective mitigation.

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.
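The zero-shot detection claim can be illustrated with a toy experiment: fit a lightweight classifier on SAE features from some attacks and evaluate it on an attack it never saw. Everything below is an assumption made for illustration (synthetic features, the latent indices, a plain NumPy logistic regression); the abstract does not specify the classifier or the feature pipeline.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=500, l2=1e-3):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * (X.T @ err / len(y) + l2 * w)
        b -= lr * err.mean()
    return w, b

def predict(X, w, b):
    """Label 1 (triggered) where the logit is positive, else 0 (clean)."""
    return (X @ w + b > 0).astype(int)

def make_attack(attack_id, rng, n=200, n_latents=20):
    """Toy SAE features: latent 3 is shared, latent 10 + attack_id is attack-specific."""
    clean = 0.05 * np.abs(rng.normal(size=(n, n_latents)))
    trig = 0.05 * np.abs(rng.normal(size=(n, n_latents)))
    trig[:, 3] += 2.0
    trig[:, 10 + attack_id] += 1.0
    X = np.vstack([clean, trig])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

rng = np.random.default_rng(0)
X0, y0 = make_attack(0, rng)
X1, y1 = make_attack(1, rng)
X_new, y_new = make_attack(2, rng)   # held-out attack, never seen in training

w, b = fit_logreg(np.vstack([X0, X1]), np.concatenate([y0, y1]))
heldout_acc = (predict(X_new, w, b) == y_new).mean()
print(f"held-out accuracy: {heldout_acc:.2f}")
```

The held-out attack is detected only because the classifier puts most of its weight on the shared latent. If backdoors did not share features, the same setup would fall to chance on the unseen attack, which is what makes the zero-shot result informative about the shared-mechanism claim.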

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Shared SAE features across backdoors are the real contribution here, but the steering experiments need tighter side-effect controls to support the causal claims.

The paper's core observation is that a small set of SAE latents activates consistently across jailbreaks, refusal manipulation, bias induction, and other backdoors, and that this holds across Qwen, Gemma, and Llama models of different sizes and both fine-tuning and editing attack methods. They turn that into zero-shot classifiers that beat residual and weight-diff baselines, plus a training method called CAFT that ablates the subspace to reduce backdoor formation.

That breadth is the part that stands out. Most backdoor work stays narrow on one trigger or one model; here the shared-feature angle lets them claim a more unified detection and mitigation path. The CAFT result is the most practically useful piece if the numbers hold.

The bidirectional steering is where the argument is thinnest. The abstract treats suppression and amplification as direct causal evidence, but without reported checks on random or orthogonal directions, or measurements of how much normal refusal and capability behaviors shift, it's possible the effects come from correlated directions rather than the targeted latents. If the full paper has those controls or multi-feature ablations, the causal story strengthens; otherwise it stays correlational.

The work is aimed at people working on LLM security and interpretability who want something more general than per-attack patches. It deserves a serious referee because the claims are concrete and falsifiable, even if the steering section needs more scrutiny on specificity.

Referee Report

1 major / 1 minor

Summary. The paper claims that diverse backdoor attacks (jailbreaking, refusal manipulation, password-locking, bias induction, etc.) in LLMs share a small set of latent features in residual-stream activations that can be identified via sparse autoencoders (SAEs). These features generalize across Qwen3, Gemma 3, and Llama 3.1 models (4B–32B) and both fine-tuning and weight-editing attacks. Bidirectional activation steering is used to establish causality (suppression reduces attack success; amplification induces behaviors on clean prompts). Lightweight SAE-feature classifiers enable zero-shot detection outperforming baselines, and Concept Ablation Fine-Tuning (CAFT) is introduced to suppress backdoor formation by ablating the shared subspace during training.

Significance. If the central claims hold, the work would be significant for shifting backdoor research from per-attack defenses to a unified latent-mechanism approach. Strengths include the reported cross-model and cross-attack generalization and the introduction of CAFT as a concrete mitigation technique. The SAE-based causal intervention provides a falsifiable prediction about shared structure that could be tested in follow-up work.

major comments (1)

[Abstract] Bidirectional activation steering paragraph: The claim that the identified SAE features are the causal drivers (rather than correlated side effects) rests on the bidirectional steering results. No controls are described for non-specificity, such as steering random or orthogonal directions, measuring side-effect magnitude on non-backdoor tasks (e.g., general capability or refusal), or multi-feature ablation. This is load-bearing for the shared-mechanism claim because residual-stream directions are often polysemantic, so off-target effects of steering could explain the observed changes.

minor comments (1)

[Abstract] The abstract is information-dense; expanding the methods paragraph to briefly define the SAE training objective and the exact steering implementation would improve readability without lengthening the paper substantially.
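For orientation, a common textbook form of the SAE objective and of single-feature steering is written out below. This is a generic formulation, not taken from the paper: the symbols $W_{\mathrm{enc}}$, $W_{\mathrm{dec}}$, $b_{\mathrm{enc}}$, $b_{\mathrm{dec}}$, $\lambda$, and $\alpha$ are assumptions, and the authors may use a different variant (for example, top-k style SAEs enforce sparsity by selection rather than an $\ell_1$ penalty).

```latex
\begin{aligned}
z &= \mathrm{ReLU}\big(W_{\mathrm{enc}}\,(x - b_{\mathrm{dec}}) + b_{\mathrm{enc}}\big), \\
\hat{x} &= W_{\mathrm{dec}}\, z + b_{\mathrm{dec}}, \\
\mathcal{L}(x) &= \lVert x - \hat{x} \rVert_2^2 + \lambda\, \lVert z \rVert_1, \\
x' &= x + \alpha\, \frac{d_j}{\lVert d_j \rVert_2}, \qquad d_j = W_{\mathrm{dec}}[:, j].
\end{aligned}
```

Here $x$ is a residual-stream activation, $z$ its sparse latent code, and $d_j$ the decoder direction of latent $j$. Under this reading, $\alpha > 0$ amplifies a feature and $\alpha < 0$ suppresses it.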

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The concern about controls for the bidirectional steering experiments is well-taken and directly addresses the strength of our causal claims. We respond point-by-point below and will revise the manuscript to incorporate additional controls.

Referee: [Abstract] Bidirectional activation steering paragraph: The claim that the identified SAE features are the causal drivers (rather than correlated side effects) rests on the bidirectional steering results. No controls are described for non-specificity, such as steering random or orthogonal directions, measuring side-effect magnitude on non-backdoor tasks (e.g., general capability or refusal), or multi-feature ablation. This is load-bearing for the shared-mechanism claim because residual-stream directions are often polysemantic, so off-target effects of steering could explain the observed changes.

Authors: We agree that the absence of explicit controls for steering specificity weakens the causal interpretation. The current manuscript reports only the primary bidirectional effects without random/orthogonal baselines, side-effect quantification on non-backdoor tasks, or multi-feature ablation. To address this, the revised version will add: (1) steering of randomly sampled and orthogonal SAE features as negative controls, (2) measurement of side effects on general capabilities (e.g., MMLU, GSM8K) and refusal rates, and (3) where computationally feasible, multi-feature ablation results. These will be reported in a new subsection of the results and methods. We believe these additions will substantially strengthen the shared-mechanism claim without altering the core findings.

revision: yes
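The negative control the referee asks for can be sketched as follows: steer along the target direction and along random directions orthogonal to it, at the same magnitude, and compare the change in a behavior metric. This is a toy illustration under stated assumptions. The metric is a synthetic stand-in for attack success rate, the readout vector is invented, and none of the function names come from the paper.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def random_orthogonal_directions(target, n, rng):
    """Sample n random unit vectors with their component along `target` removed."""
    t = unit(target)
    R = rng.normal(size=(n, t.size))
    R -= np.outer(R @ t, t)
    return R / np.linalg.norm(R, axis=1, keepdims=True)

def steering_effect(h, direction, alpha, metric):
    """Change in a scalar behavior metric after adding alpha * unit(direction) to h."""
    return metric(h + alpha * unit(direction)) - metric(h)

# Toy stand-in for attack success rate (assumption): mean sigmoid of a linear
# readout that is mostly, but not perfectly, aligned with the target direction.
rng = np.random.default_rng(0)
d = 64
target = rng.normal(size=d)
readout = unit(target) + 0.2 * unit(rng.normal(size=d))
metric = lambda acts: float(np.mean(1.0 / (1.0 + np.exp(-(acts @ readout)))))

h = rng.normal(size=(500, d))
alpha = 4.0
target_effect = steering_effect(h, target, alpha, metric)
control_effects = [steering_effect(h, r, alpha, metric)
                   for r in random_orthogonal_directions(target, 20, rng)]
max_control = max(abs(c) for c in control_effects)
print(f"target: {target_effect:+.3f}  max |control|: {max_control:.3f}")
```

A specific causal feature should produce a target effect well outside the spread of the control effects. If the controls move the metric by a comparable amount, the steering result says more about perturbation magnitude than about the feature itself.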

Circularity Check

0 steps flagged

No circularity: empirical SAE feature identification and steering experiments are self-contained

The paper's derivation chain consists of applying SAEs to residual-stream activations to identify shared features across backdoor types, validating causality via bidirectional steering, training zero-shot classifiers, and proposing CAFT ablation. These are direct empirical procedures with no self-definitional reductions, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems that collapse the central claims back to author-defined inputs. Generalization across models and attacks is tested externally rather than forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the shared latent mechanism is presented as an observed empirical finding rather than a postulated entity.
