pith. sign in

arxiv: 2606.29171 · v1 · pith:XBDZJTVPnew · submitted 2026-06-28 · 💻 cs.LG · cs.AI· cs.CL

Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

Pith reviewed 2026-06-30 08:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords symbolic mechanistic data attributionmechanistic interpretabilitydata attributionsparse autoencodersrefusal behaviorsupervised fine-tuningbehavioral policyridge regression
0
0 comments X

The pith

SMDA attributes individual training examples to the symbolic policies that govern a model's high-level behaviors by regressing over sparse autoencoder features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Symbolic Mechanistic Data Attribution to connect training data directly to interpretable behavioral policies rather than opaque circuits or black-box influence scores. It does so by fitting a closed-form ridge regression on sparse autoencoder features that represent a target behavior, then decomposing each supervised fine-tuning example's effect through separate activation-shift and output-probability-shift pathways. Applied to refusal behavior in Llama-3.2-3B-Instruct, the method surfaces systematic safety gaps, shows why certain pairs move features differently, and flags training examples whose main influence lands on unintended features. A reader would care because the approach offers a middle ground between scalable but coarse data attribution and precise but manual circuit analysis.

Core claim

SMDA models a target behavior as a ridge regression over sparse autoencoder features, then analytically decomposes each training pair's contribution into Delta_X (feature activation change) and Delta_Y (output probability change) pathways; when applied to 200 SFT pairs for refusal in Llama-3.2-3B-Instruct, the resulting symbolic policy coefficients reveal gaps such as religious stereotyping, per-feature decompositions distinguish harmful from harmless pairs, and cross-feature interference appears in many individual examples.

What carries the argument

The SMDA framework, which fits a closed-form ridge regression over sparse autoencoder features to represent a target behavior and decomposes each training example's policy shift through separate Delta_X activation and Delta_Y probability pathways.

If this is right

  • Coefficients of the fitted symbolic policy can identify categories where the base model has systematic safety gaps, such as religious stereotyping.
  • Per-feature Delta_X/Delta_Y decomposition can explain why harmful versus harmless pairs produce qualitatively different effects on the same features.
  • Many individual training pairs exhibit cross-feature interference, so their largest influence lands on features other than the intended policy component.
  • The same decomposition can flag training examples whose net effect on the symbolic policy is unintended or counterproductive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be used to rank or filter future SFT data by predicted policy impact before training begins.
  • If the symbolic policy generalizes across model scales, the same regression could serve as a lightweight monitor for behavioral drift during continued pretraining.
  • Extending the decomposition to multi-turn conversations might reveal how early pairs shape later refusal thresholds in dialogue settings.

Load-bearing premise

The ridge regression fit over sparse autoencoder features produces coefficients and Delta pathways that faithfully represent the high-level behavioral decisions the model has actually learned.

What would settle it

Retraining the model on a subset of pairs selected by SMDA's dominant Delta_X/Delta_Y pathways and finding that the observed change in refusal behavior does not match the magnitude or direction predicted by the symbolic policy coefficients.

Figures

Figures reproduced from arXiv: 2606.29171 by Darian Lee, Magy Seif El-Nasr, Reza Habibi.

Figure 1
Figure 1. Figure 1: Influence analysis for feat_2286 (religious and ethnic group characterization requests) in [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prompt template used to label SAE features via GPT-5.4-mini. Contrastive examples are the [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
read the original abstract

While existing data attribution methods can identify which training examples build specific mechanistic circuits, they cannot explain how training data shapes the high-level behavioral decisions a model learns to make. To bridge this gap, we introduce Symbolic Mechanistic Data Attribution (SMDA), a framework that attributes training pairs to the interpretable symbolic policies governing model behavior. SMDA fits a closed-form Ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each supervised fine-tuning example shifts that policy through feature-activation Delta_X and output-probability Delta_Y pathways. We distill a symbolic policy for refusal behavior in Llama-3.2-3B-Instruct and analyze 200 SFT training pairs. Our analysis reveals that (1) the symbolic policy's coefficients expose systematic gaps in the base model's safety behavior for categories like religious stereotyping; (2) per-feature Delta_X/Delta_Y decomposition can mechanistically explain why harmful and harmless pairs exert qualitatively different influences on certain features; and (3) individual training pairs routinely exhibit cross-feature interference, allowing SMDA to identify training pairs whose dominant effect falls on unintended features. These results demonstrate that combining mechanistic interpretability with data attribution yields a diagnostic tool that is both more fine-grained than black-box influence functions and more scalable than manual circuit analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Symbolic Mechanistic Data Attribution (SMDA), which fits a closed-form ridge regression over SAE features to distill an interpretable symbolic policy for a target behavior such as refusal in Llama-3.2-3B-Instruct, then analytically decomposes the influence of each SFT training pair on that policy through feature-activation shifts (Delta_X) and output-probability changes (Delta_Y). The framework is applied to 200 training pairs, yielding three qualitative findings: the policy coefficients reveal safety gaps (e.g., religious stereotyping), harmful versus harmless pairs exert different per-feature influences, and individual pairs show cross-feature interference that can be diagnosed by SMDA.

Significance. If the linear ridge-regression policy is shown to faithfully approximate the model's nonlinear refusal decisions, SMDA would supply a scalable bridge between data attribution and mechanistic interpretability, offering finer-grained, feature-level explanations than black-box influence functions while remaining more automated than manual circuit analysis. The closed-form nature of the Delta_X/Delta_Y decomposition is a technical strength that could enable reproducible attribution once the fidelity assumption is validated.

major comments (3)
  1. [Abstract and §3 (framework)] Abstract and framework introduction (refusal policy distillation): the central claim that the ridge-regression coefficients and Delta_X/Delta_Y pathways 'faithfully represent the high-level behavioral decisions' the model has learned is load-bearing for all three reported findings, yet the manuscript supplies no quantitative validation such as held-out predictive accuracy of the linear policy on refusal prompts, ablation on the regularization parameter, or causal intervention results on the identified SAE features.
  2. [Method (decomposition)] Method for Delta_X/Delta_Y decomposition: because the decomposition is obtained analytically from the same ridge-regression parameters used to define the symbolic policy, it is not obvious that the reported influences constitute independent mechanistic evidence rather than a direct consequence of the linear fit; this circularity must be addressed by showing that the pathways predict actual model behavior changes under feature edits.
  3. [Results (§5)] Results on 200 SFT pairs: the three qualitative observations (safety gaps, differential influences, cross-feature interference) rest on the unverified linear approximation; without baselines (e.g., comparison of SMDA attributions against standard influence functions or random feature perturbations) or error bars on the per-feature effects, it is unclear whether the patterns are distinctive or artifacts of the ridge fit.
minor comments (2)
  1. [Method] Notation for Delta_X and Delta_Y should be defined with explicit equations in the main text rather than relying solely on the abstract description.
  2. [Experimental setup] The choice of SAE features and the specific refusal-related features analyzed should be listed or tabulated for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify the need for stronger empirical validation of the linear policy approximation, which we will address through targeted additions in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3 (framework)] Abstract and framework introduction (refusal policy distillation): the central claim that the ridge-regression coefficients and Delta_X/Delta_Y pathways 'faithfully represent the high-level behavioral decisions' the model has learned is load-bearing for all three reported findings, yet the manuscript supplies no quantitative validation such as held-out predictive accuracy of the linear policy on refusal prompts, ablation on the regularization parameter, or causal intervention results on the identified SAE features.

    Authors: We agree that quantitative validation is necessary to support the central claims. In the revision we will add held-out predictive accuracy of the ridge regression policy on refusal prompts and ablations over the regularization parameter. Preliminary feature-edit experiments will also be included to provide initial causal support where computationally feasible. revision: yes

  2. Referee: [Method (decomposition)] Method for Delta_X/Delta_Y decomposition: because the decomposition is obtained analytically from the same ridge-regression parameters used to define the symbolic policy, it is not obvious that the reported influences constitute independent mechanistic evidence rather than a direct consequence of the linear fit; this circularity must be addressed by showing that the pathways predict actual model behavior changes under feature edits.

    Authors: The analytical form is a deliberate design choice for reproducibility, yet we acknowledge the circularity concern. The revised manuscript will include validation experiments that intervene on the relevant SAE features and compare observed output-probability shifts against the predicted Delta_Y values. revision: yes

  3. Referee: [Results (§5)] Results on 200 SFT pairs: the three qualitative observations (safety gaps, differential influences, cross-feature interference) rest on the unverified linear approximation; without baselines (e.g., comparison of SMDA attributions against standard influence functions or random feature perturbations) or error bars on the per-feature effects, it is unclear whether the patterns are distinctive or artifacts of the ridge fit.

    Authors: We will augment the results section with direct comparisons of SMDA attributions against standard influence functions and random feature perturbations, and will report error bars (via bootstrapping) on the per-feature effects to quantify variability. revision: yes

Circularity Check

1 steps flagged

Ridge regression defines symbolic policy; Delta_X/Delta_Y attributions are analytic from the same fit

specific steps
  1. fitted input called prediction [Abstract]
    "SMDA fits a closed-form Ridge regression over sparse autoencoder (SAE) features to model a target behavior, then analytically decomposes how each supervised fine-tuning example shifts that policy through feature-activation Delta_X and output-probability Delta_Y pathways."

    The symbolic policy is defined by the ridge regression coefficients; the Delta_X/Delta_Y decomposition and all subsequent attributions of training-pair influence are closed-form functions of those same coefficients. Thus the reported influences on the policy are equivalent in content to the fitted parameters used to construct the policy model.

full rationale

The paper introduces SMDA by fitting ridge regression to SAE features to define the symbolic policy for a target behavior, then analytically decomposes training-pair influences via Delta_X/Delta_Y pathways. This decomposition is a direct algebraic consequence of the fitted coefficients, so the reported influences and policy insights (e.g., gaps in safety behavior, cross-feature interference) are downstream of and contain no independent information beyond the ridge fit itself. This matches the fitted_input_called_prediction pattern. No self-citation load-bearing steps or other enumerated circularity patterns are evident from the provided text. The central diagnostic claims therefore rest on the unverified assumption that the linear fit faithfully captures the model's nonlinear behavioral policy, but the derivation chain itself reduces to the fit by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The framework depends on SAE features capturing policy-relevant mechanisms and on the ridge regression providing a faithful linear model of the target behavior; no independent evidence for either is supplied in the abstract.

free parameters (1)
  • Ridge regression regularization parameter
    Hyperparameter controlling the closed-form fit that defines the symbolic policy and subsequent decompositions.
axioms (1)
  • domain assumption Sparse autoencoder features correspond to interpretable mechanistic components that govern high-level behavioral policies.
    Invoked when the method links feature activations to symbolic policies and when refusal behavior is analyzed.

pith-pipeline@v0.9.1-grok · 5774 in / 1340 out tokens · 28951 ms · 2026-06-30T08:00:04.294611+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 11 canonical work pages · 4 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of

    Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri , booktitle=. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of

  3. [3]

    Publications Manual , year = "1983", publisher =

  4. [4]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  5. [5]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  6. [6]

    Dan Gusfield , title =. 1997

  7. [7]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  8. [8]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  9. [9]

    Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

    Han, Seungju and Rao, Kavel and Ettinger, Allyson and Jiang, Liwei and Lin, Bill Yuchen and Lambert, Nathan and Choi, Yejin and Dziri, Nouha , title =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

  10. [10]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. arXiv preprint arXiv:2307.15043 , year=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    Refusal in Language Models Is Mediated by a Single Direction , author=. Advances in Neural Information Processing Systems , volume=

  12. [12]

    2022 , eprint=

    In-context Learning and Induction Heads , author=. 2022 , eprint=

  13. [13]

    2023 , eprint=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

  14. [14]

    2026 , eprint=

    Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution , author=. 2026 , eprint=

  15. [15]

    arXiv preprint arXiv:2502.09100 , year=

    Logical reasoning in large language models: A survey , author=. arXiv preprint arXiv:2502.09100 , year=

  16. [17]

    Do-Not-Answer: A Dataset for Evaluating Safeguards in

    Wang, Yuxia and Li, Haonan and Han, Xudong and Nakov, Preslav and Baldwin, Timothy , journal=. Do-Not-Answer: A Dataset for Evaluating Safeguards in

  17. [18]

    International Conference on Learning Representations , year=

    Finetuned Language Models Are Zero-Shot Learners , author=. International Conference on Learning Representations , year=

  18. [19]

    Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned

    Betley, Jan and Tan, Daniel Chee Hian and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned. Proceedings of the 42nd International Conference on Machine Learning , pages=

  19. [20]

    Safety-Tuned

    Federico Bianchi and Mirac Suzgun and Giuseppe Attanasio and Paul Rottger and Dan Jurafsky and Tatsunori Hashimoto and James Zou , booktitle=. Safety-Tuned. 2024 , url=

  20. [21]

    Proceedings of the National Academy of Sciences , volume=

    Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the National Academy of Sciences , volume=

  21. [22]

    Mechanistic Data Attribution: Tracing the Training Origins of Interpretable

    Chen, Jianhui and others , journal=. Mechanistic Data Attribution: Tracing the Training Origins of Interpretable

  22. [23]

    2022 , howpublished=

    Muennighoff, Niklas , title=. 2022 , howpublished=

  23. [24]

    2023 , url=

    Conover, Mike and Hayes, Matt and Mathur, Ankit and Xie, Jianwei and Wan, Jun and Shah, Sam and Ghodsi, Ali and Wendell, Patrick and Zaharia, Matei and Xin, Reynold , title=. 2023 , url=

  24. [25]

    2024 , publisher=

    Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle=. 2024 , publisher=

  25. [26]

    Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha , booktitle=

  26. [27]

    Bloom, Joseph and Tigges, Curt and Duong, Anthony and Chanin, David , year=

  27. [28]

    Hashimoto , title=

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title=. GitHub repository , howpublished=. 2023 , publisher=

  28. [29]

    Mazeika, Mantas and Zou, Andy and Mu, Norman and Phan, Long and Wang, Zifan and Yu, Chunru and Khoja, Adam and Jiang, Fengqing and O'Gara, Aidan and Sakhaee, Ellie and Xiang, Zhen and Rajabi, Arezoo and Hendrycks, Dan and Poovendran, Radha and Li, Bo and Forsyth, David , booktitle=

  29. [30]

    International Conference on Learning Representations , year=

    Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation , author=. International Conference on Learning Representations , year=

  30. [32]

    International conference on machine learning , pages=

    Understanding black-box predictions via influence functions , author=. International conference on machine learning , pages=. 2017 , organization=

  31. [33]

    Advances in neural information processing systems , volume=

    Data distributional properties drive emergent in-context learning in transformers , author=. Advances in neural information processing systems , volume=

  32. [34]

    Findings of the Association for Computational Linguistics: EMNLP 2022 , pages=

    Towards tracing knowledge in language models back to the training data , author=. Findings of the Association for Computational Linguistics: EMNLP 2022 , pages=

  33. [35]

    2025 , eprint=

    Integrated Influence: Data Attribution with Baseline , author=. 2025 , eprint=

  34. [36]

    2023 , eprint=

    Studying Large Language Model Generalization with Influence Functions , author=. 2023 , eprint=

  35. [38]

    2026 , eprint=

    Beyond Accuracy: Introducing a Symbolic-Mechanistic Approach to Interpretable Evaluation , author=. 2026 , eprint=

  36. [39]

    2025 , eprint=

    Circuit Distillation , author=. 2025 , eprint=

  37. [40]

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems, volume 38

  38. [41]

    Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Mart \' n Soto, Nathan Labenz, and Owain Evans. 2025. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLM s. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 4043--4...

  39. [42]

    Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. 2024. https://openreview.net/forum?id=gT5hALch9z Safety-tuned LL a MA s: Lessons from improving the safety of large language models that follow instructions . In The Twelfth International Conference on Learning Representations

  40. [43]

    Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. 2024. SAELens . https://github.com/jbloomAus/SAELens

  41. [44]

    Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. 2022. Data distributional properties drive emergent in-context learning in transformers. Advances in neural information processing systems, 35:18878--18891

  42. [45]

    Jianhui Chen and 1 others. 2026. Mechanistic data attribution: Tracing the training origins of interpretable LLM units. arXiv preprint arXiv:2601.21996

  43. [46]

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm Free dolly: Introducing the world's first truly open instruction-tuned LLM

  44. [47]

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. https://arxiv.org/abs/2309.08600 Sparse autoencoders find highly interpretable features in language models . Preprint, arXiv:2309.08600

  45. [48]

    Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, and 1 others. 2023. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296

  46. [49]

    Reza Habibi, Darian Lee, and Magy Seif El-Nasr. 2026. https://arxiv.org/abs/2603.23517 Beyond accuracy: Introducing a symbolic-mechanistic approach to interpretable evaluation . Preprint, arXiv:2603.23517

  47. [50]

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLM s. In Advances in Neural Information Processing Systems, volume 37

  48. [51]

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. 2024. Catastrophic jailbreak of open-source llms via exploiting generation. In International Conference on Learning Representations

  49. [52]

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. 2024. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems

  50. [53]

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, and 1 others. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521--3526

  51. [54]

    Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885--1894. PMLR

  52. [55]

    Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, and Kellin Pelrine. 2026. https://arxiv.org/abs/2602.14869 Concept influence: Leveraging interpretability to improve performance and efficiency in training data attribution . Preprint, arXiv:2602.14869

  53. [56]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, pages 35181--35224. PMLR

  54. [57]

    Mantas Mazeika, Andy Zou, Norman Mu, Long Phan, Zifan Wang, Chunru Yu, Adam Khoja, Fengqing Jiang, Aidan O'Gara, Ellie Sakhaee, Zhen Xiang, Arezoo Rajabi, Dan Hendrycks, Radha Poovendran, Bo Li, and David Forsyth. 2023. TDC 2023 ( LLM edition): The trojan detection challenge. In NeurIPS Competition Track

  55. [58]

    Niklas Muennighoff. 2022. FLAN dataset. https://huggingface.co/datasets/Muennighoff/flan

  56. [59]

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, and 7 others. 2022. https://arxiv.org/abs/2209.11895 In-context learning and induct...

  57. [60]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  58. [61]

    Somin Wadhwa, Silvio Amir, and Byron C. Wallace. 2025. https://arxiv.org/abs/2509.25002 Circuit distillation . Preprint, arXiv:2509.25002

  59. [62]

    Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. 2023. Do-not-answer: A dataset for evaluating safeguards in LLM s. arXiv preprint arXiv:2308.13387

  60. [63]

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations

  61. [64]

    Zico Kolter, and Matt Fredrikson

    Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models